TUM AI Lecture Series 2022:

Table of Contents

  1. Generation & GANs
  2. Autonomous Driving
  3. Image-based Rendering
  4. Self-Supervised Learning
  5. SLAM & Robotics
  6. Language
  7. AR/VR/MR
  8. 3D Objects
  9. Others

1. Generation & GANs

New Generative Models for Images, Landscape Videos and 3D Human Avatars (Victor Lempitsky) 2021/02.

  • StyleGAN for Landscape Videos: DeepLandscape.
    • network feature : duplicated latents, with two upsampling structures (one small, one large).
    • discriminator : unary (uses the smaller one) & pairwise (uses both); warps noise maps by homography transformations.
  • StyleGAN for 3D Human Avatars. SMPL-X

Controllable Content Generation without Direct Supervision (Niloy Mitra) 2020/12, Smart Geometry Processing Group. Adobe.

2. Autonomous Driving

Here for my paper read.

A Future With Self-Driving Vehicles (Raquel Urtasun) 2021/02.

Autonomy:

  • we want a system : trainable end-to-end & interpretable for validation.
    • End-to-end Approaches. Direct, but not interpretable.
    • Autonomy Stack.
      • HD Maps /Sensors -> Perception -> Prediction -> Planning -> Control.
      • Interpretable, but very poor productivity (each stage engineered separately).
  • Joint Perception + Prediction :
  • Joint Perception + Prediction + Planning : Uber ATG Vision: Interpretable Neural Motion Planner
    1. Neural Motion Planner 2019: adds a branch from the network as a planner -> time & ego-car position.
    2. DSDNet 2020. (1) multi-modal socially-consistent uncertainty; (2) explicitly condition on prediction; (3) use prior (human) knowledge.
    3. P3: Safe Motion Planning Through Interpretable Semantic Representations. Recurrent semantic occupancy map -> to avoid occupied regions.
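The cost-based planning idea above can be illustrated with a minimal sketch: a network (not shown here) outputs a spatio-temporal cost volume, and planning reduces to scoring sampled ego trajectories against it and keeping the cheapest one. The grid layout, the `plan` helper, and the default cost for unseen cells are illustrative assumptions, not the actual Neural Motion Planner implementation.

```python
def plan(cost_map, candidate_trajectories):
    """Pick the trajectory with the lowest summed cost.

    cost_map: dict mapping (x, y, t) grid cells to learned costs
              (toy values here; a network would produce these).
    candidate_trajectories: list of [(x, y, t), ...] waypoint lists.
    Cells absent from the map get a default high cost (assumption).
    """
    def traj_cost(traj):
        return sum(cost_map.get(cell, 10.0) for cell in traj)
    return min(candidate_trajectories, key=traj_cost)

# Toy example: two straight candidates; the right lane is cheap.
cost_map = {(0, 1, 0): 0.1, (1, 1, 1): 0.1, (2, 1, 2): 0.1,
            (0, 0, 0): 5.0, (1, 0, 1): 5.0, (2, 0, 2): 5.0}
left = [(0, 0, 0), (1, 0, 1), (2, 0, 2)]
right = [(0, 1, 0), (1, 1, 1), (2, 1, 2)]
best = plan(cost_map, [left, right])  # the cheap right-lane trajectory
```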

Simulation: Structured Testing, Real World Replay, Sensor Simulation.

  • Lidar simulation : TrafficSim 2021, uses real-world data (real 3D assets) to generate perception & prediction data.
  • Camera (multi-camera video) simulation : GeoSim 2021, uses real-world data to generate (through a multi-view multi-sensor reconstruction network).

3. Image-based Rendering

Neural Implicit Representations for 3D Vision (Andreas Geiger) 2020/09. cvpr talk pdf.

  • 3d representations:
    • Direct representation : voxels, points, meshes.
    • Implicit representation : decision boundary of a non-linear classifier.
  • Occupancy Network : $L(\theta, \phi) = \sum_{j=1}^{K}\mathrm{BCE}(f_{\theta}(p_{ij}, z_{i}), o_{ij}) + \mathrm{KL}[q_{\phi}(z|(p_{ij}, o_{ij}))\,\|\,p_{0}(z)]$.
  • Differentiable Volumetric Rendering 2020: 3d points + encoded image vector -> occupancy and color (for all points).
    • forward pass (rendering) : find surface point along the pixel ray, and get color.
    • backward pass : gradient based on color difference from pixel re-projection.
  • NeRF: integrates all the points along the ray to get color and depth (whereas Differentiable Volumetric Rendering uses only the surface point).
    • GRAF 2020: predicts without camera poses; samples rays (a patch) and uses a discriminator.
  • Convolutional Occupancy Networks 2020: uses a 3D feature volume.
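The ray integration above can be sketched as standard alpha compositing: each sample along the ray contributes color weighted by its alpha and the transmittance accumulated so far. The densities, colors, and step sizes below are toy values, not the output of a trained network.

```python
import math

def render_ray(sigmas, colors, deltas):
    """NeRF-style volume rendering along one ray (scalar toy colors)."""
    transmittance = 1.0
    out_color = 0.0
    depth = 0.0
    t = 0.0
    for sigma, color, delta in zip(sigmas, colors, deltas):
        t += delta
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution of this sample
        out_color += weight * color
        depth += weight * t                      # expected termination depth
        transmittance *= 1.0 - alpha             # light left for later samples
    return out_color, depth

# A ray hitting a dense white region at its second sample.
color, depth = render_ray(sigmas=[0.0, 50.0, 50.0],
                          colors=[0.0, 1.0, 1.0],
                          deltas=[0.5, 0.5, 0.5])
```

Because the second sample is nearly opaque, almost all the weight lands there: the rendered color is ~1.0 and the expected depth ~1.0.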

Reconstructing the Plenoptic Function (Noah Snavely) 2020/10, Notes.

Understanding and Extending Neural Radiance Fields (Jonathan T. Barron) 2022/10. See more in My Neural Rendering Page, My Deep Learning 3D Reconstruction Page.

Learning to Retime People in Videos (Tali Dekel) 2020/10

  • Analyzing, Visualizing and Re-rendering people in videos.
    • motion visualization, depth prediction, SpeedNet : adaptive speed up video
  • Change the speed of individual people within frames. Layered Neural Rendering for Retiming People in Video. interesting work!
    • Key challenges : space-time correlations; occlusions/dis-occlusions.
    • Layered Decomposition, then we can edit the video by changing the layers.
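Why layered decomposition makes editing easy can be seen from the compositing step: once each person has an RGBA layer, retiming is just re-compositing the layers back-to-front with the standard "over" operator. The single-pixel scalar colors below are toy values, and this is only the compositing half, not the learned decomposition.

```python
def composite(layers):
    """Back-to-front 'over' compositing; layers = [(color, alpha), ...],
    background first. Colors are scalar toy values for one pixel."""
    out = 0.0
    for color, alpha in layers:
        out = color * alpha + out * (1.0 - alpha)
    return out

background = (0.2, 1.0)   # opaque background layer
person_a = (0.9, 0.5)     # semi-transparent person layer
frame = composite([background, person_a])  # person present in this frame
retimed = composite([background])          # person retimed out of this frame
```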

Reflections on Image-Based Rendering (Richard Szeliski) 2021/01. An overview.

Neural Fields Beyond Novel View Synthesis (Andrea Tagliasacchi) 2023/01: View understanding, Camera knowledge, Overfitting regime. NeRF : geometry + appearance.

4. Self-Supervised Learning

On Removing Supervision from Contrastive Self-Supervised Learning 2021/01 by Alexei Efros. Self-Supervised Learning (use the tools of supervised learning, but with raw data instead of human-provided labels):

  • Self-Supervised Learning allows us to get away from top-down (semantic) categorization (jumping beyond concrete objects, toward Plato's Ideas).
  • Self-Supervised Learning enables continuous life-long learning.
    • we never see the same ‘training data’ twice in real life; data augmentation encourages memorizing. -> Online Continual Learning: keep training on new data.
    • Test-Time Training 2020, uses self-supervision to adapt to new data.
    • Practice is interactive: for a machine to become more human-like it also needs practice, so merely feeding it data one-way is certainly not enough; it must act on the world in some way. And this interaction cannot be merely mechanical, it needs "agency".

Learning Representations and Geometry from Unlabeled Videos (Andrea Vedaldi) 2021/01. horizontal problems, vertical problems. Contrastive Learning : vector representations.

5. SLAM & Robotics

New Methods for Reconstruction and Neural Rendering (Christian Theobalt) 2020/11

  • Monocular reconstruction : human hand, human skeleton, human performance (surface), 3d face.
  • Nerf : Deep relightable texture. StyleRig -> pose & light.
  • Neural Sparse Voxel Fields 2020.

Pushing Factor Graphs beyond SLAM (Frank Dellaert) 2020/12, GTSAM. Factor Graph Introduction. use case : Skydio drone, navigation, tracking and motion planning.

  • SLAM & GTSAM. Sparse Hessian Matrix - Bayes Tree : Incremental & Distributed (sub-trees).
  • Structure from Motion. GTSFM (a really nice piece of work): parallelizes SfM over large clusters using DASK.
  • Navigation and Control. IMU-preintegration factor is integrated inside GTSAM.
  • More.
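The sparse least-squares problem behind a factor graph can be shown on a toy 1D pose graph: a prior factor on x0 plus odometry factors between consecutive poses. The problem is linear, so assembling and solving the normal equations once gives the exact answer; the hand-rolled elimination below stands in for what GTSAM does far more cleverly with the Bayes tree on the sparse Hessian. All numbers are toy values, and this is not the GTSAM API.

```python
def solve(prior, odometry):
    """Solve a 1D pose graph: x0 = prior, x_{i+1} - x_i = odometry[i]."""
    n = len(odometry) + 1
    # Normal equations H x = g for unit-weight factors.
    H = [[0.0] * n for _ in range(n)]
    g = [0.0] * n
    H[0][0] += 1.0; g[0] += prior          # prior factor on x0
    for i, d in enumerate(odometry):       # odometry factors
        H[i][i] += 1.0; H[i + 1][i + 1] += 1.0
        H[i][i + 1] -= 1.0; H[i + 1][i] -= 1.0
        g[i] -= d; g[i + 1] += d
    # Gaussian elimination (the sparsity of H is what the Bayes tree exploits).
    for c in range(n):
        for r in range(c + 1, n):
            f = H[r][c] / H[c][c]
            for k in range(c, n):
                H[r][k] -= f * H[c][k]
            g[r] -= f * g[c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (g[r] - sum(H[r][k] * x[k] for k in range(r + 1, n))) / H[r][r]
    return x

poses = solve(prior=0.0, odometry=[1.0, 1.0])  # expect poses [0, 1, 2]
```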

Sights, Sounds, and Space: Audio-visual Learning in 3D (Kristen Grauman) 2020/12. Objective : indoor robot mapping & navigation.

Towards Graph-Based Spatial AI (Andrew Davison) 2020/10. SLAM evolving into Spatial AI.

A Question of Representation in 3D Computer Vision (Bharath Hariharan) 2020/09. Task: image -> 3D bbox and 3D shape output; existing methods perform badly on various benchmarks.

Learning to Walk with Vision and Proprioception (Jitendra Malik) 2022/01. “we see in order to move and we move in order to see”. “Anaxagoras: It is because of his being armed with hands that man is the most intelligent animal”. Rapid Motor Adaptation for Legged Robots 2021

  1. Walking in simulation.
    • Previous work: Animal Gaits. Computational Gaits - Central Pattern Generators. Real People Gaits.
    • RL: Environmental Factor Encoder + State & Old Action -> Base Policy -> Action, while minimizing work and ground impact.
  2. Walking in real world (blindly) via rapid motor adaption.
    • Some of the environment variables are unavailable. Adaptation Module : uses the history of actions & states to estimate the environment variables.
    • Adaptation Module can be pre-trained in simulation by Environmental Factor Encoder.
  3. Walking at different linear and angular velocities in the real world (change target speed): Minimizing Energy Consumption Leads to the Emergence of Gaits in Legged Robots 2021.
    • The robot shows different gaits at different speeds.
  4. Navigation to a point goal with vision and proprioception: Coupling Vision and Proprioception for Navigation of Legged Robots 2021, robot with RGBD camera.
    • Occupancy Map & Cost Map -> Velocity Command Generator.
  5. Epilogue : Layered Sensorimotor Architectures meet Deep RL.
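The Rapid Motor Adaptation structure in steps 1-2 can be sketched as three pieces: a base policy consuming state plus an environment latent z; a privileged Environmental Factor Encoder producing z in simulation; and an Adaptation Module regressing z from recent state-action history at deployment. The simple averages below are stand-ins for the real networks, and all numbers are toy values.

```python
def env_factor_encoder(env_params):
    """Privileged encoder: sees true friction, payload, etc. (sim only)."""
    return sum(env_params) / len(env_params)

def adaptation_module(history):
    """Deployment-time estimate of the latent from (state, action) pairs."""
    return sum(s * a for s, a in history) / len(history)

def base_policy(state, z):
    """Placeholder for the learned base policy network."""
    return state + z

# In simulation: train base_policy against the privileged latent ...
z_sim = env_factor_encoder([0.2, 0.4])
# ... then train adaptation_module to reproduce that latent from history,
# so the robot can adapt in the real world without privileged inputs.
z_hat = adaptation_module([(1.0, 0.3), (1.0, 0.3)])
action = base_policy(0.5, z_hat)
```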

6. Language

Explainability and Compositionality for Visual Recognition (Zeynep Akata) 2021/01.

  • Learning with Explanation with Minimal Supervision — Zero-Shot Learning.
    • Image -> Image Features <-(F)-> Class Attributes <- Class Labels.
    • Zero-Shot Learning trains the mapping F, but human-made attributes are needed.
    • Data Augmentation : Text-to-Image GAN. Text-to-ImageFeature GAN/VAE.
  • Generating Explanations using Attributes and Natural Language — Image-to-Text.
    • towards effective human-machine communication.
  • Summary, Ongoing work and future work.
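The mapping F above is commonly a bilinear compatibility F(x, a) = x^T W a between image features x and class attributes a, so unseen classes can be recognized from their attribute vectors alone. The identity W and the two-attribute classes below are toy illustrations, not the talk's actual model.

```python
def compatibility(x, W, a):
    """Bilinear compatibility score F(x, a) = x^T W a."""
    return sum(x[i] * W[i][j] * a[j]
               for i in range(len(x)) for j in range(len(a)))

def classify(x, W, class_attributes):
    """Predict the class whose attribute vector is most compatible with x."""
    return max(class_attributes,
               key=lambda cls: compatibility(x, W, class_attributes[cls]))

W = [[1.0, 0.0], [0.0, 1.0]]  # toy: features already live in attribute space
class_attributes = {"zebra": [1.0, 0.0],    # striped, not spotted
                    "leopard": [0.0, 1.0]}  # spotted, not striped
label = classify([0.9, 0.1], W, class_attributes)  # mostly "striped" features
```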

7. AR/VR/MR

Photorealistic Telepresence (Yaser Sheikh) 2020/12, from Facebook. Face-to-face social interaction in distance. True presence rather than “perceptually plausible” — Enable Authentic Communication in Artificial Reality.

8. 3D Objects

AI for 3D Content Creation (Sanja Fidler) 2020/09, NVIDIA Kaolin.

  • Manual Creation is Slow (e.g. GTA).
  • Worlds (Scene Composition) :
    • Scene layout : probabilistic grammar, Meta-Sim 2019, Meta-Sim2 2020: (1) encode scene with GNN; (2) distribution matching by comparing images; (3) task optimization.
    • Assets : Make graphic rendering differentiable -> able to train.
  • Other works. GameGAN 2020

Shape Reps: Parametric Meshes vs Implicit Functions (Gerard Pons-Moll) 2020/09, Realistic virtual humans : generation & perception, with different representations.

Joint Learning Over Visual and Geometric Data (Leonidas Guibas) 2021/08.

  • Multi-Modal 3D object Detection.
  • SE(3)-equivariant networks.
  • Category-Level Object Pose Estimation.
  • Latent Spatio-Temporal Representations.
  • Exploiting Consistency among Learning Tasks.

Making 3D Predictions with 2D Supervision (Justin Johnson) 2022/08

  • Mesh R-CNN 2019 : Supervised Shape Prediction, single image -> 3d bbx detection with mesh.
    • Pixel2Mesh 2018: Iterative mesh refinement (deformation), but has limitation on topology.
    • this paper -> deforms predicted voxels into a mesh.
  • Differentiable Rendering + PyTorch3d 2020, differentiable render 3d geometry to 2d to make 2d loss.
    • Traditional Pipeline: Rasterization (not differentiable: boundaries are not continuous) + Shading.
    • Solution SoftRas 2019: blur the boundary to be continuous.
    • Refinement in this paper (more efficient): K nearest faces; coarse-to-fine; move shading to PyTorch; heterogeneous batching.
  • Unsupervised Shape Prediction from single view, trained with Differentiable Rendering.
    • Trained with a second view. Mesh predicts: offset for each vertex (from template sphere mesh).
    • Sphere GCN (graph convolution) model (outperforms Sphere FC).
  • SynSin : Single-Image View Synthesis 2020, trained by images (video) only.
    1. Predict per-pixel features + depth.
    2. Projection by transformation (features & depth) to new view.
    3. Generator to predict image.
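Step 2 above (warping features with predicted depth into the new view) boils down to pinhole reprojection: lift a pixel to 3D with its depth, apply the relative camera transform, and project into the target view. The toy intrinsics and pure-translation camera motion below are assumptions for brevity; the real system splats learned features, not single pixels.

```python
def reproject(u, v, depth, f, cx, cy, translation):
    """Reproject pixel (u, v) with known depth into a translated camera.
    f, cx, cy: toy pinhole intrinsics; rotation omitted for brevity."""
    # Unproject to the source camera frame.
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    z = depth
    # Move into the target camera frame (translation only, by assumption).
    tx, ty, tz = translation
    x, y, z = x + tx, y + ty, z + tz
    # Project into the target image.
    return f * x / z + cx, f * y / z + cy

# A pixel at the principal point, 2 m deep, seen from a camera shifted 0.1 m.
u2, v2 = reproject(64.0, 64.0, 2.0, f=100.0, cx=64.0, cy=64.0,
                   translation=(0.1, 0.0, 0.0))  # shifts 5 px horizontally
```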

9. Others

The Moon Camera (Bill Freeman) 2020/10. Attempts to photograph the Earth from space using the Moon as a camera, and several computational imaging projects resulting from those attempts.

  • Approach 1. Measuring diffuse reflections of Earthshine from the Moon.
  • Approach 2. Observing the fuzzy boundaries of cast shadows of Earthshine on the Moon.
  • Approach 3. Measuring the specular reflections of modulations within sunlight.
    • Intensity change; spectrum change; modulation spectrum change.