Table of Contents
- Generation & GANs
- Autonomous Driving
- Image-based Rendering
- Self-Supervised Learning
- SLAM & Robotics
- Language
- AR/VR/MR
- 3D Objects
- Others
1. Generation & GANs
New Generative Models for Images, Landscape Videos and 3D Human Avatars (Victor Lempitsky) 2021/02.
- StyleGAN for Landscape Videos: DeepLandscape.
- network feature : duplicated latents - two upsampling structures (one small, one large).
- discriminator : unary (uses the smaller one) & pairwise (uses both). Warps noise maps by homography transformations.
- StyleGAN for 3D Human Avatars. SMPL-X
Controllable Content Generation without Direct Supervision (Niloy Mitra) 2020/12, Smart Geometry Processing Group. Adobe.
- Industrial Design : Ideation Sketching -> CAD Modeling -> 3D model.
- DL Problems : Representation, topology + geometry + material, avoid using 3d training data.
- Learning from Rasterized Vector Data.
- Discovering Pattern Structure Using Differentiable Compositing 2020 - Pattern Expansion - Edit in the Wild, while preserving texture structure.
- Learning from Procedural 3D Data.
- Revisiting ‘Plato’s Cave’ : Escaping Plato’s Cave: 3D Shape From Adversarial Rendering 2018.
- train without access to 3d data.
- 2d image -> 3d volume -> 2d rendered image -> 2d structured samples.
- Learning Object-aware Scene Representations from Unlabeled Images/Videos.
2. Autonomous Driving
A Future With Self-Driving Vehicles (Raquel Urtasun) 2021/02.
Autonomy:
- we want a system : Trainable end-to-end & Interpretable for Validation.
- End-to-end Approaches. Direct, but not interpretable.
- Autonomy Stack.
- HD Maps / Sensors -> Perception -> Prediction -> Planning -> Control.
- Interpretable, but very poor development productivity.
- Joint Perception + Prediction :
- Fast and Furious 2020: lidar object detection & prediction.
- Interaction Reasoning Network. Spatially-Aware Graph Neural Networks 2019:
- Predict considering interaction using GNN.
- Predicting Marginal Distributions: real-world decisions should be discrete; consider scenarios separately.
- V2VNet 2020: share NN-encoded sensor data between vehicles -> then fuse with a GNN. Evaluated in simulation.
- Joint Perception + Prediction + Planning : Uber ATG Vision: Interpretable Neural Motion Planner
- Neural Motion Planner 2019: add a planning branch to the network -> time & ego-car position.
- DSDNet 2020. (1) multi-modal socially-consistent uncertainty; (2) explicitly condition on prediction; (3) use prior (human) knowledge.
- P3: Safe Motion Planning Through Interpretable Semantic Representations. Recurrent semantic occupancy map -> to avoid occupied regions.
Simulation: Structured Testing, Real World Replay, Sensor Simulation.
- Lidar simulation : TrafficSim 2021, uses real-world data (real 3D assets) to generate perception & prediction data.
- Camera (multi-camera video) simulation : GeoSim 2021, uses real-world data for generation (through a multi-view, multi-sensor reconstruction network).
3. Image-based Rendering
Neural Implicit Representations for 3D Vision (Andreas Geiger) 2020/09. cvpr talk pdf.
- 3d representations:
- Direct representation : voxels, points, meshes.
- Implicit representation : decision boundary of a non-linear classifier.
- Occupancy Network : $L(\theta, \phi) = \sum_{j=1}^{K} \mathrm{BCE}(f_{\theta}(p_{ij}, z_i), o_{ij}) + \mathrm{KL}\left[ q_{\phi}(z \mid (p_{ij}, o_{ij})_{j=1:K}) \,\|\, p_0(z) \right]$.
- Given the 3d model, we can go further : Texture Fields 2019 predicts a color for each 3d point. Occupancy Flow 2019 predicts 4d - occupancy and velocity.
- Differentiable Volumetric Rendering 2020: 3d points + encoded image vector -> occupancy and color (for all points).
- forward pass (rendering) : find surface point along the pixel ray, and get color.
- backward pass : gradient based on color difference from pixel re-projection.
- NeRF: integrate all the points along the ray to get color and depth (while the occupancy approach above used only the surface point); see the rendering sketch after this list.
- GRAF 2020 predicts without camera poses; samples rays (patches) and uses a discriminator.
- Convolutional Occupancy Networks 2020 uses a 3d feature volume.
- can also use Fourier Features 2020; Fourier features help MLPs fit high-frequency functions.
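A minimal sketch of the volume-rendering quadrature described above (inputs are random placeholders; `render_rays`, the shapes, and the large far-distance constant are illustrative assumptions, not any paper's exact code):

```python
import torch

# NeRF-style quadrature: alpha-composite densities and colors along each ray.
# sigma: (R, S) densities, rgb: (R, S, 3) colors, t: (R, S) sample depths.
def render_rays(sigma, rgb, t):
    delta = t[:, 1:] - t[:, :-1]                              # spacing between samples
    delta = torch.cat([delta, 1e10 * torch.ones_like(delta[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                   # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)        # transmittance T_i
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                                   # contribution of each sample
    color = (weights[..., None] * rgb).sum(dim=1)             # (R, 3) pixel colors
    depth = (weights * t).sum(dim=1)                          # (R,) expected depth
    return color, depth

color, depth = render_rays(torch.rand(4, 64), torch.rand(4, 64, 3),
                           torch.linspace(2.0, 6.0, 64).expand(4, 64))
```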
Reconstructing the Plenoptic Function (Noah Snavely) 2020/10, Notes.
Understanding and Extending Neural Radiance Fields (Jonathan T. Barron) 2022/10. See more in My Neural Rendering Page, My Deep Learning 3D Reconstruction Page.
- NeRF.
- How does NeRF work? Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. Experiments:
- Toy problem : memorizing a 2d image, a network to predict color for pixel.
- coordinate to color - (x, y) to (r, g, b). failed.
- coordinate’s Fourier features (~ positional encoding) to color. succeeds (see the sketch below).
- Neural Tangent Kernel (neural networks behave like kernel regression; ReLU MLPs correspond to a ‘dot product’ kernel).
- with the dot product taken over Fourier features, the kernel becomes stationary, making the MLP act like a "convolution".
- See Instant Neural Graphics Primitives for a great implementation.
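A minimal sketch of the Fourier feature mapping from the experiment above (the random frequency matrix `B` and its scale are assumptions for illustration):

```python
import torch

# gamma(v) = [cos(2*pi*B v), sin(2*pi*B v)]: lift 2D coordinates into 2m features.
B = torch.randn(256, 2) * 10.0            # random frequencies; scale sets the bandwidth

def fourier_features(coords):             # coords: (N, 2) in [0, 1]^2
    proj = 2.0 * torch.pi * coords @ B.T  # (N, 256)
    return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)  # (N, 512)

feats = fourier_features(torch.rand(1024, 2))  # feed these, not raw (x, y), to the MLP
```

The same MLP that fails on raw (x, y) coordinates fits the image fine on these features.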
- NeRF in the Wild, with appearance & transient embeddings.
Learning to Retime People in Videos (Tali Dekel) 2020/10
- Analyzing, Visualizing and Re-rendering people in videos.
- motion visualization, depth prediction, SpeedNet : adaptively speed up videos.
- Change the speed of individual people within frames: Layered Neural Rendering for Retiming People in Video (interesting work!).
- Key challenges : space-time correlations; occlusions/dis-occlusions.
- Layered Decomposition, then we can edit the video by changing the layers.
Reflections on Image-Based Rendering (Richard Szeliski) 2021/01. An overview.
- Multi-View Stereo. Usage : View Interpolation, View Morphing, interactive 3d scene, etc. Idea behind: Plane Sweep Stereo (~Patch Match).
- Image-Based Rendering: Depth Layers, Multi-plane Images.
- 360 video (panorama).
- 360 with complete light field: Google Jump 2015, Facebook Surround 360 2016. Stereo with two 360 cameras.
- Immersive Video Stabilization by ‘Spatio-Temporal MRF Stitch’ : reconstruct and merge pictures.
- Large-scale reconstruction based: cross-fade between images to move from one view to another:
- Photo Tourism: Exploring Photo Collections in 3D, using images.
- Piecewise Planar Stereo for Image-based Rendering 2009, using depth layers.
- Ambient Point Clouds for View Interpolation 2010, using point clouds.
- Single-image based:
- Practical 3D Photography 2018, using the iPhone depth sensor.
- Using mono-depth: One Shot 3D Photography 2020. And ‘google photos cinematic effect’.
- Reflections and transparency : Rear layer & normal layer. Gradient domain depth.
- Neural Rendering.
- SynSin: End-to-end View Synthesis from a Single Image 2019: predicts a heuristic depth map; multi-plane images with depth features, and a decoder to generate the new view.
- Animating Pictures with Eulerian Motion Fields 2021: predicts a heuristic motion map; traces the motion of deep features, and a decoder generates the new view.
Neural Fields Beyond Novel View Synthesis (Andrea Tagliasacchi) 2023/01: View understanding, Camera knowledge, Overfitting regime. NeRF : geometry + appearance.
- Learn from One Look 2022, a conditioned NeRF model; hard-surface loss to remove the smoke effect.
- Fast NeRF Rendering, Neural Textured Mesh: MobileNeRF 2022.
- previous works : fast training (Instant-NGP); fast rendering (different representations : NVDiffRec, Neural-PIL; hardware limitations : FastNeRF, DONeRF, KiloNeRF, SNeRG).
- Scene is a (neural) textured mesh, with a small MLP to decode the rasterized features.
- Neural Semantic Field 2022: field-to-field translation network.
- Analyze the neighborhood in the field, then generate the new field.
- nerf2nerf: Pairwise Registration of Neural Radiance Fields 2023, pairwise alignment of neural fields.
4. Self-Supervised Learning
On Removing Supervision from Contrastive Self-Supervised Learning 2021/01 by Alexei Efros. Self-Supervised Learning (use the tools of supervised learning, but with raw data instead of human-provided labels):
- Self-Supervised Learning allows getting away from top-down (semantic) categorization (jumping beyond concrete objects, toward Plato’s ideal Forms).
- Per-exemplar SVM : Recognition by Association via Learning Per-exemplar Distances 2008, Exemplar-SVM 2011, Exemplar-CNN 2014.
- Similarity Learning (Contrastive Learning): learning distances between data points (see the loss sketch at the end of this talk’s notes).
- Data Augmentation boosts similarity learning, and even acts as supervision (“leaks in”) - What Should Not Be Contrastive in Contrastive Learning 2021.
- Contrastive Learning without Data Augmentation - Time as Supervisory Signal (temporal continuity is important to animals): video as a graph.
- Contrastive Learning for Unpaired Image-to-Image Translation 2020: using GAN loss, close in structure space, and far in texture space.
- Self-Supervised Learning enables continuous life-long learning.
- We never see the same ‘training data’ twice in real life; data augmentation encourages memorization. -> Online Continual Learning: keep training on new data.
- Test-Time Training 2020, uses self-supervision to adapt to new data.
- (Practice is interactive: for a machine to become more human-like, it too needs practice, so one-way feeding of data is certainly not enough; it must act on the world in some way. And this interaction cannot be merely mechanical - it needs ‘agency’.)
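A minimal sketch of the similarity-learning loss mentioned above (a generic InfoNCE / NT-Xent formulation over a batch of paired views; not any specific paper’s implementation):

```python
import torch
import torch.nn.functional as F

# Two augmented views per image: positives are the paired rows,
# negatives are all other rows in the batch.
def info_nce(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau               # (N, N) cosine similarities
    labels = torch.arange(z1.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```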
Learning Representations and Geometry from Unlabeled Videos (Andrea Vedaldi) 2021/01. horizontal problems, vertical problems. Contrastive Learning : vector representations.
- Video Timeshift and Inverse: Multi-modal Self-Supervision from Generalized Data Transformations
- Video with Caption: Captioning as a modality for contrastive learning. Support-set bottlenecks for video-text representation learning, using cross-captioning to be robust against wrong caption.
- Image/Video Labelling:
- Clustering the representation vectors. Deep Clustering for Unsupervised Learning of Visual Features learns the clustering and the representation network.
- Self-labelling via simultaneous clustering and representation learning, label assignment by probability.
- Labelling unlabelled videos from scratch with multi-modal self-supervision
- Video-to-Geometry:
- Learning Landmarks from Unaligned Data using Image Translation: autoencode to a ‘shape code’ (2d landmarks), then use a decoder to reconstruct the original image.
- Exemplar Fine-Tuning for 3D Human Model Fitting, video to human 3d model.
- C3DPO - Canonical 3D Pose Networks for Non-rigid Structure From Motion. 2d landmarks to predict model and camera pose.
- Canonical 3D Deformer Maps, predicts both depth maps and canonical maps.
- Texture transfer; use symmetry as supervision.
5. SLAM & Robotics
New Methods for Reconstruction and Neural Rendering (Christian Theobalt) 2020/11
- Monocular reconstruction : human hand, human skeleton, human performance (surface), 3d face.
- Nerf : Deep relightable texture. StyleRig -> pose & light.
- Neural Sparse Voxel Fields 2020.
Pushing Factor Graphs beyond SLAM (Frank Dellaert) 2020/12, GTSAM. Factor Graph Introduction. Use cases : Skydio drones, navigation, tracking and motion planning.
- SLAM & GTSAM. Sparse Hessian Matrix - Bayes Tree : Incremental & Distributed (sub-trees). (A minimal GTSAM example follows this list.)
- iSAM 2012, (ICE-BA 2018).
- Structure from Motion. GTSFM (a really nice work!), parallelizes SfM over large clusters using DASK.
- DMV (Detection/Description + Matching + Verification) -> Essential Matrix.
- Shonan Rotation Averaging
- Navigation and Control. IMU-preintegration factor is integrated inside GTSAM.
- More.
- Batch and Incremental Kinodynamic Motion Planning using Dynamic Factor Graphs: uses factor graphs to encode robot dynamics, applied to kinodynamic motion planning.
- Optimize control parameters for drone planning.
- SwiftFusion: integration with Swift for TensorFlow; functions can be made differentiable automatically.
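A minimal GTSAM pose-graph sketch (the classic 2D odometry setup, assuming the current `gtsam` Python bindings; noise values are placeholders):

```python
import numpy as np
import gtsam

# Prior on the first pose + two odometry "between" factors, then batch optimization.
graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.3, 0.3, 0.1]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))

graph.add(gtsam.PriorFactorPose2(1, gtsam.Pose2(0, 0, 0), prior_noise))
graph.add(gtsam.BetweenFactorPose2(1, 2, gtsam.Pose2(2, 0, 0), odom_noise))
graph.add(gtsam.BetweenFactorPose2(2, 3, gtsam.Pose2(2, 0, 0), odom_noise))

initial = gtsam.Values()                  # deliberately perturbed initial guesses
initial.insert(1, gtsam.Pose2(0.5, 0.0, 0.2))
initial.insert(2, gtsam.Pose2(2.3, 0.1, -0.2))
initial.insert(3, gtsam.Pose2(4.1, 0.1, 0.1))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
for k in (1, 2, 3):
    print(result.atPose2(k))              # near (0,0,0), (2,0,0), (4,0,0)
```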
Sights, Sounds, and Space: Audio-visual Learning in 3D (Kristen Grauman) 2020/12. Objective : indoor robot mapping & navigation.
- SoundSpaces : Realistic 3D environments and simulation - with 3D sound.
- Audio-visual embodied Navigation 2019 : vision + audio + gps -> Critic + Actor -> Action Sampler. (Finding alert task).
- audio-visual waypoints.
- Semantic audio-visual Navigation 2020, put all the environmental noise together.
- Audio-Visual Floorplan Reconstruction 2020, github, semantic room mapping. sound contains information of geometry.
- VisualEchoes: Spatial Image Representation Learning through Echolocation 2020 (very interesting topic!). The agent makes sounds and listens to the echoes; supervision comes from acoustically interacting with the physical world.
- helps in depth/normal estimation and navigation tasks.
- VisualEcho-Net + Echo-Net -> predict orientation (self-supervised: echo and visual results should match).
Towards Graph-Based Spatial AI (Andrew Davison) 2020/10. SLAM evolving into Spatial AI.
- FutureMapping: The Computational Structure of Spatial AI Systems 2018
- Representation is important (End-to-end might not be possible).
- There should be a generality to Spatial AI system (for various applications).
- SLAM: MonoSLAM 2003, ElasticFusion 2016, SemanticFusion 2017.
- New Representations for Spatial AI:
- keyframes : CodeSLAM 2018, SceneCode 2019, per-frame code for depths & semantics.
- Dynamic Scene Graphs. SLAM with objects : MoreFusion 2020, NodeSLAM 2020.
- Hardware : Event Cameras. (Co-design of) Processors.
- Gaussian Belief Propagation for Spatial AI: propagate the mean and covariance of each node through the graph (toy sketch below).
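A toy sketch of the idea on a scalar chain x0 - x1 - x2 (a tree, where GBP is exact); the information-form bookkeeping and noise values are illustrative assumptions:

```python
import numpy as np

# Factors in information form, p(x) ∝ exp(-0.5 x'Λx + η'x).
# Prior: x0 ~ N(0, 1). Between factors: x_{i+1} - x_i = 1 with sigma = 0.1.
n = 3
prior_eta, prior_lam = 0.0, 1.0                        # unary factor on x0
d, lam_m = 1.0, 1.0 / 0.1**2
Lam_f = lam_m * np.array([[1.0, -1.0], [-1.0, 1.0]])   # pairwise factor over (x_i, x_j)
eta_f = lam_m * d * np.array([-1.0, 1.0])

msgs = {(i, j): (0.0, 0.0) for i in range(n) for j in (i - 1, i + 1) if 0 <= j < n}

def belief(i, exclude=None):
    """Belief at variable i, optionally excluding the message back from one neighbor."""
    eta = prior_eta if i == 0 else 0.0
    lam = prior_lam if i == 0 else 0.0
    for k in (i - 1, i + 1):
        if 0 <= k < n and k != exclude:
            eta, lam = eta + msgs[(k, i)][0], lam + msgs[(k, i)][1]
    return eta, lam

for _ in range(10):                                    # a few sweeps suffice on a chain
    new = {}
    for (i, j) in msgs:
        eta_i, lam_i = belief(i, exclude=j)            # i's belief, excluding j's message
        a, b = (0, 1) if j == i + 1 else (1, 0)        # slots of (i, j) in the factor
        Laa = Lam_f[a, a] + lam_i                      # condition on i, marginalize it out
        new[(i, j)] = (eta_f[b] - Lam_f[b, a] / Laa * (eta_f[a] + eta_i),
                       Lam_f[b, b] - Lam_f[b, a] / Laa * Lam_f[a, b])
    msgs = new

for i in range(n):
    eta, lam = belief(i)
    print(f"x{i}: mean={eta / lam:.3f}, var={1 / lam:.3f}")   # means ~0, 1, 2
```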
A Question of Representation in 3D Computer Vision (Bharath Hariharan) 2020/09. Task: image -> 3d bbx and 3d shape output. performs badly on various benchmarks.
- NN to include inductive biases for 3d reasoning. large gap between stereo (depth image based NN) & lidar (pointcloud based NN).
- 2d depth neighborhood is not 3d neighborhood, 2d convolution not ideal for 3d structure.
- Pseudo-lidar pipeline : PLUMENet: Efficient 3D Object Detection from Stereo Images 2021, converts the depth image to a Bird’s-Eye View -> significantly narrows the gap (see the unprojection sketch after this list).
- Output representations of 3d that capture priors and allow learning from point clouds.
- Training regimes that remove reliance on millions of labeled examples.
- feature descriptors training:
- need 2d match ground truth generated by SFM -> Biased.
- image homography : limited.
- Learning Feature Descriptors using Camera Pose Supervision, github.
- camera pose -> epipolar loss.
- differentiable matching : probability distribution (heat map).
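A minimal sketch of the pseudo-lidar conversion mentioned above: unproject a depth map into a 3D point cloud with pinhole intrinsics (`fx, fy, cx, cy` are hypothetical values):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Unproject an HxW depth map into an Nx3 point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx              # pinhole model: u = fx * x / z + cx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

points = depth_to_pseudo_lidar(np.full((480, 640), 5.0),
                               fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A point-cloud or BEV detector can then consume `points`, which is what narrows the stereo/lidar gap.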
Learning to Walk with Vision and Proprioception (Jitendra Malik) 2022/01. “we see in order to move and we move in order to see”. “Anaxagoras: It is because of his being armed with hands that man is the most intelligent animal”. Rapid Motor Adaptation for Legged Robots 2021
- Walking in simulation.
- Previous Works: Animal Gaits. Computational Gaits - Central Pattern Generators. Real People’s Gaits.
- RL: Environmental Factor Encoder + State & Previous Action -> Base Policy -> Action, while minimizing work and ground impact.
- Walking in real world (blindly) via rapid motor adaption.
- Some of the environment variables are unavailable in the real world. Adaptation Module : uses the history of actions & states to estimate the environment variables (sketch at the end of this talk’s notes).
- Adaptation Module can be pre-trained in simulation by Environmental Factor Encoder.
- Walking at different linear and angular velocities in the real world (change target speed): Minimizing Energy Consumption Leads to the Emergence of Gaits in Legged Robots 2021.
- The robot shows different gaits at different speeds.
- Navigation to a point goal with vision and proprioception: Coupling Vision and Proprioception for Navigation of Legged Robots 2021, robot with RGBD camera.
- Occupancy Map & Cost Map -> Velocity Command Generator.
- Epilogue : Layered Sensorimotor Architectures meet Deep RL.
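A minimal sketch of the two-phase RMA setup described above (all layer sizes and the 50-step history length are assumptions, not the paper’s exact architecture):

```python
import torch
import torch.nn as nn

env_encoder = nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 8))   # privileged e_t -> z_t
base_policy = nn.Sequential(nn.Linear(30 + 12 + 8, 128), nn.ReLU(), nn.Linear(128, 12))
adaptation = nn.Sequential(nn.Flatten(),                                      # history -> z_hat
                           nn.Linear(50 * (30 + 12), 128), nn.ReLU(), nn.Linear(128, 8))

state, prev_action = torch.randn(1, 30), torch.randn(1, 12)
env_factors = torch.randn(1, 17)               # available only in simulation
history = torch.randn(1, 50, 30 + 12)          # past states & actions, available on the robot

z_priv = env_encoder(env_factors)              # phase 1: RL with privileged encoding
z_hat = adaptation(history)                    # phase 2: regress z_hat toward z_priv
action = base_policy(torch.cat([state, prev_action, z_hat], dim=-1))
loss = nn.functional.mse_loss(z_hat, z_priv.detach())
```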
6. Language
Explainability and Compositionality for Visual Recognition (Zeynep Akata) 2021/01.
- Learning with Explanation with Minimal Supervision — Zero-Shot Learning.
- Image -> Image Features <-(F)-> Class Attributes <- Class Labels.
- Zero-Shot Learning trains the mapping F, but human-made attributes are needed (sketch after this list).
- Data Augmentation : Text-to-Image GAN. Text-to-ImageFeature GAN/VAE.
- Generating Explanations using Attributes and Natural Language — Image-to-Text.
- towards effective human-machine communication.
- Summary, Ongoing work and future work.
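A minimal sketch of the attribute-compatibility idea behind zero-shot learning (a bilinear score in the spirit of ALE-style methods; all dimensions and data are placeholders):

```python
import torch
import torch.nn as nn

img_feat = torch.randn(4, 512)           # image features for a batch of 4
attrs = torch.randn(10, 85)              # attribute vectors of the 10 *seen* classes
W = nn.Parameter(torch.randn(512, 85) * 0.01)

scores = img_feat @ W @ attrs.T          # compatibility F(x, c) = theta(x)^T W phi(c)
loss = nn.functional.cross_entropy(scores, torch.tensor([0, 3, 7, 2]))
# At test time, swap in *unseen*-class attribute vectors: classification works
# without a single training image of those classes.
```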
7. AR/VR/MR
Photorealistic Telepresence (Yaser Sheikh) 2020/12, from Facebook. Face-to-face social interaction at a distance. True presence rather than “perceptually plausible” — enabling authentic communication in artificial reality.
- CODEC AVATARS : Deep Appearance Models for Face Rendering 2018
- Encoder/Decoder structure : Human -(encoder)-> code -(decoder)-> Texture & Mesh -> Face.
- Training Data : Mugsy - a multi-camera rig shooting from all angles.
- Headset sensors : 4 eye cameras, 3 mouth cameras.
- Nerf based 3d reconstruction.
- Hand Tracking, even for very complex gestures. Constraining Dense Hand Surface Tracking with Elasticity 2020.
- Audio.
8. 3D Objects
AI for 3D Content Creation (Sanja Fidler) 2020/09, NVIDIA Kaolin.
- Manual Creation is Slow (e.g. GTA).
- Worlds (Scene Composition) :
- Scene layout : probabilistic grammar, Meta-Sim 2019, Meta-Sim2 2020: (1) encode scene with GNN; (2) distribution matching by comparing images; (3) task optimization.
- Assets : Make graphic rendering differentiable -> able to train.
- car generation : StyleGANRender 2021.
- DefTet 2020, deform Tetrahedral Meshes.
- Other works. GameGAN 2020
Shape Reps: Parametric Meshes vs Implicit Functions (Gerard Pons-Moll) 2020/09, Realistic virtual humans : generation & perception, with different representations.
- SMPL : A Skinned Multi-Person Linear Model 2015-2022
- Parametric Meshes. Self-supervised using video : video -> pose, shape, clothes -> 3d model -> video.
- SMPL + Cloths : People Snapshot Dataset 2018: pose, shape, cloth.
- SMPL + Garments : Multi-Garment Net 2019, predicts body and garment models. TailorNet 2019 learns how clothes deform.
- Mesh-based representations cannot handle complicated topology.
- Implicit Functions. (~SDF).
- Implicit Functions Net - For 3D Shape Reconstruction and Completion 2020: multi-scale grid features -(decoder)-> [0, 1] (inside/outside); see the decoder sketch at the end of this talk’s notes.
- works for watertight surfaces only; cannot model open manifolds, nor inner structures.
- Neural Distance Field 2020 : predicts an unsigned distance in R instead of occupancy in [0, 1].
- Hybrid Model.
- Unsupervised Shape and Pose Disentanglement for 3D Meshes 2020: pose and shape codes control the output, with an implicit code.
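A minimal sketch of an occupancy-style implicit decoder as discussed above ((point, shape code) -> inside/outside; layer sizes are placeholders, and an unsigned-distance variant would simply swap out the sigmoid head):

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Map a 3D point plus a latent shape code to an occupancy probability in [0, 1]."""
    def __init__(self, code_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, pts, code):                 # pts: (B, N, 3), code: (B, code_dim)
        code = code[:, None, :].expand(-1, pts.size(1), -1)
        return torch.sigmoid(self.net(torch.cat([pts, code], dim=-1)))

occ = ImplicitDecoder()(torch.rand(2, 1024, 3), torch.randn(2, 128))  # (2, 1024, 1)
```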
Joint Learning Over Visual and Geometric Data (Leonidas Guibas) 2021/08.
- Multi-Modal 3D object Detection.
- SE(3)-equivariant networks.
- Category-Level Object Pose Estimation.
- Latent Spatio-Temporal Representations.
- Exploiting Consistency among Learning Tasks.
Making 3D Predictions with 2D Supervision (Justin Johnson) 2022/08
- Mesh R-CNN 2019 : Supervised Shape Prediction, single image -> 3d bbx detection with mesh.
- Pixel2Mesh 2018: iterative mesh refinement (deformation), but limited in topology.
- this paper -> deforms 3d object voxels into a mesh.
- Differentiable Rendering + PyTorch3d 2020, differentiable render 3d geometry to 2d to make 2d loss.
- Traditional Pipeline: Rasterization (not differentiable: boundaries are not continuous) + Shading.
- Solution SoftRas 2019: blur the boundary so it becomes continuous (toy sketch at the end of this talk’s notes).
- Refinements in this paper (more efficient): K nearest faces; coarse-to-fine; move shading to PyTorch; heterogeneous batching.
- Unsupervised Shape Prediction from single view, trained with Differentiable Rendering.
- Trained with a second view. Mesh predicts: offset for each vertex (from template sphere mesh).
- Sphere GCN (graph convolution) model (outperforms Sphere FC).
- SynSin : Single-Image View Synthesis 2020, trained by images (video) only.
- Predict per-pixel features + depth.
- Projection by transformation (features & depth) to new view.
- Generator to predict image.
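A toy sketch of the soft-rasterization idea from SoftRas, reduced to a circle so it stays short: pixel coverage is a sigmoid of the signed distance to the boundary, so a plain 2D image loss back-propagates to the shape parameters (the circle setup and all constants are illustrative assumptions, not the paper’s method):

```python
import torch

ys, xs = torch.meshgrid(torch.linspace(0, 1, 64), torch.linspace(0, 1, 64), indexing="ij")

def soft_render(cx, cy, r, sigma=0.02):
    signed_dist = r - torch.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)  # > 0 inside
    return torch.sigmoid(signed_dist / sigma)     # hard version: (signed_dist > 0).float()

target = soft_render(torch.tensor(0.6), torch.tensor(0.4), torch.tensor(0.2))
params = torch.tensor([0.3, 0.7, 0.1], requires_grad=True)         # cx, cy, r init
opt = torch.optim.Adam([params], lr=0.02)
for _ in range(200):
    opt.zero_grad()
    loss = ((soft_render(*params) - target) ** 2).mean()           # 2D loss only
    loss.backward()                                # gradients flow through the blur
    opt.step()
```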
9. Others
The Moon Camera (Bill Freeman) 2020/10. attempts to photograph the Earth from space using the moon as a camera, and several Computational imaging projects resulting from those attempts.
- Approach 1 : measuring diffuse reflections of Earthshine from the Moon.
- Sphere Render : Visual Appearance of Matte Surfaces 1995. Probability model : Light transport matrix + Prior Covariance + Noise.
- Non-line-of-sight image research.
- Approach 2 : observing the fuzzy boundaries of cast shadows of Earthshine on the Moon.
- Crater shadow borders change as different amounts of Earthshine become available.
- Occlusion-based imaging : Corner camera: Turning Corners Into Cameras: Principles and Methods 2017. The shadow edge contains information about the scene behind the wall.
- Approach 3 : measuring the specular reflections of modulations within sunlight.
- Intensity change; spectrum change; modulation spectrum change.