Table of Contents
- ICP Covariance
- Line Feature Mapping
- Omnidirectional Camera
- XR Hand
- Continuous-Time Batch Calibration
- Image-based Rendering - MPIs
- ICCV 23
- 3D Object Tracking
1. ICP Covariance
ICP error source:
- Wrong convergence (to a local minimum), caused by errors in the initial pose estimation.
- Under-constrained situations: the problem is indeterminate.
- Mismatches.
- Sensor noise.
An accurate closed-form estimate of ICP’s covariance 2007. Uses the Hessian matrix as the estimate of the covariance (but this method in some cases greatly over-estimates the true covariance):
\[cov(\hat{x}) \approx 2\frac{residual}{K-3} [\frac{\partial^{2}}{\partial x^{2}}residual]^{-1}\]
This paper develops the following closed-form method:
\[cov(x) \approx [\frac{\partial^{2}}{\partial x^{2}}J]^{-1} [\frac{\partial^{2}}{\partial z\partial x}J]^{T} cov(z) [\frac{\partial^{2}}{\partial z\partial x}J] [\frac{\partial^{2}}{\partial x^{2}}J]^{-1}\]
A Closed-form Estimate of 3D ICP Covariance 2015. Based on the paper above, solving the point-to-point case.
On the Covariance of ICP-based Scan-matching Techniques 2016. Analyzes the Hessian-based method above. Finds that it fits point-to-plane ICP, but not point-to-point ICP.
A New Approach to 3D ICP Covariance Estimation 2019. Adds an additional term for the covariance coming from the initial pose estimation.
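A minimal numerical sketch of the closed-form estimate above, assuming a scalar ICP cost \(J(x, z)\) with pose vector x and stacked measurement vector z (the function names and the finite-difference scheme are mine, not from the papers, which derive the second derivatives analytically):

```python
import numpy as np

def grad_x(J, x, z, eps=1e-6):
    """Finite-difference gradient dJ/dx at (x, z)."""
    g = np.zeros(len(x))
    for i in range(len(x)):
        xp = x.copy(); xp[i] += eps
        xm = x.copy(); xm[i] -= eps
        g[i] = (J(xp, z) - J(xm, z)) / (2 * eps)
    return g

def icp_covariance(J, x_hat, z, cov_z, eps=1e-6):
    """cov(x) ~ Hxx^-1 Hzx cov(z) Hzx^T Hxx^-1 (the 2007 closed form).
    Hxx = d2J/dx2 and Hzx = d2J/(dz dx), built by differencing grad_x."""
    nx, nz = len(x_hat), len(z)
    g0 = grad_x(J, x_hat, z, eps)
    Hxx = np.zeros((nx, nx))
    Hzx = np.zeros((nx, nz))
    for i in range(nx):
        xp = x_hat.copy(); xp[i] += eps
        Hxx[:, i] = (grad_x(J, xp, z, eps) - g0) / eps
    for k in range(nz):
        zp = z.copy(); zp[k] += eps
        Hzx[:, k] = (grad_x(J, x_hat, zp, eps) - g0) / eps
    Hxx_inv = np.linalg.inv(Hxx)
    return Hxx_inv @ Hzx @ cov_z @ Hzx.T @ Hxx_inv
```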
2. Line Feature Mapping
3D Line Mapping Revisited 2023, github, my version with a COLMAP interface. ETH.
- Line proposal: line matching -> point-guided line triangulation (to overcome degenerate cases), using the orthonormal representation.
- Line features: DeepLSD; descriptors: LineTR.
- Line matcher: GlueStick (SuperGlue for lines).
- Proposal Scoring & Track Association.
- Joint Optimization.
- Tested localization on our benchmark; no improvement seen (more details in my repo).
UV-SLAM: Unconstrained Line-based SLAM Using Vanishing Points for Structural Mapping 2021. Uses vanishing points for structural mapping to avoid degeneracy in the Plücker representation.
PL-SLAM: a Stereo SLAM System through the Combination of Points and Line Segments 2017. Uses the orthonormal representation for lines and the 3D point representation for points in a visual SLAM pipeline (basically the ORB-SLAM2 structure). Also the first paper to derive the line Jacobians in detail.
Impact of Landmark Parameterization on Monocular EKF-SLAM with Points and Lines 2010. Projects lines into camera image space.
Structure-From-Motion Using Lines: Representation, Triangulation and Bundle Adjustment 2005. Based on the Plücker representation of a line (from two points or two planes: the direction of the line and its moment). The paper proposes an orthonormal representation of lines that takes only 4 DoF (three from SO(3) and one from SO(2)), making it easier to optimize.
- We used this factorization in our project, and it performs well. But in actual localization applications, point features are much more robust than this method.
- This should fit better for traffic-lane mapping, with fixed poses. A conversion sketch between the two line parameterizations follows below.
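For reference, a small sketch of the Plücker <-> orthonormal conversion from the 2005 paper, assuming Plücker coordinates (moment n, direction d) satisfying n·d = 0 (function names are mine):

```python
import numpy as np

def plucker_to_orthonormal(n, d):
    """(n, d) -> (U, W) with U in SO(3) and W in SO(2).
    The 4 DoF used in optimization are a 3-vector update of U
    and a single angle update of W."""
    c = np.cross(n, d)
    U = np.column_stack([n / np.linalg.norm(n),
                         d / np.linalg.norm(d),
                         c / np.linalg.norm(c)])
    s = np.hypot(np.linalg.norm(n), np.linalg.norm(d))
    w1, w2 = np.linalg.norm(n) / s, np.linalg.norm(d) / s
    W = np.array([[w1, -w2],
                  [w2,  w1]])
    return U, W

def orthonormal_to_plucker(U, W):
    """Recover (n, d) up to a common scale."""
    return W[0, 0] * U[:, 0], W[1, 0] * U[:, 1]
```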
3. Omnidirectional Camera
3.1 Calibration
Single View Point Omnidirectional Camera Calibration from Planar Grids 2007 (the OpenCV fisheye model is based on this paper).
A Multiple-Camera System Calibration Toolbox Using A Feature Descriptor-Based Calibration Pattern (OpenCV calibration is based on this paper).
3.2 Anti-Aliasing
Anti-Aliasing is important when converting panorama images to pinhole images.
A comparison of anti-aliasing techniques, starting with spatial anti-aliasing:
- SSAA (Supersampling anti-aliasing): in the target image, take several sample positions around each pixel, project them back to the source image (the panorama in our case) to fetch colors, and average them.
- MSAA (Multisample anti-aliasing): a speed-up over SSAA that shares samples among neighboring target pixels.
- Post-process anti-aliasing: FXAA, SMAA, CMAA, etc.
- Signal-processing approach: greatly attenuate frequencies above a certain limit, known as the Nyquist frequency, before resampling. An SSAA sketch for the panorama-to-pinhole case follows below.
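A hedged SSAA sketch for the panorama-to-pinhole conversion described above: each target pixel is covered by an n×n grid of sub-pixel samples, each sample ray is mapped to equirectangular coordinates, and the fetched colors are averaged. `K` (pinhole intrinsics), `R` (pinhole-to-panorama rotation), and the equirectangular convention are my assumptions; a real implementation would also prefer bilinear over nearest-neighbor fetches:

```python
import numpy as np

def pinhole_from_panorama(pano, K, R, out_w, out_h, n=4):
    """pano: (H, W, 3) equirectangular image; returns (out_h, out_w, 3)."""
    H, W, _ = pano.shape
    K_inv = np.linalg.inv(K)
    offs = (np.arange(n) + 0.5) / n          # sub-pixel offsets in [0, 1)
    out = np.zeros((out_h, out_w, 3))
    for v in range(out_h):
        for u in range(out_w):
            acc = np.zeros(3)
            for dv in offs:
                for du in offs:
                    ray = R @ K_inv @ np.array([u + du, v + dv, 1.0])
                    ray /= np.linalg.norm(ray)
                    lon = np.arctan2(ray[0], ray[2])          # [-pi, pi]
                    lat = np.arcsin(np.clip(ray[1], -1, 1))   # [-pi/2, pi/2]
                    x = (lon / (2 * np.pi) + 0.5) * (W - 1)
                    y = (lat / np.pi + 0.5) * (H - 1)
                    acc += pano[int(round(y)), int(round(x))]  # nearest fetch
            out[v, u] = acc / (n * n)
    return out
```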
3.3 Reconstruction
Egocentric Scene Reconstruction from an Omnidirectional Video, github. Fuses per-frame depth estimates into a novel spherical binoctree data structure that is specifically designed to tolerate spherical depth-estimation errors.
4. XR Hand
4.1 Meta
- Blob segmentation
- Image pyramids to find blobs at different scales (not run for all frames), to handle: separating merged blobs, detecting faint blobs, and finding the center of a close blob.
- In noisy scenes (e.g., holiday lights and trees):
- detect stationary 3D lights and reject them.
- use a CNN to validate blobs.
- LED Matching.
- “Brute matching” checks all the hypotheses; “proximity matching” uses prior pose information.
- All the blobs in the four images are collected for matching.
- Developed few-point (1-point, 2-point) matching algorithms.
- No more blog posts were released after Dec 2019, but more hand-tracking updates are available.
- My implementation:
4.2 Apple
- Design for spatial input 2023.
- eye tracking -> target. tap finger -> select. flick finger -> scroll.
- complete hand tracking can be processed in some cases.
- Detect Body and Hand Pose with Vision 2020: detecting other people’s poses.
4.3 PICO
PICO Centaur (optical tracking + bare-hand recognition) 2023; LED + AI hand tracking + IMU.
- HaMuCo hand tracking 2023.
- self-supervised from multi-view pseudo 2D labels.
- a cross-view network following multiple single-image networks to merge multi-view results (designed for a 4-camera VR system).
- Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from a Single RGB Image 2023, for two hands reconstruction.
- Reconstructing Interacting Hands with Interaction Prior from Monocular Images 2023, for two hands reconstruction.
- Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling.
- XR body recovery.
4.4 Infrared Papers
A comparative analysis of localization algorithms for visible light communication 2021.
Light-based indoor positioning systems: A review 2020
- LED-based methods. Data packets are transmitted through the optical channel using a modulation method (e.g., on-off keying: high-frequency switching of the LEDs).
- Multiplexing to distinguish different LEDs: Time / Frequency / Orthogonal Frequency / Wavelength.
- Positioning: Proximity / Signal Strength / Angle of Arrival / Time of Arrival.
- IR
- Oculus Rift DK2 2014: LEDs transmit their own IDs by on-off keying as a 10-bit data packet at 60Hz.
- Coded marker-based optical positioning systems.
Low-cost vision-based 6-DOF MAV localization using IR beacons 2013. Enumerates all possible 2D-3D matches, filters by a plane prior (the order around the centroid is kept), then solves the pose by PnP (a brute-force sketch follows at the end of this subsection).
PS Move API: A Cross-Platform 6DoF Tracking Framework 2013, with a more detailed version, Cross-Platform Tracking of a 6DoF Motion Controller 2012. Developed for the PS Move Motion Controller: single large LED blob tracking.
Kinectrack: Agile 6-DoF Tracking Using a Projected Dot Pattern 2012. Planar IR pattern: 4 points -> quads -> kites. Kites have a perspective-invariant signature, used for matching and pose computation.
Affordable infrared-optical pose-tracking for virtual and augmented reality 2007. Multi-view reconstruction, then 3D model fitting (maximum-clique search) to get the pose.
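A brute-force matching sketch in the spirit of the MAV-beacon paper above (the plane-prior pruning of hypotheses is omitted; `cv2.solvePnP` / `cv2.projectPoints` are standard OpenCV calls, the rest is my own scaffolding):

```python
import itertools
import numpy as np
import cv2

def match_beacons(pts2d, pts3d, K):
    """Try every 2D-3D assignment and keep the one with the lowest
    PnP reprojection error. pts2d: (N, 2) detected blob centers;
    pts3d: (N, 3) known beacon layout; K: 3x3 camera matrix.
    O(N!) without the paper's centroid-order prior, so only viable
    for small N."""
    best_err, best_pose = np.inf, None
    for perm in itertools.permutations(range(len(pts3d))):
        obj = pts3d[list(perm)].astype(np.float64)
        img = pts2d.astype(np.float64)
        ok, rvec, tvec = cv2.solvePnP(obj, img, K, None)
        if not ok:
            continue
        proj, _ = cv2.projectPoints(obj, rvec, tvec, K, None)
        err = np.linalg.norm(proj.reshape(-1, 2) - img, axis=1).mean()
        if err < best_err:
            best_err, best_pose = err, (rvec, tvec, perm)
    return best_pose, best_err
```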
4.5 Other Papers
Efficient 6-DoF Tracking of Handheld Objects from an Egocentric Viewpoint 2018. Image-based 3D position & 6-DoF pose.
- Dataset of hand-held objects; the dataset might be useful.
- Model based on the Single Shot Multibox Detector (SSD). Intuition: users’ hands and arms provide excellent context.
1 Euro Filter: A Simple Speed-based Low-pass Filter for Noisy Input in Interactive Systems 2012; see One Euro Filter for an implementation. Lower jitter at low speed, lower lag at high speed (a minimal implementation sketch follows at the end of this subsection).
\[\alpha = \frac{1}{1 + \frac{\tau}{T_{e}}}, \quad \tau = \frac{1}{2\pi f_{c}}, \quad f_{c} = f_{c_{min}} + \beta \| \dot{\hat{X_{i}}} \|\] \[\hat{X_{i}} = (X_{i} + \frac{\tau}{T_{e}} \hat{X_{i - 1}}) \frac{1}{1 + \frac{\tau}{T_{e}}}\]
Monado’s hand tracking, stream app:
- Post: machine learning hand pose; project gitlab. A multi-stage neural-network-based solution.
- Post: Bag of Freebies; pretrained model gitlab.
- Data augmentation + Noisy Student Training, a semi-supervised learning approach.
- Architecture inspired by YOLOv4.
- Post: Monado hand tracking:
- fits with the ethos of libsurvive (an open-source Lighthouse (inside-out) tracking system).
- using One Euro Filter.
- using MediaPipe: MediaPipe samples, MediaPipe c++.
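A minimal One Euro Filter sketch following the formulas above (scalar signal; parameter names follow the paper, the class layout is mine):

```python
import math

class OneEuroFilter:
    def __init__(self, freq, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.freq = freq              # sampling frequency 1 / Te
        self.min_cutoff = min_cutoff  # f_c_min
        self.beta = beta              # speed coefficient
        self.d_cutoff = d_cutoff      # cutoff for the derivative filter
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        te = 1.0 / self.freq
        return 1.0 / (1.0 + tau / te)

    def __call__(self, x):
        if self.x_prev is None:       # first sample passes through
            self.x_prev = x
            return x
        # low-pass the derivative of the signal
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # cutoff grows with speed: low jitter at rest, low lag in motion
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```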
5. Continuous-Time Batch Calibration
Calibrating the Extrinsics of Multiple IMUs and of Individual Axes 2016. Adds multiple IMUs based on the previous works.
Unified Temporal and Spatial Calibration for Multi-Sensor Systems 2013. Adds a timestamp parameter based on the previous work.
Continuous-Time Batch Estimation using Temporal Basis Functions 2012. My Notes.
Uses a series of B-splines to model the trajectory. Since a B-spline is continuous (if the degree is high enough), the trajectory is smooth, and derivatives w.r.t. time give acceleration and angular velocity. Form the optimization problem with:
- map point observations.
- IMU measurements: 2nd derivative of position, and 1st derivative of rotation.
- control input constraints.
General Matrix Representations for B-Splines 1998. Used in the papers above to generate B-splines; a cubic-spline evaluation sketch follows below.
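A sketch of the matrix form for a uniform cubic B-spline trajectory, with analytic velocity and acceleration (translation only; rotation needs an SO(3)/cumulative spline instead; the helper names are mine):

```python
import numpy as np

# Uniform cubic B-spline basis matrix (Qin 1998):
# p(u) = [1, u, u^2, u^3] @ M @ [P_i, P_{i+1}, P_{i+2}, P_{i+3}]
M = (1.0 / 6.0) * np.array([[ 1,  4,  1, 0],
                            [-3,  0,  3, 0],
                            [ 3, -6,  3, 0],
                            [-1,  3, -3, 1]])

def spline_eval(ctrl_pts, t, dt):
    """ctrl_pts: (N, 3) control points spaced dt seconds apart.
    Returns position, velocity, acceleration at time t; the
    acceleration is what gets compared against (gravity-corrected)
    accelerometer measurements in the calibration cost."""
    i = int(t / dt)                # segment index (needs i + 4 <= N)
    u = t / dt - i                 # normalized time inside the segment
    P = ctrl_pts[i:i + 4]          # (4, 3) local control points
    pos = np.array([1, u, u**2, u**3]) @ M @ P
    vel = np.array([0, 1, 2*u, 3*u**2]) @ M @ P / dt
    acc = np.array([0, 0, 2, 6*u]) @ M @ P / dt**2
    return pos, vel, acc
```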
6. Image-based Rendering - MPIs
Some References:
Implicit Representations (Light Field - Plenoptic Function): using the position & direction of each pixel (5-dim) to get its color, depth, and other meta-information. My Neural Rendering Notes.
- Light fields are traditionally stored as a grid of images or videos. Holographic Stereograms: a 4D light field embedded in 2D sensors (~ fly eyes). Light Fields 101 - SVVR 2016: light fields can produce better VR images than ray tracing.
- The Plenoptic Function and the Elements of Early Vision 1991
- The Lumigraph 1996, Light Field Rendering 1996. 4D representation (since the cameras sit in a plane): (s, t) ~ position, (u, v) ~ direction.
- Dynamically Reparameterized Light Fields 2000, video explain, video demo.
- Plenopticam 2019, github.
- Light Field Camera Lytro.
- Light Field Networks & NERF method to render new views.
- Light Field: directly predict colors from light rays. Deep Blending 2018, Free View Synthesis 2020.
- NeRF: performs volume rendering (integration along the ray).
Layered Representations:
- Depth - Interpolation of RGBD images:
- Apple View Interpolation for Image Synthesis 1993, similar to image morphing.
- (1) establish the correspondence between two images (the hard part); (2) use the mapping to interpolate the shape of each image toward the other (~ cv::remap).
- this paper uses the camera transformation and image range data to automatically determine the correspondence.
- quadtree block compression of pixels for parallel processing.
- Layered Depth Image 1998
- Sprites with Depth: overlapping depth images.
- Virtual Viewpoint Video 2004, render bullet time video.
- extends boundaries to create a better (blending) effect.
- Aspen Movie Map (1978)
- Apple QuickTime VR – An Image-Based Approach to Virtual Environment Navigation 1995. 360-video-based image walkthrough, while the viewpoint is fixed.
Multi-Plane Images (MPIs):
- Method python implementation:
- warping : homography.
- compositing of layers (1 for the furthest, k for the closest): \(I = \sum_{i=1}^{k}(c_{i}\alpha_{i}\prod_{j=i+1}^{k}(1-\alpha_{j}))\), \(D = \sum_{i=1}^{k}(d_{i}^{-1}\alpha_{i}\prod_{j=i+1}^{k}(1-\alpha_{j}))\) (see the compositing sketch after this list).
- Multiplane Camera 1937
- Stereo Matching with Transparency and Matting 1998
- Crowdsampling The Plenoptic Function 2020, Deep Multi-plane Images. RGBA, plus a learnable latent feature vector (for time). Rendering is fast. Produces more stable results compared to NeRF in the Wild.
- Stereo Magnification: Learning View Synthesis using Multiplane Images 2018, MPIs with stereo input.
- Single-view view synthesis with multiplane images 2020 (32 layers), github. Predicts the multi-plane images from a single image; trained using COLMAP sparse point clouds and target images (from online videos).
- Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images 2022 (8-64 layers, pretrained 32 & 64 are available). Trained on an in-the-wild dataset (COCO) (with mono-depth warped images).
- MPI over-parameterization problem: use an encoder-decoder architecture.
- Suboptimal depth problem: apply inter-plane interaction.
- SynSin: End-to-end View Synthesis from a Single Image 2019, with depth features and a network to merge images.
- DeepView: View Synthesis with Learned Gradient Descent 2019, multi-view to MPIs. Too hard to train; apparently shelved by Google.
- MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images 2020, github. Converts stereo 360 to MPIs.
- MINE: Towards Continuous Depth MPI with NeRF for Novel View Synthesis 2021, multi-plane volume rendering.
- NeX: Real-time View Synthesis with Neural Basis Expansion 2021 (192 layers, with 16 texture images), parameterizing each pixel as a linear combination of basis functions (based on view angle) learned from a neural network.
- 192 layers with 16 texture images: too much memory.
- a 17-image scene took 18h to train; the slow training limits its use cases.
- Real-Time Neural Character Rendering with Pose-Guided Multiplane Images 2022, uses the image-to-image translation paradigm.
- Apple Generative Multiplane Images 2022 (32 layers), but only has a pre-trained model for a face dataset. (Apple might use this for Vision Pro 3D photos.)
- Structural Multiplane Image 2023, planes generated from a planar 3D reconstruction of the scene.
- since planes could intersect, the rendering order must be computed per pixel - slow.
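A small sketch of the back-to-front "over" compositing formulas listed at the top of this list (array shapes are my choice):

```python
import numpy as np

def composite_mpi(colors, alphas, depths):
    """colors: (K, H, W, 3); alphas: (K, H, W); depths: (K,).
    Layer 1 is the furthest and layer K the closest, as in the
    formulas above. Returns the composited image and disparity."""
    K, H, W = alphas.shape
    img = np.zeros((H, W, 3))
    disp = np.zeros((H, W))
    trans = np.ones((H, W))           # running prod_{j>i} (1 - alpha_j)
    for i in range(K - 1, -1, -1):    # from the closest layer down
        w = alphas[i] * trans         # alpha_i * prod_{j>i} (1 - alpha_j)
        img += w[..., None] * colors[i]
        disp += w / depths[i]         # d_i^{-1}, weighted the same way
        trans *= (1.0 - alphas[i])
    return img, disp
```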
MPIs final choice: Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images 2022, our version (single-view view synthesis with RGBD, trained on COCO). Could run on VR & phone.
- Uses RGBD as input; predicts a density \(\sigma\) for each plane instead of an alpha \(\alpha\) (see the conversion sketch at the end of this section).
- Plane Adjustment Network: arranges each MPI plane at an appropriate (pre-defined) depth to represent the scene.
- Radiance Prediction Network: predicts the color \(c_{i}\) and density \(\sigma_{i}\) for each plane at depth \(d_{i}\).
- Trained using a single image: supervised by RGBD warping + a hole-filling network.
- TODO: supervision by YouTube videos.
- TODO: single-view 3D Gaussian splatting might help?
- Implementation (Phone version & Pico version) of an OpenGL ES shader-based MPI visualizer.
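Since the chosen method predicts densities rather than alphas, the planes can be converted with the usual discrete volume-rendering relation \(\alpha_{i} = 1 - e^{-\sigma_{i}\delta_{i}}\) (a sketch of my understanding, with \(\delta_{i}\) the inter-plane spacing; this is how a MINE-style density maps onto the alpha compositor above):

```python
import numpy as np

def density_to_alpha(sigma, depths):
    """sigma: (K, H, W) per-plane densities; depths: (K,) plane depths.
    Returns (K, H, W) alphas usable by composite_mpi above."""
    deltas = np.abs(np.diff(depths, append=depths[-1]))  # plane spacings
    deltas[-1] = deltas[-2]   # pad the last plane with its neighbor's gap
    return 1.0 - np.exp(-sigma * deltas[:, None, None])
```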
7. ICCV 23
ICCV’23 Robot Learning & SLAM Workshop
Marc Pollefeys: Visual Localization and Mapping - from classical to modern SfM & visual localization. 3DV 2024.
- Point Features:
- PixLoc 2021: end-to-end learning from a pose loss.
- Pixel-Perfect SFM 2021: refines 2D feature positions with dense NN descriptors.
- LightGlue 2023
- Privacy-Preserving Geometric Computer Vision.
- Line Features: DeepLSD 2023, GlueStick 2023 -> LIMAP 2023.
- LaMAR 2022 AR Benchmarking.
Maurice Fallon: Robust Multi-Sensor SLAM with Learning and Sensor Fusion. A 3-camera + lidar system.
- Lidar-Visual Odometry:
- VILENS 2021: joint optimization of lidar, visual, and IMU residuals.
- Hilti-Oxford SLAM Dataset 2023.
- InstaLoc 2023: localization through dense lidar semantic instance matching.
- NavLive 2022
- Lidar Vision NeRF.
- Lidar-Camera Calibration: [Extrinsic Calibration of Camera to LIDAR using a Differentiable Checkerboard Model 2023].
- [SiLVR: Scalable Lidar-Visual Reconstruction with Neural Radiance Fields 2023]: NeRF + lidar depth + lidar normals.
- SLAM + LLMs: Language-EXtended Indoor SLAM (LEXIS) 2023, building semantically rich visual maps with LLMs, based on CLIP.
Luca Carlone: From SLAM to Spatial Perception. Hierarchical representations, certifiable algorithms, and self-supervised learning.
- Scene map: Kimera: Real-time Metric-Semantic SLAM 2021. 3D scene understanding: semantics (objects, agents, sounds, etc.), relations. Kimera-Multi 2023: multi-robot.
- Robustness: certifiable algorithms compute an estimate and either certify its optimality or detect failure. Kimera-RPGO.
- ROBIN 2023: based on graph theory, finds large sets of compatible measurements and prunes gross outliers (used in TEASER++ 2020).
- GNC + ADAPT 2021: graduated non-convexity (to reduce the non-convexity of the optimization).
- Certifiable Outlier-Robust Geometric Perception 2022: semidefinite moment relaxations.
- Self-supervised Learning for Certification.
Chen Wang: Imperative SLAM and PyPose Library for Robot Learning, Imperative SLAM 2023. Takes back-end optimization as a supervision signal for the front-end. PyPose.
Andrew Davison: Distributed Estimation and Learning for Robotics, see here for related lecture.
- Motivation for this line of thought: (1) Hardware: map the algorithm blocks to hardware; (2) multi-robot systems.
- Gaussian Belief Propagation.
- Robot Web.
- Multi-robot localization using Gaussian Belief Propagation.
- Multi-robot planning using Gaussian Belief Propagation.
Daniel Cremers: From Monocular SLAM to 3D Dynamic Scene Understanding.
- Novel Bundle Adjustment: Square Root BA 2021, Power BA 2023, github.
- Direct SLAM. LSD-SLAM, DSO, DMVIO, D3VO.
- Single Image Dense Reconstruction. MonoRec 2021, Density Fields for Single View 2023.
- Dynamic 3D scene understanding.
Tim Barfoot: Learning Perception Components for Long Term Path Following.
Shubham Tulsiani: Probabilistic Pose Prediction. Objective: 3D object reconstruction; pose estimation from few views. SfM (e.g., COLMAP) is not robust under sparse views. A data-driven learning method.
- Direct pose prediction (end-to-end) was tried: it failed! I think the problem might be with the pose representation, see Why NeRF works?
- RelPose++ 2023. Probabilistic pose prediction: predicts the distribution of poses through an energy-based model.
Ayoung Kim: Advancing SLAM with Learning. (1) Lines. Line descriptor: LineTR 2021; (2) DL + graph SLAM. Object SLAM: 6-DoF object pose estimation; (3) Thermal cameras.
Michael Kaess: Learning for Sonar and Radar SLAM. Cameras fail in underwater environments.
- Sonar: projection without elevation. Acoustic SfM. Epipolar contours. Acoustic bundle adjustment.
- Sonar Image Correspondence. DL method.
- Imaging Sonar Dense Reconstruction.
- Radar SLAM, which also provides Doppler velocity.
8. 3D Object Tracking
8.1 Traditional Methods
Region-based method: region segmentation + optimization. Uses color statistics to model the probability that a pixel belongs to the object or to the background. The object pose is then optimized to best explain the segmentation of the image.
- Pros & Cons:
- Pros: works for textureless objects; more robust.
- Cons: mostly computationally expensive; assumes objects are distinguishable from the background.
- Two-stage method: (1) segmentation finds the contour; (2) contour points are converted to rays (Plücker representation), and the rays are matched against the 3D object.
- One-stage method. PWP3D: Real-time Segmentation and Tracking of 3D Objects 2012: optimization of the pose based on the foreground-background field (using an SDF); similar to a direct method, but operating on the SDF field.
- problem definition: minimize the energy function w.r.t. pose, \(E(\Phi) = -\sum_{x\in \Omega} \log(H_{e}(\Phi)P_{f} + (1-H_{e}(\Phi))P_{b})\), with \(\Phi\) the SDF of the projected object (a sketch follows at the end of this region-based list).
- optimization (with a thorough evaluation of different choices):
- gradient descent: uses small steps (to avoid jumping over minima). The final choice.
- conjugate gradient: (1) uses the Hessian as a preconditioner; (2) occasionally evaluates the energy to check whether a reset to steepest descent is needed. A bit faster for translation, but slower for rotation (compared to gradient descent).
- Sparse method & 3DObjectTracking : DLR-RM:
- RBGT: A Sparse Gaussian Approach to Region-Based 6DoF Object Tracking 2020.
- Foreground/Background : color histogram.
- Sparse probabilistic model: correspondence lines following a Gaussian distribution.
- Optimization using second-order Newton steps with Tikhonov regularization.
- SRT3D: A Sparse Region-Based 3D Object Tracking Approach for the Real World 2021, github. Adds a global and local optimization.
- ICG - Iterative Corresponding Geometry: Fusing Region and Depth for Highly Efficient 3D Tracking of Textureless Objects 2022: merges region-based and depth-based methods (100 Hz on a CPU).
- Sparse viewpoint model: contour points and surface points from pre-rendered viewpoints.
- Region Modality : following previous methods.
- Depth Modality : point-to-plane ICP.
- ICG+ - Fusing Visual Appearance and Geometry for Multi-modality 6DoF Object Tracking 2023. Adds texture information to the previous version: minimizes reprojection errors between points from the current image and keyframes.
- Mb-ICG - A Multi-body Tracking Framework - From Rigid Objects to Kinematic Structures 2023. Multi-body (jointly connected robot) tracking using ICG+, with an optimization framework combining Newton optimization with body Jacobians.
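A sketch of the PWP3D energy referenced above, assuming \(\Phi < 0\) inside the projected object (flip the sign if your SDF convention differs; the Heaviside smoothing width `s` is a hyper-parameter I picked):

```python
import numpy as np

def pwp3d_energy(phi, P_f, P_b, s=1.2):
    """phi: (H, W) SDF of the projected object (negative inside);
    P_f, P_b: (H, W) per-pixel foreground / background probabilities
    from the color histograms. Returns the scalar energy to minimize."""
    # smoothed Heaviside: ~1 inside the object, ~0 outside
    He = (-np.arctan(s * phi) + np.pi / 2.0) / np.pi
    return -np.sum(np.log(He * P_f + (1.0 - He) * P_b + 1e-12))
```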
Depth-based method: minimizes the distance between the surface of a 3D model and measurements from a depth camera.
- Pros & Cons:
- Cons: Depth sensor is required.
- (1) point-to-plane ICP based; (2) SDF based; (3) particle filters, Gaussian filters (a point-to-plane ICP sketch follows below).
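For reference, a single Gauss-Newton step of point-to-plane ICP on already-matched pairs (correspondence search omitted; small-angle linearization; my own scaffolding, not from any of the cited trackers):

```python
import numpy as np

def point_to_plane_step(src, dst, normals):
    """src, dst: (N, 3) matched points; normals: (N, 3) at dst.
    Residual per pair: n^T (R src + t - dst), linearized around
    R ~ I + [w]_x. Returns the 6-vector update [w, t]."""
    J = np.hstack([np.cross(src, normals), normals])  # (N, 6) Jacobian
    r = np.sum((src - dst) * normals, axis=1)         # (N,) residuals
    dx = np.linalg.solve(J.T @ J, -J.T @ r)
    return dx  # apply as R <- exp([w]_x) R, t <- t + dt, then re-match
```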
Keypoint-based method: image feature extraction and matching.
- Pros & Cons:
- Cons: needs texture; computationally heavy.
- SIFT, BRISK, LIFT, SuperGlue, etc.
Edge-based method
- Pros & Cons:
- Cons: cannot handle image blur; struggles with texture and background clutter.
- Combining 3D Model Contour Energy and Keypoints for Object Tracking 2018: (1) initial pose from a Kanade–Lucas–Tomasi (KLT) tracker; (2) pose refinement using a contour energy function (with the Basin-Hopping stochastic algorithm), maximizing the image gradient along the projected contours (outer contours & sharp edges).
- Pixel-Wise Weighted Region-Based 3D Object Tracking using Contour Constraints 2021, github. Projects the contour with the initial pose, then checks the foreground-background probability along the normal.
Direct method
- Pros & Cons:
- Cons: needs texture; needs a perfect 3D model; has a smaller basin of convergence and is less robust to illumination changes.
- A Direct Method for Robust Model-Based 3D Object Tracking from a Monocular RGB Image 2016. Directly aligns image intensities.
- DSO 2018.
- My implementation :
8.2 Deep Learning Methods
6DoF Pose Estimation.
- DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion 2019. RGBD-based, combined with PointNet.
- 6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints 2019, RGBD 3D keypoints tracking.
- CenterPose: Single-Stage Keypoint-based Category-level Object Pose Estimation from an RGB Image 2022 (NVIDIA). Point-based structure representation (similar to a 3D bounding box), using ConvGRU feature association.
- OnePose++: Keypoint-Free One-Shot Object Pose Estimation without CAD Models 2022 (object SfM). (1) Extraction of 3D feature points from the object (built with SfM); (2) 2D-3D dense feature matching with a GNN; (3) PnP pose estimation.
With Tracking.
- se(3)-TrackNet: Data-driven 6D Pose Tracking by Calibrating Image Residuals in Synthetic Domains 2020 (pure tracking). Predicts the relative pose between object renderings and subsequent images.
- PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking 2020 (NVIDIA).
- CenterPoseTrack: Keypoint-Based Category-Level Object Pose Tracking from an RGB Sequence with Uncertainty Estimation 2022 (NVIDIA). Rendered CenterPose predictions + the previous result -> Kalman filter + Bayesian filter -> verification (with a network).
- BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects 2023 (NVIDIA). Neural SFM + Neural SDF.