Visual-Inertial Odometry for Small-sized Robots

Event: 7th International Workshop on Visual Odometry and Computer Vision · Duration: 459 min

Abstract

This segment features four distinct talks from the 7th International Workshop on Visual Odometry and Computer Vision. The first talk introduces a weakly supervised deep visual odometry method that reduces reliance on extensive ground truth data. The second presents BAA-NGP, a novel approach for 3D object reconstruction and camera pose optimization using neural graphics primitives. The third details a camera motion estimation technique integrating RGB-D and inertial scene flow for autonomous navigation. The final talk emphasizes the importance of efficient computer vision for edge devices, outlining various optimization methods and the challenges posed by increasing model sizes and energy consumption.

This segment introduces a Visual-Inertial Odometry (VIO) method, CIP-VMobile, designed for resource-constrained devices like smartphones and robot canes. The method addresses the challenge of varying camera intrinsic parameters (CIP) due to optical image stabilization (OIS) by treating CIP as a state variable within a Graph SLAM framework and using an acceleration model for initial estimation. Experimental results demonstrate improved accuracy and computational efficiency compared to baseline VIO methods. The talk also explores an extended Visual-LiDAR-Inertial Odometry (VLIO) system that integrates LiDAR data, and discusses practical challenges in real-world navigation scenarios.

This segment introduces the concept of ‘Everyday Depth’ for embodied AI, focusing on extracting depth information from standard video processing with minimal extra computation. The speaker details a framework for human+AI collaboration, where human input is treated as a ‘hazy oracle’ to optimize AI inference and decision-making. The presentation covers analytical and learned solutions for object depth estimation, including the ODMS dataset for generating synthetic training data and the ODMD model for depth prediction from motion and detection. The work demonstrates fast, low-cost, and domain-agnostic depth inference, highlighting its potential for robotics applications like grasping and navigation.

This segment presents research on direct approaches to visual navigation, focusing on motion estimation and its applications in robotics. It covers classical motion estimation and visual SLAM, and introduces novel methods for efficient and minimal computational vision, particularly using event-based cameras. The talk delves into structure from motion and optical flow estimation, and addresses challenges like the aperture problem and bias in traditional methods. A significant portion is dedicated to the development and application of microsaccade-inspired event cameras (AMI-EV) to improve data quality and robustness in low-level tasks like feature tracking and motion segmentation and in high-level tasks such as human detection and motion estimation, especially in challenging environments and at high speeds.

This segment features two talks on advanced topics in computer vision and robotics. The first talk by Michael Gleicher introduces miniature Time of Flight (ToF) sensors for robot perception, detailing their operational principles, challenges in data interpretation, and advanced techniques like differentiable rendering to achieve accurate 3D reconstruction and surface property recovery. The second talk by Amit K Roy-Chowdhury focuses on scene understanding for safe and autonomous navigation, presenting methods for robust person detection, multimodal semantic segmentation, dynamic scene graphs, and decision-making in complex environments.

This segment introduces Amazon’s Dash Cart, a smart shopping cart equipped with sensors to enable a seamless shopping experience by skipping checkout lines. The core technical challenge addressed is mapping and real-time localization within large retail stores (up to 100k sq ft) using a low-cost, battery-powered system. The speaker details a decoupled approach for offline map generation via SLAM and real-time relocalization using GPU-optimized image matching, along with a method to infer product locations based on shopper purchase data.
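
The decoupled design described for the Dash Cart (a map built offline, then relocalization by matching a live image against it) can be illustrated with a toy sketch. This is not the talk's pipeline, which this summary does not specify: the global descriptor vectors, the `MAP_KEYFRAMES` data, the `(x, y, heading)` pose format, and the `min_score` threshold are all hypothetical, and plain cosine similarity on the CPU stands in for the GPU-optimized matcher.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relocalize(query, keyframes, min_score=0.8):
    """Return (pose, score) of the best-matching map keyframe.

    The pose is None when even the best match scores below min_score,
    which signals that relocalization failed for this frame.
    """
    best_pose, best_score = None, -1.0
    for descriptor, pose in keyframes:
        score = cosine_similarity(query, descriptor)
        if score > best_score:
            best_pose, best_score = pose, score
    if best_score < min_score:
        return None, best_score
    return best_pose, best_score


# Hypothetical offline map: (image descriptor, (x, y, heading)) per keyframe.
MAP_KEYFRAMES = [
    ([1.0, 0.0, 0.0], (2.0, 5.0, 0.0)),
    ([0.0, 1.0, 0.0], (10.0, 5.0, 1.57)),
    ([0.0, 0.0, 1.0], (10.0, 20.0, 3.14)),
]

# Hypothetical descriptor of the live camera frame.
pose, score = relocalize([0.1, 0.9, 0.05], MAP_KEYFRAMES)
```

Keeping the expensive map construction offline means the cart only has to run the matching step at shopping time, which is the property the low-cost, battery-powered constraint calls for.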

Speakers

Lars Hinneburg — Spleenlab GmbH
Sainan Liu — Intel Labs
Samuel Cerezo — Universidad Zaragoza
Yung-Hsiang Lu — Purdue University
Cang Ye — Professor, Department of Computer Science, Virginia Commonwealth University
Jason J. Corso — Professor of Robotics and EECS, University of Michigan and Chief Science Officer, Voxel51, Inc.
Cornelia Fermüller — Computer Vision Laboratory, UMIACS, University of Maryland
Michael Gleicher — Department of Computer Sciences, University of Wisconsin Madison
Amit K Roy-Chowdhury — Video Computing Group, Center for Robotics & Intelligent Systems, UC Riverside
Vinod Kulathumani — Amazon

Talks (13)

00:00:00 — Lars Hinneburg: Weakly Supervised End-to-End Deep Visual Odometry
- This talk presents a weakly supervised approach for deep visual odometry that reduces dependence on high-quality ground truth data by using estimated optical flow and RGB images, achieving state-of-the-art performance on the KITTI dataset.
00:11:54 — Sainan Liu: BAA-NGP: Bundle-Adjusting Accelerated Neural Graphics Primitives
- This talk introduces BAA-NGP, a novel method that combines BARF and Instant-NGP with NeRFACC sampling to simultaneously optimize camera poses and reconstruct 3D objects, achieving faster training times and supporting wider use cases compared to existing methods.
00:34:00 — Samuel Cerezo: Camera Motion Estimation from RGB-D-Inertial Scene Flow
- This talk presents a novel camera motion estimation method that leverages RGB-D-Inertial scene flow, combining visual and inertial data with a compact optimization approach to achieve accurate and robust pose estimation for autonomous navigation.
00:54:45 — Yung-Hsiang Lu: Efficient Computer Vision for Edge Devices
- This talk discusses the critical need for efficient computer vision on edge devices due to energy constraints, network limitations, and privacy concerns, and introduces various methods like quantization, pruning, and hierarchical neural networks to achieve this efficiency.
01:16:27 — Cang Ye: Visual-Inertial Odometry for Small-sized Robots
- This part introduces the motivation for visual-inertial odometry (VIO) in resource-constrained devices like smartphones and robot canes, highlighting the challenges posed by varying camera intrinsic parameters (CIP) due to optical image stabilization (OIS) in modern phone cameras. It presents a proposed Graph SLAM method, CIP-VMobile, which treats CIP as a state variable and uses a linear CIP-acceleration model for initial estimation.
01:28:59 — Cang Ye: Extended Version: Visual-LiDAR-Inertial Odometry
- This part discusses extending the VIO method to incorporate LiDAR data, creating a Visual-LiDAR-Inertial Odometry (VLIO) system. It details the characterization of LiDAR measurement errors (range, mixed pixels) and how these are handled through data filtering and a hybrid PnP method for reprojection error. Experimental results show improved accuracy with VLIO compared to VINS-RGBD.
01:39:24 — Cang Ye: Issues Related to Real-world Navigation Scenarios
- This part addresses practical challenges in real-world navigation, such as large-scale indoor environments requiring loop-closure detection, scene/object recognition to limit pose error, handling dynamic objects using semantic information and RANSAC, and the significant challenge posed by reflective surfaces for visual methods.
02:32:59 — Jason J. Corso: Everyday Depth: Towards Embodied AI with Depth from Standard Video Processing
- This talk explores the evolving dynamics of human+AI collaboration, focusing on the concept of the human as a ‘hazy oracle’ rather than an infallible source. It outlines the journey of integrating AI systems more deeply into practical applications through human+AI cooperation, discussing the potential value and challenges. The discussion includes the modeling of interaction errors and the strategic choices between immediate AI inference or seeking additional human input, supported by results from a user study on optimizing these collaborations.
03:49:22 — Cornelia Fermüller: Direct Approaches to Visual Navigation
- Introduction to direct approaches for visual navigation, covering classical motion estimation and visual SLAM.
03:54:28 — Cornelia Fermüller: Event Surfaces as a Geometrical Problem
- Conceptualization of event data as geometric surfaces in 3D space-time for motion analysis.
05:06:13 — Michael Gleicher: Miniature Time of Flight Sensors for Robot Perception
- This talk explores the use of small, inexpensive Time of Flight (ToF) sensors (SPADs) for robot perception, detailing their operation, challenges in interpreting their data, and advanced techniques like differentiable rendering to achieve accurate 3D reconstruction and surface property recovery.
05:11:17 — Amit K Roy-Chowdhury: Scene Understanding for Safe and Autonomous Navigation
- This talk presents a comprehensive approach to scene understanding for autonomous navigation, covering robust person detection and representation, multimodal semantic segmentation, dynamic scene graphs, and decision-making in complex, dynamic environments.
06:22:16 — Vinod Kulathumani: Mapping and Localization in Large Scale Retail Environments
- This talk presents Amazon’s Dash Cart technology, focusing on mapping, real-time localization, and product location inference within large-scale retail environments to enable customer-facing applications.
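
The linear CIP-acceleration model mentioned in Cang Ye's talk can be pictured with a minimal sketch. The summary only states that the model is linear and is used for initial estimation, so everything specific here is an assumption: the choice of the principal point x-coordinate as the intrinsic parameter, the calibration pairs in `ACCELS` and `SHIFTS`, and the helper names `fit_linear` and `predict_principal_point`.

```python
def fit_linear(accels, offsets):
    """Ordinary least squares fit of offset = slope * accel + intercept."""
    n = len(accels)
    if n < 2 or n != len(offsets):
        raise ValueError("need at least two paired samples")
    mean_a = sum(accels) / n
    mean_o = sum(offsets) / n
    var_a = sum((a - mean_a) ** 2 for a in accels)
    if var_a == 0.0:
        raise ValueError("acceleration samples must not all be equal")
    cov = sum((a - mean_a) * (o - mean_o) for a, o in zip(accels, offsets))
    slope = cov / var_a
    intercept = mean_o - slope * mean_a
    return slope, intercept


def predict_principal_point(cx_nominal, accel_x, slope, intercept):
    """Initial guess for the principal point given the measured acceleration."""
    return cx_nominal + slope * accel_x + intercept


# Hypothetical calibration data: lateral acceleration (m/s^2) against the
# observed principal-point shift (pixels) caused by the OIS lens movement.
ACCELS = [-2.0, -1.0, 0.0, 1.0, 2.0]
SHIFTS = [-3.0, -1.5, 0.0, 1.5, 3.0]

slope, intercept = fit_linear(ACCELS, SHIFTS)

# Initial estimate for a frame captured under 0.5 m/s^2 of acceleration,
# assuming a nominal principal point of 320 px.
cx0 = predict_principal_point(320.0, 0.5, slope, intercept)
```

In the method as summarized, a value like `cx0` would only seed the optimization; the Graph SLAM back end then refines the CIP as a state variable alongside the pose.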

Key Takeaways

Weakly supervised visual odometry can achieve state-of-the-art performance on real-world datasets like KITTI by leveraging estimated optical flow and RGB images, reducing the need for high-quality ground truth data.
Combining advanced NeRF techniques (BARF, Instant-NGP, NeRFACC) can lead to significantly faster training and improved performance in 3D reconstruction and camera pose estimation, supporting broader applications.
Accurate camera motion estimation is crucial for autonomous navigation, and integrating RGB-D and IMU data with scene flow provides a robust and precise solution, especially when using compact optimization techniques like marginalization.
Efficient computer vision on edge devices is vital due to energy, network, and privacy constraints. Various methods like quantization, pruning, filter compression, network architecture search, and knowledge distillation can significantly reduce model size and computational cost while maintaining accuracy.
Varying Camera Intrinsic Parameters (CIP) due to Optical Image Stabilization (OIS) in smartphone cameras significantly impact VIO accuracy and must be explicitly modeled.
The proposed CIP-VMobile method, a Graph SLAM approach, effectively integrates CIP as a state variable, leading to improved pose estimation accuracy and computational efficiency.
Integrating LiDAR data into VIO (VLIO) further enhances performance, especially when addressing LiDAR-specific challenges like measurement characteristics (range, mixed pixels) and using robust data filtering.
Real-world VIO applications face significant challenges including large-scale indoor environments, dynamic objects, and reflective surfaces, requiring advanced techniques like loop-closure detection, object recognition, and robust feature handling.
The Everyday Depth research aims to enrich physical platforms with additional information about the environment at minimal extra processing cost by reusing existing detection and segmentation signals.
The proposed methods, including ODMD, offer fast inference and low-cost training data that generalizes across domains, providing depth estimation ‘for free’ from motion and detection/segmentation.
The human+AI collaboration framework addresses the challenge of integrating human and AI systems, particularly in scenarios where human input acts as a ‘hazy oracle,’ by modeling interaction errors and optimizing the choice between immediate AI inference and requesting more human input.
The ODMS dataset and trained models are available on GitHub, and the work has been benchmarked across various tasks and conditions, demonstrating robust performance.
Traditional motion estimation and SLAM approaches face challenges with fast motion, computational cost, and moving objects, necessitating new approaches for robust visual navigation.
Event-based cameras, particularly microsaccade-inspired designs (AMI-EV), offer significant advantages in high dynamic range, temporal resolution, and low latency, which are crucial for improving performance in various visual tasks for robotics.
Deep learning frameworks can be effectively used with event-based data to estimate optical flow, depth, and ego-motion, even in unsupervised settings and challenging conditions, leading to more robust and efficient robot control.
The proposed AMI-EV system demonstrates superior performance in low-level tasks like feature tracking and motion segmentation, and high-level tasks such as human detection, compared to conventional cameras and standard event cameras, especially in complex and high-speed scenarios.
Miniature Time of Flight (ToF) sensors, despite their low resolution and inherent ambiguities, can be effectively used for robot perception through careful modeling and advanced data processing techniques.
Directly utilizing transient histograms from ToF sensors, rather than relying on internal distance estimates, provides richer data for more accurate scene understanding.
Differentiable render-and-compare allows robust recovery of 3D geometry and surface properties (like albedo) by optimizing scene parameters until the simulated sensor data matches the captured sensor data.
Robust person detection and representation, especially under occlusion, is crucial for autonomous navigation and can be improved by combining pose estimation and segmentation models in a self-supervised framework like POISE.
Amazon’s Dash Cart utilizes mapping and localization to enable advanced customer-facing applications like product search, navigation, and location-aware recommendations in large retail stores.
The system employs a decoupled approach: offline map generation using SLAM (combining RGB, depth, and IMU data) and real-time relocalization via GPU-optimized RGB image matching.
Robustness at scale is achieved through dynamic failure detection and recovery strategies for SLAM, initiating new sessions and merging them to handle tracking errors in challenging environments.
Product locations (planograms) are inferred dynamically over time by clustering shopper purchase locations, providing an automated solution for maintaining up-to-date product maps.
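
The last takeaway, inferring a product's location by clustering where shoppers picked it up, can be sketched as follows. The clustering algorithm used in the talk is not named in this summary, so a simple greedy radius clustering is swapped in here; the `radius` value, the coordinate units, and the sample purchase points are hypothetical.

```python
import math


def largest_cluster_centroid(points, radius=2.0):
    """Greedy radius clustering; return the centroid of the largest cluster.

    Each point joins the first cluster whose current centroid lies within
    `radius`, otherwise it starts a new cluster. Isolated points (for
    example a localization error at purchase time) end up in small
    clusters and are ignored.
    """
    if not points:
        raise ValueError("need at least one purchase location")
    clusters = []
    for px, py in points:
        for cluster in clusters:
            cx = sum(q[0] for q in cluster) / len(cluster)
            cy = sum(q[1] for q in cluster) / len(cluster)
            if math.hypot(px - cx, py - cy) <= radius:
                cluster.append((px, py))
                break
        else:
            clusters.append([(px, py)])
    best = max(clusters, key=len)
    return (
        sum(q[0] for q in best) / len(best),
        sum(q[1] for q in best) / len(best),
    )


# Hypothetical cart positions (store coordinates) recorded when one
# product was added to a cart; the last point is an outlier.
PURCHASES = [(10.0, 4.0), (10.5, 4.2), (9.8, 3.9), (30.0, 12.0)]

x, y = largest_cluster_centroid(PURCHASES)
```

Re-running this as new purchases arrive lets the estimate follow a product when it is moved to a different shelf, which matches the summary's point that locations are inferred dynamically over time.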

Methods / Models / Datasets Mentioned

AMI-EV
AlexNet
BAA-NGP
BARF
BERT
BiLevelOpt
BoxLS
CIP-VMobile
DBox
DPVO
DSO
DeepL_UnDepVO
DeepLabV3
DeepVO
DiffPoseNet
Differentiable Render-and-Compare
EV-IMO
EVDodge
EVIMO2
Faster R-CNN
FBNet
FeatDepth
GELU
GLaM
GMFlow
GPT-2
GPT-3
Gaussian Mixture Models
GeoNet
Geometric Calibration of Single-Pixel Distance Sensors
Graph SLAM
Hybrid PnP method
ICP
Instant-NGP
Kalman filter
LLaMA
LSD-SLAM
LaMDA
Laser-Beam Model
LeNet 300-100
LeNet-5
Linear CIP-acceleration model
LiteFlowNet
Loop Closure Detection
MnasNet
Monodepth2
NFlowNet
NeRF
NeRFACC
Neural Occupancy Field
ODMD
ODMS
ODN_d
ODN_l
ODN_n
ORB-SLAM
OpenMVG
POISE (Pose Guided Silhouette Estimation)
PWC-Net
Pose2Sil
ProxylessNAS
RANSAC
ReLU
S-EV
SC-SfMLearner
SIFT
SLAM
SPADs (Single Photon Avalanche Diodes)
SelFlow
TartanAir Dataset
TartanVO
Transient Histograms
Unsupervised Domain Adaptation (UDA)
VGG-16
VGG-144
VINS-Mobile
VINS-RGBD
VISO2-M
VMobile
VOS-DE
Vision-only SfM
Visual Odometry
Visual-Inertial Alignment

Topics

3D Reconstruction · Autonomous Navigation · Camera Intrinsic Parameters (CIP) · Camera Motion Estimation · Computer Vision · Dash Cart · Deep Learning · Depth Estimation · Differentiable Rendering · Edge Computing · Embodied AI · Event-based Cameras · Feature Tracking · Graph SLAM · Hierarchical Neural Networks · Human Detection · Human-AI Collaboration · Image Matching · Knowledge Distillation · LiDAR Integration · Localization · Mapping · Microsaccades · Model Optimization · Motion Estimation · Motion Segmentation · Neural Radiance Fields (NeRF) · Object Detection · Object Segmentation · Optical Flow · Optical Image Stabilization (OIS) · Person Detection · Pose Estimation · Product Localization · Pruning · Quantization · Real-time Systems · Real-world Challenges · Retail Environments · Robot Cane Navigation · Robot Perception · Robotics · SLAM · Scene Flow · Scene Understanding · Semantic Segmentation · Sim2Real · Standard Video Processing · Time of Flight Sensors · Visual Navigation · Visual Odometry · Visual-Inertial Odometry (VIO) · Weakly Supervised Learning
