All you need to know about self-driving: Intro to Self-Driving

Event: CVPR 2024 Tutorial · Duration: 512 min

Abstract

This segment provides a comprehensive introduction to self-driving technology, starting with an overview of Waabi, a company specializing in autonomous trucks, and their recent significant funding round. It delves into the fundamental components of an autonomy stack, highlighting the challenges of autonomous driving through various complex scenarios. The presentation then critiques traditional modular and emerging end-to-end AI-first approaches, proposing Waabi’s interpretable, scalable, and AI-first solution centered around a foundation model. The segment also covers the critical role of simulation in development and deployment, followed by a detailed discussion on hardware and sensor configurations, including GNSS, LiDAR, RADAR, and various camera types, along with their respective strengths and limitations. Finally, it explores 3D perception tasks, different data representations for LiDAR and cameras, and various fusion strategies, emphasizing the crucial role of high-definition maps in providing rich prior information for robust perception.
This segment covers various aspects of scene understanding for autonomous driving, focusing on HD maps, object detection, and the role of memory. It explores different representations for HD maps, including ground height, raster, and lane graphs, and discusses their strengths and weaknesses. The segment then delves into object detection frameworks, training methodologies, and the concept of occupancy (explicit and implicit) as an alternative output representation. A significant portion is dedicated to the importance of memory in perception, differentiating between multi-object tracking, object-level feature memory, and scene-level feature memory. Finally, it addresses the challenge of perceiving unknown objects through open-set segmentation, unsupervised detection, and language-meets-perception approaches, concluding with self-supervised implicit occupancy.
This segment delves into the core components of motion planning and controls for self-driving vehicles. It begins by outlining various input representations, including HD Maps and vehicle intention, and then explores different output formats for motion planners, from direct actuation to cost volumes and auxiliary objectives. The discussion transitions into learning paradigms, contrasting open-loop and closed-loop approaches, and examining how techniques like behavior cloning, DAgger, policy distillation, and reinforcement learning are applied. A significant portion is dedicated to interpretability, showcasing models like Neural Motion Planner, DSDNet, P3, QuAD, and MP3, which integrate perception, prediction, and planning for enhanced transparency. Finally, the segment addresses the challenges of reactivity in uncertain environments and reviews control strategies such as Model Predictive Control and Feedforward/Feedback control, emphasizing the role of actuator characterization.
This segment, presented by Andrei Bârsan, provides a comprehensive overview of intelligent data mining for self-driving applications. It delves into the critical role of datasets in machine learning development and highlights the challenges of traditional data collection and labeling methods. The talk focuses on advanced data curation strategies, including rules-based and learned tagging, active learning for model improvement, and techniques to ensure data diversity and coverage, especially for capturing the “long tail” of rare scenarios. Furthermore, it explores scalable labeling approaches such as offboard auto-labeling and unsupervised object discovery, emphasizing the use of vision-language models and self-training to reduce reliance on costly human annotations.
This segment provides a comprehensive overview of sensor simulation techniques for self-driving vehicles, covering LiDAR, Radar, and Camera modalities. It delves into various approaches, including physics-based rendering, data-driven methods, and advanced neural networks like NeRF and diffusion models, highlighting their respective advantages and limitations. The discussion also extends to vehicle platform modeling, detailing SDV dynamics and the importance of Hardware-in-Loop simulation for realistic closed-loop testing.
This segment delves into the critical aspect of localization for autonomous vehicles, explaining its fundamental role in ensuring safety and performance. It covers various methods for measuring localization accuracy, including Absolute Trajectory Error and Failure Rate, and discusses the challenges posed by dynamic environments, sensor noise, and degenerate geometry. The segment further explores different online and global localization techniques, highlighting their strengths and limitations, and concludes with an outlook on future research directions in this field.
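The two localization metrics named above can be made concrete with a small sketch. The Python below assumes the estimated and reference trajectories are already time-synchronized and expressed in the same frame. It treats failure rate as the fraction of poses whose position error exceeds a threshold. That definition, the function names, and the 2D simplification are assumptions made for illustration and are not taken from the tutorial.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def position_errors(estimated: Sequence[Point], reference: Sequence[Point]) -> list:
    """Per-pose Euclidean distance between estimated and reference positions."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    return [math.hypot(ex - rx, ey - ry)
            for (ex, ey), (rx, ry) in zip(estimated, reference)]


def absolute_trajectory_error(estimated: Sequence[Point], reference: Sequence[Point]) -> float:
    """Root-mean-square position error over the whole trajectory."""
    errors = position_errors(estimated, reference)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def failure_rate(estimated: Sequence[Point], reference: Sequence[Point],
                 threshold_m: float) -> float:
    """Fraction of poses whose error exceeds threshold_m (assumed definition)."""
    errors = position_errors(estimated, reference)
    return sum(e > threshold_m for e in errors) / len(errors)
```

Reporting both numbers is useful because an average-style metric such as ATE can hide a few large errors, while a thresholded failure rate exposes them.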

Speakers

Sergio Casas — Waabi
Andrei Bârsan — Waabi
Sean Segal
Joyce Yang
Nikita Dvornik
Raquel Urtasun
Siva Manivasagam — Waabi

Talks (16)

00:00:00 — Sergio Casas: Intro to Self-Driving
- Introduction to Waabi, its mission in self-driving trucks, recent funding, and an overview of the tutorial’s comprehensive scope beyond typical computer vision perspectives.
00:25:21 — Andrei Bârsan: Hardware and Sensor Configurations
- An overview of various sensor types (GNSS, LiDAR, RADAR, Cameras, Event Cameras) used in autonomous driving, their strengths, limitations, and the importance of robust mechanical design for integration.
00:57:00 — Sergio Casas: 3D Perception
- Detailed exploration of 3D perception tasks, LiDAR and camera data representations, multi-sensor fusion techniques, and the role of HD maps in providing rich priors for autonomous driving.
01:25:17 — Sergio Casas: HD Maps: Ground height
- Discusses ground height maps, their construction (RANSAC, FCN), and their benefits for perception (priors, easier localization) and challenges (map changes).
01:30:22 — Sergio Casas: Object Level Feature Memory
- Explains object-level feature memory as a more flexible alternative to tracking, where past object features are fed as input to update the model, offering efficiency but still being object-level restricted.
01:35:24 — Sergio Casas: Conclusion
- Summarizes the key takeaways from the presentation, emphasizing the importance of rich scene representations, uncertainty modeling, and alternative representations like occupancy for autonomous driving.
02:50:35 — Andrei Bârsan: HD Maps and Intention as Inputs
- Discusses HD Maps (Rasterization, Lane Graph, Affordances) and vehicle intention (routes, commands) as crucial inputs for motion planning, highlighting their pros and cons.
02:56:25 — Andrei Bârsan: Designing Interpretable End-to-End Motion Planners
- Details various interpretable motion planners like NMP, DSDNet, P3, Implicit Occupancy (QuAD), and MP3, focusing on how they integrate perception, prediction, and planning for safer and more understandable autonomous driving.
03:01:50 — Andrei Bârsan: Challenges and Solutions for Reactive Planning in Uncertain Environments
- Addresses challenges of planning in uncertain environments, including actor reactivity and unlikely events, and presents solutions like planning with reactive predictions and contingencies (LookOut).
04:15:52 — Andrei Bârsan: All you need to know about self-driving: Intelligent Data Mining
- This talk provides an overview of intelligent data mining for self-driving, covering data curation strategies (interestingness, model improvement, diversity) and scalable labeling techniques (auto-labeling, unsupervised discovery).
05:41:10 — Siva Manivasagam: LiDAR simulation: physics + data + neural nets
- Discussion of LiDAR simulation techniques, including ray casting, ML for noise, and generative models like LiDARDM.
05:46:22 — Siva Manivasagam: Camera simulation
- Introduction to camera simulation, briefly mentioning HD cameras and the image signal processing pipeline, and reviewing various simulation techniques.
05:51:35 — Siva Manivasagam: Camera simulation: neural radiance field (NeRF)
- Detailed compositional NeRF approaches for camera simulation in driving scenes, including UniSim, SUDS, MARS, StreetSurf, EmerNeRF, and Multi-level NSG.
05:56:35 — Siva Manivasagam: Summary of sensor simulation techniques
- Summary of sensor simulation techniques, highlighting the pros and cons of physics, data, and ML approaches.
06:01:35 — Siva Manivasagam: All you need to know about self-driving: Simulation
- Answer to a question about NeRF extrapolation for large deviations from the original trajectory.
07:06:27 — Andrei Bârsan: All you need to know about self-driving: Localization
- This segment provides a comprehensive overview of localization in self-driving cars, covering its importance, measurement metrics, challenges, and various online and global localization techniques.
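
The ground-height talk in the list above names RANSAC as one way to construct ground height maps. As a rough illustration of the idea, the sketch below fits a single plane to a 3D point cloud with RANSAC using only the standard library. Treating the ground as one plane, the function names, the iteration count, and the 0.1 m inlier threshold are all simplifying assumptions for this example and do not describe the method presented in the talk.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Plane = Tuple[float, float, float, float]  # (a, b, c, d): a*x + b*y + c*z + d = 0, unit normal


def plane_from_points(p: Point3, q: Point3, r: Point3) -> Optional[Plane]:
    """Plane through three points, or None if they are (nearly) collinear."""
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    # The normal is the cross product of the two edge vectors.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm < 1e-9:
        return None
    a, b, c = nx / norm, ny / norm, nz / norm
    return a, b, c, -(a * p[0] + b * p[1] + c * p[2])


def ransac_ground_plane(points: Sequence[Point3], iterations: int = 200,
                        inlier_threshold: float = 0.1,
                        seed: int = 0) -> Tuple[Optional[Plane], List[Point3]]:
    """Return the sampled plane with the most inliers, plus those inliers."""
    rng = random.Random(seed)
    pts = list(points)
    best_plane: Optional[Plane] = None
    best_inliers: List[Point3] = []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(pts, 3))
        if plane is None:
            continue  # degenerate sample, try again
        a, b, c, d = plane
        # Point-to-plane distance is |a*x + b*y + c*z + d| because the normal is unit length.
        inliers = [pt for pt in pts
                   if abs(a * pt[0] + b * pt[1] + c * pt[2] + d) <= inlier_threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

Real roads are not a single plane, so a practical pipeline would likely fit planes per local tile or use a learned model such as the FCN the talk also lists.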

Key Takeaways

Autonomous driving is a complex problem requiring robust solutions that go beyond traditional modular or purely end-to-end AI approaches, necessitating interpretable, scalable, and capital-efficient systems.
A diverse array of sensors (GNSS, LiDAR, RADAR, Cameras, Event Cameras) is crucial for comprehensive environmental perception, each with unique strengths and weaknesses that must be carefully considered for redundancy and reliability.
Simulation plays a pivotal role in the development and deployment of autonomous systems, allowing for extensive testing of rare and challenging scenarios that are difficult or dangerous to encounter in the real world.
Effective 3D perception relies on sophisticated data representations for LiDAR and cameras, coupled with advanced fusion techniques and the integration of high-definition maps to provide rich prior knowledge for robust and accurate environmental understanding.
HD maps provide crucial prior information for perception, with various representations offering different trade-offs in terms of semantic detail and computational efficiency.
Occupancy prediction, especially implicit models, offers a powerful and flexible way to represent the environment, naturally capturing complex shapes and uncertainty without relying on rigid bounding boxes.
Incorporating memory into perception systems, whether at the object or scene level, significantly enhances robustness to occlusion and allows for richer accumulation of sensor evidence over time.
Addressing unknown objects and open-set scenarios is critical for autonomous driving safety, with promising avenues in self-supervised learning and knowledge transfer from large language/vision models.
Various representations of HD Maps (Rasterization, Lane Graph, Affordances) offer different trade-offs in processing, receptive field, and prior incorporation for motion planning.
Motion planners can output direct actuation, trajectories, cost volumes, affordances, or auxiliary task predictions, each with distinct strengths in simplicity, interpretability, and prior knowledge integration.
Learning for motion planning involves open-loop (behavior cloning) and closed-loop (DAgger, policy distillation, RL) approaches, with closed-loop methods addressing distribution shift but requiring realistic simulation or interactive experts.
Interpretability in end-to-end motion planners can be achieved by predicting cost volumes, jointly reasoning about perception-prediction-planning, using scene-level occupancy representations, or leveraging implicit occupancy functions.
Data curation is crucial for self-driving ML, focusing on interesting, diverse, and model-improving data.
Traditional human labeling is expensive and unscalable, necessitating automated and intelligent labeling techniques.
Active learning and open-set tagging using vision-language models are promising for efficient data selection and handling novel scenarios.
Advanced auto-labeling pipelines, including those leveraging VLMs, can significantly reduce human intervention and improve training efficiency.
LiDAR simulation leverages physics, data, and neural networks, with neural rendering techniques like NeRF and 3DGS offering high-quality scene representation but facing challenges in generalization and rendering speed.
Radar simulation is particularly challenging due to noise and complex propagation effects, requiring specialized neural rendering approaches like DART and SAR-NeRF to model active sensors and Doppler information.
Camera simulation has seen rapid advancements, utilizing image warping, neural rendering (NeRF, 3DGS), scene editing, and generative models (GANs, diffusion models) to create realistic and diverse scenarios, though challenges remain in viewpoint extrapolation and real-time performance.
Vehicle platform modeling is crucial for closed-loop simulation, involving realistic SDV dynamics (kinematic and dynamic bicycle models) and Hardware-in-Loop (HIL) testing for accurate timing and system performance evaluation.
Localization is crucial for the safety and performance of autonomous vehicles, enabling precise positioning within HD maps.
Measuring localization accuracy involves metrics like Absolute Trajectory Error (ATE) and analyzing failure rates, with robustness being key due to downstream task dependencies.
Various online localization techniques, including semantic matching, geometric alignment, and neural fields, offer different trade-offs in terms of accuracy, computational cost, and robustness.
Global localization methods like RTK GPS, pose regression, and hybrid approaches are essential for initialization and recovery, addressing challenges like environmental changes and sensor noise.
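
The takeaways above mention the kinematic bicycle model as one option for SDV dynamics in closed-loop simulation. A minimal sketch of one common textbook form is given below: rear-axle reference point, forward Euler integration. The state layout, parameter names, and the 4.0 m wheelbase used in the usage note are illustrative assumptions, not values from the tutorial.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleState:
    x: float        # position east, metres
    y: float        # position north, metres
    heading: float  # yaw angle, radians
    speed: float    # longitudinal speed, metres per second


def bicycle_step(state: VehicleState, accel: float, steer: float,
                 wheelbase: float, dt: float) -> VehicleState:
    """Advance the kinematic bicycle model by one Euler step of length dt.

    accel is longitudinal acceleration (m/s^2); steer is the front-wheel
    steering angle (rad). Tire slip is ignored, which is the defining
    simplification of the kinematic (as opposed to dynamic) model.
    """
    return VehicleState(
        x=state.x + state.speed * math.cos(state.heading) * dt,
        y=state.y + state.speed * math.sin(state.heading) * dt,
        heading=state.heading + state.speed / wheelbase * math.tan(steer) * dt,
        speed=state.speed + accel * dt,
    )
```

Calling `bicycle_step(VehicleState(0.0, 0.0, 0.0, 10.0), accel=0.0, steer=0.0, wheelbase=4.0, dt=0.1)` repeatedly moves the vehicle straight along the x axis. A dynamic bicycle model would add tire forces and slip angles, which matter more at high speed or for heavy vehicles.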

Methods / Models / Datasets Mentioned

3D Voxels
A2D2
AIDE (Automatic Data Engine)
AMV-Bench
ATG4D
Absolute Trajectory Error (ATE)
Adaptive Feature Diffusion (AFD)
Affordances
AlignMIF
Argoverse
BEV (Bird's Eye View)
BEVFormer
Bayes Filtering
BeiDou
Block-NeRF
Boreas
CARLA
CLIP
CLIP filtering
CNN
COCO
CRN
Cam-to-LiDAR Matching
Cityscapes
ClimateNeRF
Common Crawl
Continual & Self-Training
Copilot4D
CyCADA
DART
DAgger
DSDNet
DVS
DeTra
Deep-drive
DeepLoc
Dense Captioning
DriveAgent
DriveDreamer
DriveGAN
DriveGPT4
DriveVLM
DrivingGaussian
Dynamic bicycle model
EmerNeRF
Event Cameras
Extended Kalman Filter (EKF)
FCN
FEGR
FMCW LiDAR
FMCW Radar
Feedforward/Feedback Control
Ford
GAIA-1
GLONASS
GNN
GNSS
GPS
GPT-4
GTA
GeoSim
Geometric Alignment
Global Navigation Satellite System (GNSS)
Hierarchical Localization
Histogram Filter
Hybrid Localization
Image Captioning
ImageNet
Implicit Occupancy
ImplicitO
Inertial Measurement Unit (IMU)
Information Gain
Inovis
Iterative Closest Point (ICP) algorithm
K-center problem
KITTI
KITTI-360
Kalman Filter
Kinematic bicycle model
Lane Graph
LiDAR
LiDAR Reflectance Matching
LiDAR-Physics Enhanced NeRF
LiDARDM
LightSim
Local Matching
LookOut
Lyft (Perception)
Lyft Prediction Dataset
MARS
MEMS LiDAR
MP3
MPC
Margin-based active learning
MemorySeg
MoDAR
Model Predictive Control
Multi Object Tracking
Multi-level NSG
NMP
NeRF
NeuRAD
NeuRas
Neural Fields (NeRF)
Neural LiDAR Fields
Neural Motion Planner
NeuroNCAP
Object Detection
Output Entropy
Oxford RobotCar
P3
PID controller
Pit30M
PLUMENet
PV-RCNN++
PVG
PandaSet
Point Set
Policy distillation
Pose Regression (PoseNet)
QuAD
RADAR
RANSAC
RLHF (Reinforcement Learning from Human Feedback)
RPVNet
RTK
Range-View (RV)
Rasterization
Real-Time Kinematic (RTK) GPS
Refinement Module
Retrieval-Based (NetVLAD)
Rules-Based Tagging Pipeline
S3Gaussian
SAR-NeRF
SPADE
ST-P3
SUDS
Semantic Matching
SemanticKITTI
Sora
StreetSurf
StrobE
Submanifold Sparse Convolutional Networks
Transformer
Uncertainty Selection
UniSim
UrbanIR
VISTA
VastGaussian
Velodyne
Waymo (Perception)
Waymo Motion Dataset
Wheel Encoders
World on Rails
Zenseact Open Dataset
Zero-shot detection
nuScenes
pix2pixHD
vid2vid

Topics

3D Gaussian Splatting · 3D Perception · AI-first approach · Absolute Trajectory Error · Active Learning · Auto-Labeling · Autonomous Driving · Autonomy stack · Camera simulation · Cameras · Controls · Data Curation · Data Labeling · Dataset Diversity · Failure rate · Geometric alignment · Global localization · Ground truth · HD Maps · Hardware-in-Loop simulation · Hybrid localization · Implicit Neural Representations · Intelligent Data Mining · Interpretability · Learning · LiDAR · LiDAR reflectance matching · LiDAR simulation · Local matching · Localization · Memory in Perception · Motion Planning · Multi-Object Tracking · Neural fields · Neural rendering · Object Detection · Occupancy Prediction · Online localization · Open-Set Perception · Open-Set Tagging · Pose regression · RTK GPS · Radar simulation · Reactivity · Self-Driving Datasets · Self-Driving Vehicles · Self-driving cars · Self-driving trucks · Semantic matching · Sensor Fusion · Sensor fusion simulation · Simulation · Unsupervised Learning · Unsupervised Object Discovery · Vehicle dynamics modeling · Vision-Language Models
