All you need to know about self-driving: Intro to Self-Driving

Event: CVPR 2024 Tutorial · Duration: 512 min

Abstract

This segment provides a comprehensive introduction to self-driving technology, starting with an overview of Waabi, a company specializing in autonomous trucks, and its recent funding round. It covers the fundamental components of an autonomy stack and illustrates the challenges of autonomous driving through several complex scenarios. The presentation then critiques traditional modular approaches and emerging end-to-end AI-first approaches, proposing Waabi’s interpretable, scalable, AI-first solution centered on a foundation model. The segment also covers the role of simulation in development and deployment, followed by a detailed discussion of hardware and sensor configurations, including GNSS, LiDAR, RADAR, and various camera types, along with their respective strengths and limitations. Finally, it explores 3D perception tasks, data representations for LiDAR and cameras, and fusion strategies, emphasizing the role of high-definition maps in providing rich prior information for robust perception.

This segment covers scene understanding for autonomous driving, focusing on HD maps, object detection, and the role of memory. It explores different representations for HD maps, including ground height, raster, and lane graphs, and discusses their strengths and weaknesses. The segment then turns to object detection frameworks, training methodologies, and occupancy (explicit and implicit) as an alternative output representation. A significant portion is dedicated to memory in perception, differentiating between multi-object tracking, object-level feature memory, and scene-level feature memory. Finally, it addresses the challenge of perceiving unknown objects through open-set segmentation, unsupervised detection, and language-meets-perception approaches, concluding with self-supervised implicit occupancy.

This segment covers the core components of motion planning and controls for self-driving vehicles. It begins by outlining input representations, including HD Maps and vehicle intention, and then explores output formats for motion planners, from direct actuation to cost volumes and auxiliary objectives. The discussion moves on to learning paradigms, contrasting open-loop and closed-loop approaches and examining how behavior cloning, DAgger, policy distillation, and reinforcement learning are applied. A significant portion is dedicated to interpretability, showcasing models such as Neural Motion Planner, DSDNet, P3, QuAD, and MP3, which integrate perception, prediction, and planning for greater transparency. Finally, the segment addresses reactivity in uncertain environments and reviews control strategies such as Model Predictive Control and Feedforward/Feedback control, emphasizing the role of actuator characterization.

This segment, presented by Andrei Bârsan, gives an overview of intelligent data mining for self-driving applications. It covers the role of datasets in machine learning development and the challenges of traditional data collection and labeling methods. The talk focuses on data curation strategies, including rules-based and learned tagging, active learning for model improvement, and techniques to ensure data diversity and coverage, especially for capturing the “long tail” of rare scenarios. It also explores scalable labeling approaches such as offboard auto-labeling and unsupervised object discovery, emphasizing the use of vision-language models and self-training to reduce reliance on costly human annotations.

This segment gives an overview of sensor simulation techniques for self-driving vehicles, covering LiDAR, Radar, and Camera modalities. It covers physics-based rendering, data-driven methods, and neural approaches such as NeRF and diffusion models, highlighting their respective advantages and limitations. The discussion also extends to vehicle platform modeling, detailing SDV dynamics and the importance of Hardware-in-Loop simulation for realistic closed-loop testing.

This segment covers localization for autonomous vehicles, explaining its fundamental role in ensuring safety and performance. It describes methods for measuring localization accuracy, including Absolute Trajectory Error and Failure Rate, and discusses the challenges posed by dynamic environments, sensor noise, and degenerate geometry. The segment then explores online and global localization techniques, highlighting their strengths and limitations, and concludes with an outlook on future research directions in this field.
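As a rough illustration of the two localization metrics named above, the sketch below computes Absolute Trajectory Error as the root-mean-square position error between an estimated and a reference trajectory, plus a simple failure rate. The function names, the 0.5 m threshold, and the definition of failure rate as the fraction of poses whose error exceeds that threshold are assumptions made for this example; the tutorial may define them differently. The sketch also assumes both trajectories are already expressed in the same frame and sampled at the same timestamps, so no alignment step is included.

```python
import math


def absolute_trajectory_error(estimated, ground_truth):
    """RMSE of 2D position error between two time-aligned trajectories.

    Assumption: both inputs are lists of (x, y) tuples in metres, in the
    same coordinate frame, with matching timestamps.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        (ex - gx) ** 2 + (ey - gy) ** 2
        for (ex, ey), (gx, gy) in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def failure_rate(estimated, ground_truth, threshold_m=0.5):
    """Fraction of poses whose position error exceeds threshold_m.

    The threshold is a hypothetical value chosen for illustration.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    errors = [
        math.hypot(ex - gx, ey - gy)
        for (ex, ey), (gx, gy) in zip(estimated, ground_truth)
    ]
    return sum(error > threshold_m for error in errors) / len(errors)


# Toy data: the estimate drifts sideways from a straight reference path.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
estimated = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4), (3.0, 1.0)]

ate = absolute_trajectory_error(estimated, ground_truth)
rate = failure_rate(estimated, ground_truth)
print(f"ATE (RMSE): {ate:.3f} m, failure rate: {rate:.2f}")
```

Reporting both numbers is useful because a low average error can hide a small number of large excursions, which is what the threshold-based rate is meant to expose.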

Speakers

  • Sergio Casas — Waabi
  • Andrei Bârsan — Waabi
  • Sean Segal
  • Joyce Yang
  • Nikita Dvornik
  • Raquel Urtasun
  • Siva Manivasagam — Waabi

Talks (16)

  • 00:00:00 — Sergio Casas: Intro to Self-Driving
    • Introduction to Waabi, its mission in self-driving trucks, recent funding, and an overview of the tutorial’s comprehensive scope beyond typical computer vision perspectives.
  • 00:25:21 — Andrei Bârsan: Hardware and Sensor Configurations
    • An overview of various sensor types (GNSS, LiDAR, RADAR, Cameras, Event Cameras) used in autonomous driving, their strengths, limitations, and the importance of robust mechanical design for integration.
  • 00:57:00 — Sergio Casas: 3D Perception
    • Detailed exploration of 3D perception tasks, LiDAR and camera data representations, multi-sensor fusion techniques, and the role of HD maps in providing rich priors for autonomous driving.
  • 01:25:17 — Sergio Casas: HD Maps: Ground height
    • Discusses ground height maps, their construction (RANSAC, FCN), and their benefits for perception (priors, easier localization) and challenges (map changes).
  • 01:30:22 — Sergio Casas: Object Level Feature Memory
    • Explains object-level feature memory as a more flexible alternative to tracking, where past object features are fed as input to update the model, offering efficiency but still being object-level restricted.
  • 01:35:24 — Sergio Casas: Conclusion
    • Summarizes the key takeaways from the presentation, emphasizing the importance of rich scene representations, uncertainty modeling, and alternative representations like occupancy for autonomous driving.
  • 02:50:35 — Andrei Bârsan: HD Maps and Intention as Inputs
    • Discusses HD Maps (Rasterization, Lane Graph, Affordances) and vehicle intention (routes, commands) as crucial inputs for motion planning, highlighting their pros and cons.
  • 02:56:25 — Andrei Bârsan: Designing Interpretable End-to-End Motion Planners
    • Details various interpretable motion planners like NMP, DSDNet, P3, Implicit Occupancy (QuAD), and MP3, focusing on how they integrate perception, prediction, and planning for safer and more understandable autonomous driving.
  • 03:01:50 — Andrei Bârsan: Challenges and Solutions for Reactive Planning in Uncertain Environments
    • Addresses challenges of planning in uncertain environments, including actor reactivity and unlikely events, and presents solutions like planning with reactive predictions and contingencies (LookOut).
  • 04:15:52 — Andrei Bârsan: All you need to know about self-driving: Intelligent Data Mining
    • This talk provides an overview of intelligent data mining for self-driving, covering data curation strategies (interestingness, model improvement, diversity) and scalable labeling techniques (auto-labeling, unsupervised discovery).
  • 05:41:10 — Siva Manivasagam: LiDAR simulation: physics + data + neural nets
    • Discussion of LiDAR simulation techniques, including ray casting, ML for noise, and generative models like LiDARDM.
  • 05:46:22 — Siva Manivasagam: Camera simulation
    • Introduction to camera simulation, briefly mentioning HD cameras and the image signal processing pipeline, and reviewing various simulation techniques.
  • 05:51:35 — Siva Manivasagam: Camera simulation: neural radiance field (NeRF)
    • Detailed compositional NeRF approaches for camera simulation in driving scenes, including UniSim, SUDS, MARS, StreetSurf, EmerNeRF, and Multi-level NSG.
  • 05:56:35 — Siva Manivasagam: Summary of sensor simulation techniques
    • Summary of sensor simulation techniques, highlighting the pros and cons of physics, data, and ML approaches.
  • 06:01:35 — Siva Manivasagam: All you need to know about self-driving: Simulation
    • Answer to a question about NeRF extrapolation for large deviations from the original trajectory.
  • 07:06:27 — Andrei Bârsan: All you need to know about self-driving: Localization
    • This segment provides a comprehensive overview of localization in self-driving cars, covering its importance, measurement metrics, challenges, and various online and global localization techniques.

Key Takeaways

  • Autonomous driving is a complex problem requiring robust solutions that go beyond traditional modular or purely end-to-end AI approaches, necessitating interpretable, scalable, and capital-efficient systems.
  • A diverse array of sensors (GNSS, LiDAR, RADAR, Cameras, Event Cameras) is crucial for comprehensive environmental perception, each with unique strengths and weaknesses that must be carefully considered for redundancy and reliability.
  • Simulation plays a pivotal role in the development and deployment of autonomous systems, allowing for extensive testing of rare and challenging scenarios that are difficult or dangerous to encounter in the real world.
  • Effective 3D perception relies on sophisticated data representations for LiDAR and cameras, coupled with advanced fusion techniques and the integration of high-definition maps to provide rich prior knowledge for robust and accurate environmental understanding.
  • HD maps provide crucial prior information for perception, with various representations offering different trade-offs in terms of semantic detail and computational efficiency.
  • Occupancy prediction, especially implicit models, offers a powerful and flexible way to represent the environment, naturally capturing complex shapes and uncertainty without relying on rigid bounding boxes.
  • Incorporating memory into perception systems, whether at the object or scene level, significantly enhances robustness to occlusion and allows for richer accumulation of sensor evidence over time.
  • Addressing unknown objects and open-set scenarios is critical for autonomous driving safety, with promising avenues in self-supervised learning and knowledge transfer from large language/vision models.
  • Various representations of HD Maps (Rasterization, Lane Graph, Affordances) offer different trade-offs in processing, receptive field, and prior incorporation for motion planning.
  • Motion planners can output direct actuation, trajectories, cost volumes, affordances, or auxiliary task predictions, each with distinct strengths in simplicity, interpretability, and prior knowledge integration.
  • Learning for motion planning involves open-loop (behavior cloning) and closed-loop (DAgger, policy distillation, RL) approaches, with closed-loop methods addressing distribution shift but requiring realistic simulation or interactive experts.
  • Interpretability in end-to-end motion planners can be achieved by predicting cost volumes, jointly reasoning about perception-prediction-planning, using scene-level occupancy representations, or leveraging implicit occupancy functions.
  • Data curation is crucial for self-driving ML, focusing on interesting, diverse, and model-improving data.
  • Traditional human labeling is expensive and unscalable, necessitating automated and intelligent labeling techniques.
  • Active learning and open-set tagging using vision-language models are promising for efficient data selection and handling novel scenarios.
  • Advanced auto-labeling pipelines, including those leveraging VLMs, can significantly reduce human intervention and improve training efficiency.
  • LiDAR simulation leverages physics, data, and neural networks, with neural rendering techniques like NeRF and 3DGS offering high-quality scene representation but facing challenges in generalization and rendering speed.
  • Radar simulation is particularly challenging due to noise and complex propagation effects, requiring specialized neural rendering approaches like DART and SAR-NeRF to model active sensors and Doppler information.
  • Camera simulation has seen rapid advancements, utilizing image warping, neural rendering (NeRF, 3DGS), scene editing, and generative models (GANs, diffusion models) to create realistic and diverse scenarios, though challenges remain in viewpoint extrapolation and real-time performance.
  • Vehicle platform modeling is crucial for closed-loop simulation, involving realistic SDV dynamics (kinematic and dynamic bicycle models) and Hardware-in-Loop (HIL) testing for accurate timing and system performance evaluation.
  • Localization is crucial for the safety and performance of autonomous vehicles, enabling precise positioning within HD maps.
  • Measuring localization accuracy involves metrics like Absolute Trajectory Error (ATE) and analyzing failure rates, with robustness being key due to downstream task dependencies.
  • Various online localization techniques, including semantic matching, geometric alignment, and neural fields, offer different trade-offs in terms of accuracy, computational cost, and robustness.
  • Global localization methods like RTK GPS, pose regression, and hybrid approaches are essential for initialization and recovery, addressing challenges like environmental changes and sensor noise.
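The kinematic bicycle model mentioned in the vehicle platform takeaway can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulator described in the tutorial: it uses the common rear-axle formulation with forward-Euler integration, and the `State` class, the `step` function, the 3.0 m wheelbase, and the 0.1 s time step are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    x: float      # rear-axle position, metres
    y: float      # rear-axle position, metres
    yaw: float    # heading, radians
    speed: float  # longitudinal speed, m/s


def step(state, accel, steer, wheelbase=3.0, dt=0.1):
    """Advance the state by one forward-Euler step.

    accel is longitudinal acceleration (m/s^2) and steer is the front
    wheel steering angle (radians). Wheelbase and dt are illustrative
    values, not parameters taken from the tutorial.
    """
    x = state.x + state.speed * math.cos(state.yaw) * dt
    y = state.y + state.speed * math.sin(state.yaw) * dt
    yaw = state.yaw + state.speed / wheelbase * math.tan(steer) * dt
    speed = state.speed + accel * dt
    return State(x, y, yaw, speed)


# Drive straight at a constant 10 m/s for one second (10 steps of 0.1 s).
state = State(x=0.0, y=0.0, yaw=0.0, speed=10.0)
for _ in range(10):
    state = step(state, accel=0.0, steer=0.0)
print(state)
```

A kinematic model like this ignores tire slip and load transfer, which is why the takeaway also lists dynamic bicycle models for cases where those effects matter.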

Methods / Models / Datasets Mentioned

  • 3D Voxels
  • A2D2
  • AIDE (Automatic Data Engine)
  • AMV-Bench
  • ATG4D
  • Absolute Trajectory Error (ATE)
  • Adaptive Feature Diffusion (AFD)
  • Affordances
  • AlignMIF
  • Argoverse
  • BEV (Bird's Eye View)
  • BEVFormer
  • Bayes Filtering
  • BeiDou
  • Block-NeRF
  • Boreas
  • CARLA
  • CLIP
  • CLIP filtering
  • CNN
  • COCO
  • CRN
  • Cam-to-LiDAR Matching
  • Cityscapes
  • ClimateNeRF
  • Common Crawl
  • Continual & Self-Training
  • Copilot4D
  • CyCADA
  • DART
  • DAgger
  • DSDNet
  • DVS
  • DeTra
  • Deep-drive
  • DeepLoc
  • Dense Captioning
  • DriveAgent
  • DriveDreamer
  • DriveGAN
  • DriveGPT4
  • DriveVLM
  • DrivingGaussian
  • Dynamic bicycle model
  • EmerNeRF
  • Event Cameras
  • Extended Kalman Filter (EKF)
  • FCN
  • FEGR
  • FMCW LiDAR
  • FMCW Radar
  • Feedforward/Feedback Control
  • Ford
  • GAIA-1
  • GLONASS
  • GNN
  • GNSS
  • GPS
  • GPT-4
  • GTA
  • GeoSim
  • Geometric Alignment
  • Global Navigation Satellite System (GNSS)
  • Hierarchical Localization
  • Histogram Filter
  • Hybrid Localization
  • Image Captioning
  • ImageNet
  • Implicit Occupancy
  • ImplicitO
  • Inertial Measurement Unit (IMU)
  • Information Gain
  • Innoviz
  • Iterative Closest Point (ICP) algorithm
  • K-center problem
  • KITTI
  • KITTI-360
  • Kalman Filter
  • Kinematic bicycle model
  • Lane Graph
  • LiDAR
  • LiDAR Reflectance Matching
  • LiDAR-Physics Enhanced NeRF
  • LiDARDM
  • LightSim
  • Local Matching
  • LookOut
  • Lyft (Perception)
  • Lyft Prediction Dataset
  • MARS
  • MEMS LiDAR
  • MP3
  • MPC
  • Margin-based active learning
  • MemorySeg
  • MoDAR
  • Model Predictive Control
  • Multi Object Tracking
  • Multi-level NSG
  • NMP
  • NeRF
  • NeuRAD
  • NeuRas
  • Neural Fields (NeRF)
  • Neural LiDAR Fields
  • Neural Motion Planner
  • NeuroNCAP
  • Object Detection
  • Output Entropy
  • Oxford RobotCar
  • P3
  • PID controller
  • Pit30M
  • PLUMENet
  • PV-RCNN++
  • PVG
  • PandaSet
  • Point Set
  • Policy distillation
  • Pose Regression (PoseNet)
  • QuAD
  • RADAR
  • RANSAC
  • RLHF (Reinforcement Learning from Human Feedback)
  • RPVNet
  • RTK
  • Range-View (RV)
  • Rasterization
  • Real-Time Kinematic (RTK) GPS
  • Refinement Module
  • Retrieval-Based (NetVLAD)
  • Rules-Based Tagging Pipeline
  • S3Gaussian
  • SAR-NeRF
  • SPADE
  • ST-P3
  • SUDS
  • Semantic Matching
  • SemanticKITTI
  • Sora
  • StreetSurf
  • StROBE
  • Submanifold Sparse Convolutional Networks
  • Transformer
  • Uncertainty Selection
  • UniSim
  • UrbanIR
  • VISTA
  • VastGaussian
  • Velodyne
  • Waymo (Perception)
  • Waymo Motion Dataset
  • Wheel Encoders
  • World on Rails
  • Zenseact Open Dataset
  • Zero-shot detection
  • nuScenes
  • pix2pixHD
  • vid2vid
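Several entries in the list above (Output Entropy, Margin-based active learning, Uncertainty Selection) name scoring rules used to pick which unlabeled samples to send for annotation. The sketch below shows the two scores on toy classifier outputs. It is an illustration under assumptions, not the pipeline presented in the talk: the function names, the frame ids, and the probability values are all made up for the example.

```python
import math


def output_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector.

    Higher entropy means the model is less certain about the sample.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin(probs):
    """Gap between the two highest class probabilities.

    A smaller margin means the model is closer to confusing two classes.
    """
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` samples with the smallest margin."""
    ranked = sorted(predictions, key=lambda item: margin(item[1]))
    return [sample_id for sample_id, _ in ranked[:budget]]


# Hypothetical per-frame class probabilities from some classifier.
predictions = [
    ("frame_a", [0.98, 0.01, 0.01]),
    ("frame_b", [0.40, 0.35, 0.25]),
    ("frame_c", [0.70, 0.20, 0.10]),
]

picked = select_for_labeling(predictions, budget=2)
print(picked)
for sample_id, probs in predictions:
    print(sample_id, round(output_entropy(probs), 3), round(margin(probs), 3))
```

Both scores favor the ambiguous frame here; in general they can rank samples differently, since entropy looks at the whole distribution while margin only looks at the top two classes.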

Topics

3D Gaussian Splatting · 3D Perception · AI-first approach · Absolute Trajectory Error · Active Learning · Auto-Labeling · Autonomous Driving · Autonomy stack · Camera simulation · Cameras · Controls · Data Curation · Data Labeling · Dataset Diversity · Failure rate · Geometric alignment · Global localization · Ground truth · HD Maps · Hardware-in-Loop simulation · Hybrid localization · Implicit Neural Representations · Intelligent Data Mining · Interpretability · Learning · LiDAR · LiDAR reflectance matching · LiDAR simulation · Local matching · Localization · Memory in Perception · Motion Planning · Multi-Object Tracking · Neural fields · Neural rendering · Object Detection · Occupancy Prediction · Online localization · Open-Set Perception · Open-Set Tagging · Pose regression · RTK GPS · Radar simulation · Reactivity · Self-Driving Datasets · Self-Driving Vehicles · Self-driving cars · Self-driving trucks · Semantic matching · Sensor Fusion · Sensor fusion simulation · Simulation · Unsupervised Learning · Unsupervised Object Discovery · Vehicle dynamics modeling · Vision-Language Models

