Generating The Invisible: Capturing and Generating Edge-cases in Autonomous Driving

Event: CVPR 2024 Workshop · Duration: 556 min

Abstract

This segment features multiple talks from the CVPR 2024 workshop on ‘Generating The Invisible: Capturing and Generating Edge-cases in Autonomous Driving.’ Felix Heide from Torc Robotics and Princeton U. discusses their end-to-end differentiable AV stack, emphasizing the role of generative AI and neural rendering in scalable simulation, handling edge cases, and explainable multi-object tracking for autonomous trucks. Siva Manivasagam from Waabi presents Waabi World, a high-fidelity, closed-loop simulator designed for safe self-driving development, detailing its capabilities in digital twin creation, sensor simulation, and robust evaluation. Haowei Sun from the University of Michigan introduces Dense Deep Reinforcement Learning (D2RL) within a Naturalistic and Adversarial Driving Environment (NADE) for efficient safety validation of autonomous vehicles by focusing on rare, safety-critical events. Hanfeng Wu from ETH Zurich presents a method for dynamic LIDAR re-simulation using compositional neural fields to improve geometry reconstruction and reduce the reality-simulation domain gap.

This segment features a series of lightning talks from the CVPR 2024 Workshop on Data-Driven Autonomous Driving Simulation. Presentations cover diverse topics including holistic urban 3D scene understanding via Gaussian Splatting, multi-level neural scene graphs for dynamic urban environments, and Lidar-enhanced neural radiance fields for street scenes. Further talks delve into transformer-based generative models for multi-agent traffic simulation and synthesizing simulation environments with generative models. The segment concludes with a keynote on machine learning for realistic and efficient driving simulation, showcasing Waymo’s advancements in sensor and traffic simulation.

This segment delves into advanced traffic simulation techniques, focusing on the “Scene Diffuser” model which leverages diffusion models for both scene initialization and rollout. It introduces the “Scene Tensor” concept, enabling various tasks like behavior prediction and scene generation through in-painting. A key innovation is the integration of generalized hard constraints during the diffusion process to enhance realism and the use of amortized diffusion for improved closed-loop motion generation efficiency. The talk also explores the critical challenge of “Agent Tail Realism” and how inference-time constraints, including reaction and dynamic constraints, can be applied to generate and test rare, high-risk scenarios, even using LLMs for prompt-based scene control.

This segment introduces the concept of valid human agent models for autonomous driving (AD) simulations, emphasizing the need to accurately represent human behavior and cognitive processes. The speaker highlights the challenges and importance of developing realistic human models to ensure the reliability and safety testing of AD systems. The presentation delves into various behavioral phenomena and cognitive mechanisms that influence human driving, advocating for their integration into simulation environments.

This segment features a series of lightning talks on data-driven autonomous driving simulation. Topics range from the importance of modeling human behavior and cognitive mechanisms in AD testing, to the development of neural rendering techniques for generating realistic and safety-critical scenarios. Speakers also present advancements in reinforcement learning for autonomous driving, vision-language models for complex scene understanding, and novel 3D scene reconstruction methods using Gaussian splatting. The segment concludes with a discussion on editable scene simulation using LLM-agents and perceiving 3D scenes from single-glance images through neural field distillation.

This segment features two talks on the future of embodied AI, with a strong focus on autonomous driving. Jamie Shotton from Wayve discusses the limitations of traditional AV stacks and introduces Wayve’s end-to-end AI approach, leveraging advanced simulation techniques like PRISM-1 and multimodal foundation models like GAIA and LINGO-2 to achieve generalization, safety, and human-like understanding in complex driving scenarios. Kashyap Chitta then introduces his work on synthesizing simulation environments with generative models, highlighting the importance of graphics simulators in autonomous driving research.

This segment introduces SLEDGE, a novel generative model-based simulator for autonomous driving. It highlights the limitations of traditional graphics and log replay simulators, such as high computational cost and limited scenario diversity. SLEDGE addresses these issues by synthesizing realistic and diverse driving scenes, enabling arbitrary duration and routes, and offering a compact representation. The talk details the technical approach, including raster-to-vector autoencoders and latent diffusion transformers, and demonstrates SLEDGE’s capabilities in HD-map generation, agent inpainting, and autoregressive map and agent generation, showcasing its potential for rigorous testing and development of autonomous driving algorithms.

Speakers

Felix Heide — Torc Robotics & Princeton U.
Siva Manivasagam — Waabi, Head of Sensor Simulation
Haowei Sun — University of Michigan
Hanfeng Wu — ETH Zurich
Yiyi Liao — Zhejiang University
Hongyu Zhou — Zhejiang University
Jiahao Shao — Zhejiang University
Lu Xu — Huawei Noah’s Ark Lab
Dongfeng Bai — University of Tübingen
Weichao Qiu — Tübingen AI Center
Bingbing Liu — Tübingen AI Center
Yue Wang — Zhejiang University
Andreas Geiger — University of Tübingen
Tobias Fischer — ETH Zürich, Meta Reality Labs
Lorenzo Porzi — Meta Reality Labs
Samuel Rota Bulò — Meta Reality Labs
Marc Pollefeys — ETH Zürich
Peter Kontschieder — Meta Reality Labs
Shanlin Sun — UCI, NEC Laboratories America
Bingbing Zhuang — NEC Laboratories America
Ziyu Jiang — UC San Diego
Buyu Liu — UC San Diego
Xiaohui Xie — UCI
Manmohan Chandraker — UC San Diego, NEC Laboratories America
Tiebiao Zhao — Nvidia
Yu Wang — Pegasus
Fan Yi — Nvidia
Guangzhi Cao — ZDrive.ai
Kashyap Chitta — University of Tübingen
Dragomir (Drago) Anguelov — VP, Head of Research, Waymo
Prof. Gustav Markkula — Chair in Applied Behaviour Modelling, Institute for Transport Studies, University of Leeds
Adam Tonderski — Zenseact, Chalmers University, Lund University, WASP
Carl Lindström — Zenseact, Chalmers University, Lund University, WASP
Georg Hess — Zenseact, Chalmers University, Lund University, WASP
William Ljungbergh — Zenseact, Chalmers University, Lund University, WASP
Lennart Svensson — Zenseact, Chalmers University, Lund University, WASP
Christoffer Petersson — Zenseact, Chalmers University, Lund University, WASP
Joakim Johnander — Zenseact, Chalmers University, Lund University, WASP
Holger Caesar — Zenseact, Chalmers University, Lund University, WASP
Kalle Åström — Zenseact, Chalmers University, Lund University, WASP
Michael Felsberg — Zenseact, Chalmers University, Lund University, WASP
Moritz Harmel — Zoox
Anubhav Paras — Zoox
Andreas Pasternak — Zoox
Nicholas Roy — Zoox
Gary Linscott — Zoox
Katrin Renz — Eberhard Karls Universität Tübingen
Chonghao Sima — Eberhard Karls Universität Tübingen
Hang Zhao — Tsinghua University
Xiaoyu Tian — Tsinghua University
Junru Gu — Tsinghua University
Yicheng Liu — Tsinghua University
Xiaoyu Zhou — Peking University
Zhiwei Lin — Peking University
Xiaojun Shan — Peking University
Yongtao Wang — Peking University
Deqing Sun — Google Research
Ming-Hsuan Yang — University of California, Merced
Yuxi Wei — Shanghai Jiao Tong University
Zi Wang — Shanghai Jiao Tong University
Yifan Lu — Shanghai Jiao Tong University
Chenxin Xu — Shanghai Jiao Tong University
Changxing Liu — Shanghai Jiao Tong University
Hao Zhao — Shanghai Jiao Tong University
Siheng Chen — Shanghai Jiao Tong University
Yanfeng Wang — Shanghai Jiao Tong University
Letian Wang — University of Toronto
Seung Wook Kim — University of Toronto
Jiawei Yang — University of Toronto
Cunjun Yu — University of Toronto
Boris Ivanovic — University of Toronto
Steven Waslander — University of Toronto
Sanja Fidler — University of Toronto
Marco Pavone — University of Toronto
Peter Karkus — University of Toronto
Jamie Shotton — Chief Scientist, Wayve

Talks (24)

00:00:00 — Felix Heide: Generating The Invisible: Capturing and Generating Edge-cases in Autonomous Driving
- This segment introduces the workshop agenda and the first speaker, Felix Heide, who will discuss generative AI for autonomous driving.
01:19:28 — Yiyi Liao: HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
- This talk presents HUGS, a method for holistic urban 3D scene understanding using Gaussian Splatting, which extends 3D Gaussians with multi-modal information and utilizes a unicycle model for robust object motion optimization, enabling real-time rendering and 3D semantic reconstruction from single RGB video inputs.
01:19:28 — Tobias Fischer: Multi-Level Neural Scene Graphs for Dynamic Urban Environments
- This presentation introduces a scalable, multi-level neural scene graph representation for dynamic urban environments, enabling fast training and rendering via composite ray sampling and offering powerful scene editing capabilities, while also proposing a benchmark for radiance field reconstruction from heterogeneous vehicle captures.
01:19:28 — Shanlin Sun: LidarRF: Delving into Lidar for Neural Radiance Field on Street Scenes
- This talk introduces LidarRF, a novel approach that leverages Lidar data to enhance Neural Radiance Fields for photorealistic street scene simulation, incorporating Lidar encoding, robust depth supervision with curriculum learning, and augmented view supervision to improve reconstruction quality and address challenges in sparse data regions.
01:19:28 — Tiebiao Zhao: Multiverse Transformer: Advancing Closed-Loop Multi-Agent Simulation with Generative Model
- This presentation introduces the Multiverse Transformer, a transformer-based generative model for closed-loop multi-agent traffic simulation, which generates diverse parallel universes of driving scenarios, utilizes a receding prediction horizon for multi-modal diversity, and achieved top performance in the Waymo Open Sim Agents Challenge.
01:19:28 — Kashyap Chitta: Synthesizing Simulation Environments with Generative Models
- This talk presents SLEDGE, a generative model for synthesizing simulation environments, focusing on HD-Map generation and agent inpainting, demonstrating state-of-the-art results in creating diverse and realistic traffic scenarios for autonomous driving simulation.
01:19:28 — Drago Anguelov: ML for Realistic and Efficient Driving Simulation
- This talk discusses Waymo’s experience in autonomous driving, highlighting the challenges of complex, high-dimensional inputs and real-time latency requirements, and presents their machine learning approaches for realistic and efficient driving simulation, including advancements in sensor simulation using 3D Gaussian Splatting and diffusion models for traffic simulation.
02:38:57 — Dragomir Anguelov: Machine Learning for Realistic and Efficient Simulation
- This segment introduces the Scene Diffuser model for traffic simulation, detailing its use of diffusion models for scene initialization and rollout, incorporating hard constraints, and leveraging amortized diffusion for efficient closed-loop motion generation. It also explores controllability through inference-time constraints and LLMs, addressing the challenge of generating rare agent behaviors.
02:55:00 — Felix Heide: Generating the Invisible: Generative AI for Scalable Autonomous Driving
- This talk introduces Torc’s end-to-end differentiable AV stack, highlighting how generative AI and neural rendering are used for scalable simulation, sensor calibration, edge case handling, and explainable multi-object tracking in autonomous trucking.
03:58:26 — Prof. Gustav Markkula: Valid human agents in simulated AD testing: Behavioural phenomena and cognitive mechanisms
- This talk discusses the importance of valid human agent models in autonomous driving simulations, focusing on behavioral phenomena and cognitive mechanisms to ensure realistic and reliable testing.
03:59:59 — Siva Manivasagam: Generative AI for Developing and Deploying Self-driving Systems Safely
- This presentation details Waabi World, a high-fidelity, closed-loop, end-to-end simulator leveraging generative AI for safe and scalable self-driving development, focusing on digital twin creation, sensor simulation, and robust evaluation metrics.
05:18:00 — Gustav Markkula: Valid human agents in simulated AD testing: Behavioural phenomena and cognitive mechanisms
- Discusses the importance of modeling human behavior in AD testing, focusing on behavioral phenomena and cognitive mechanisms, and how high-level metrics are not always sufficient.
05:23:55 — Adam Tonderski: NeuRAD: Neural Rendering for Autonomous Driving
- Presents NeuRAD, a neural rendering method for autonomous driving, detailing its architecture, requirements, and state-of-the-art performance in generating realistic sensor data for AD scenarios.
05:28:00 — William Ljungbergh: Neural Rendering for Safety-critical Autonomous Driving Simulation
- Explains how NeuRAD is used in a closed-loop NeuroNCAP simulation engine to evaluate AD systems in safety-critical scenarios, highlighting the poor performance of current E2E planners.
05:33:00 — Moritz Harmel: Scaling Is All You Need: Autonomous Driving with JAX-Accelerated Reinforcement Learning
- Discusses using JAX-accelerated reinforcement learning in a realistic simulator to train autonomous driving policies, demonstrating improved safety and progress metrics through large-scale training.
05:38:00 — Katrin Renz: DriveLM: Driving with Graph Visual Question Answering
- Introduces DriveLM, a vision-language model approach that uses graph-based visual question answering for driving, emphasizing its potential for generalization and explainability in complex scenarios.
05:43:00 — Hang Zhao: DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
- Presents DriveVLM, a vision-language model for autonomous driving, and proposes a Dual System architecture to combine its reasoning capabilities with traditional pipelines for robust performance in long-tail scenarios.
05:48:00 — Xiaoyu Zhou: DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes
- Introduces DrivingGaussian, a 3DGS framework for reconstructing and rendering complex dynamic driving scenes from multi-sensor data, achieving photorealistic quality and enabling corner case simulation.
05:53:00 — Yuxi Wei: Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents
- Presents ChatSim, a language-controlled photorealistic driving scene simulation system that uses collaborative LLM-agents and advanced rendering techniques for easy and flexible scene editing.
05:58:00 — Letian Wang: DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features
- Introduces DistillNeRF, a method to reconstruct 3D scenes from a single image by distilling knowledge from pre-trained NeRFs and foundation models, achieving 3D consistency and generalization.
06:39:24 — Jamie Shotton: The Road to Embodied AI
- Jamie Shotton discusses the challenges and opportunities of embodied AI, particularly in the context of autonomous driving, highlighting Wayve’s end-to-end AI approach, simulation, and multimodal foundation models.
07:56:53 — Kashyap Chitta: SLEDGE: Synthesizing Simulation Environments for Driving Agents with Generative Models
- This talk introduces SLEDGE, a generative model-based simulator for autonomous driving that synthesizes diverse and realistic driving scenarios, offering advantages over traditional graphics and log replay simulators.
09:59:59 — Haowei Sun: Dense Reinforcement Learning for Safety Validation of Autonomous Vehicles
- This talk introduces Dense Deep Reinforcement Learning (D2RL) within a Naturalistic and Adversarial Driving Environment (NADE) to efficiently validate autonomous vehicles by focusing on rare, safety-critical events through Markov process editing.
15:29:59 — Hanfeng Wu: Dynamic LIDAR Re-simulation using Compositional Neural Fields
- This presentation introduces a method for dynamic LIDAR re-simulation using compositional neural fields to enhance geometry reconstruction, lower the domain gap between real and simulated data, and enable rich scene editing capabilities for dynamic driving scenarios.
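
The D2RL/NADE summary above rests on making rare, safety-critical events occur often enough during testing to be measured. The sketch below illustrates only the general idea, using plain importance sampling on a toy one-step scenario; it is not the D2RL algorithm and does not implement Markov process editing. The function name, the scenario, and all probabilities are hypothetical.

```python
import random


def estimate_crash_rate(p_risky, q_risky, crash_given_risky, n_episodes, seed=0):
    """Importance-sampling estimate of P(crash) in a toy one-step scenario.

    p_risky: naturalistic probability that another agent makes a risky maneuver
    q_risky: proposal probability of that maneuver used during testing
    crash_given_risky: probability the vehicle under test crashes if it happens
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        # Sample the maneuver from the proposal, not the naturalistic distribution.
        risky = rng.random() < q_risky
        # The likelihood ratio corrects for having sampled from q instead of p.
        if risky:
            weight = p_risky / q_risky
        else:
            weight = (1.0 - p_risky) / (1.0 - q_risky)
        crashed = risky and (rng.random() < crash_given_risky)
        total += weight * (1.0 if crashed else 0.0)
    return total / n_episodes


# Hypothetical numbers: the risky maneuver occurs once per 10,000 encounters.
true_rate = 1e-4 * 0.5
estimate = estimate_crash_rate(
    p_risky=1e-4, q_risky=0.5, crash_given_risky=0.5, n_episodes=20000
)
print(f"true={true_rate:.2e} estimate={estimate:.2e}")
```

Sampling directly from the naturalistic distribution would see the risky maneuver only about twice in 20,000 episodes; the proposal sees it in roughly half of them, which is why the reweighted estimate is far less noisy for the same budget.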

Key Takeaways

Generative AI and neural rendering are crucial for creating scalable and realistic simulations to address the vast number of edge cases in autonomous driving.
End-to-end differentiable AV stacks, from raw sensor data to planning, allow for comprehensive optimization and improved robustness in complex driving scenarios.
Advanced simulation platforms like Waabi World aim to provide high-fidelity, closed-loop testing environments that are diverse, fast, controllable, and realistic, reducing reliance on extensive real-world mileage.
Novel approaches like D2RL and compositional neural fields are being developed to enhance the efficiency of safety validation, improve sensor simulation realism, and bridge the reality-simulation domain gap for autonomous systems.
Advanced neural field-based methods, including 3D Gaussian Splatting, are crucial for achieving real-time, high-fidelity urban scene understanding and simulation, addressing limitations of previous NeRF-based approaches.
Integrating Lidar data through techniques like Lidar encoding and robust depth supervision significantly enhances the quality of 3D semantic reconstruction and rendering, particularly for complex street scenes.
Generative models, such as the Multiverse Transformer and diffusion models, are proving effective in creating diverse, realistic, and controllable multi-agent traffic simulations, which are essential for scalable system validation in autonomous driving.
Waymo’s extensive experience in autonomous driving highlights the importance of machine learning for both realistic sensor simulation and efficient traffic simulation, leveraging large datasets and advanced models to improve safety and performance.
Diffusion models offer a unified and flexible framework for both initializing and rolling out traffic scenes, treating various tasks as in-painting problems on a ‘Scene Tensor’.
Integrating generalized hard constraints directly into the diffusion process during inference significantly improves the realism of generated traffic scenarios by preventing physically impossible or unnatural behaviors.
Amortized diffusion provides a more efficient approach for closed-loop motion generation, substantially closing the performance gap with open-loop models while reducing computational cost.
Controllability, including the generation of rare and challenging ‘agent tail realism’ scenarios, can be achieved by specifying inference-time constraints, with potential for natural language interaction via LLMs.
Accurate human agent models are crucial for valid and reliable testing of autonomous driving systems in simulation.
Understanding and integrating complex human behavioral phenomena and cognitive mechanisms is key to developing effective human models for AD simulations.
The development of data-driven autonomous driving simulations requires robust human models to ensure that testing scenarios reflect real-world interactions and challenges.
Modeling human behavioral phenomena and cognitive mechanisms is crucial for validating autonomous driving systems, as high-level performance metrics alone may not capture critical aspects of human-like interaction.
Neural rendering techniques like NeuRAD offer a promising avenue for creating photorealistic and controllable simulations of safety-critical driving scenarios, enabling robust evaluation of AD systems in closed-loop environments.
Scaling reinforcement learning with JAX-accelerated simulators and large datasets can significantly improve the safety and performance of autonomous driving policies, demonstrating the potential for RL to surpass human driving capabilities.
Vision-language models (VLMs) such as DriveVLM offer holistic scene understanding and reasoning that address challenges in generalization and explainability, while LLM-agent systems such as ChatSim enable language-controlled simulation editing.
Embodied AI, particularly in autonomous driving, is a rapidly advancing field with the potential to transform human-technology interactions, moving beyond traditional AI tasks.
End-to-end AI systems, leveraging simulation and multimodal foundation models, offer a promising path to overcome the limitations of traditional, modular AV stacks by providing computational homogeneity, generalization through data, scalability, and superior safety.
Generative models and neural rendering are crucial for creating diverse, dynamic, and controllable simulation environments necessary for training and validating autonomous systems, especially for handling complex edge cases and enabling counterfactual testing.
Integrating language with vision and action through models like LINGO-2 allows for more explainable, intelligent, and trustworthy autonomous systems, enabling human-like understanding and interaction with the physical world.
Traditional graphics and log replay simulators for autonomous driving have limitations in terms of scenario diversity, computational cost, and reproducibility, hindering comprehensive testing and development.
Generative models, specifically latent diffusion transformers, can synthesize realistic and diverse driving scenes, offering a compact representation and enabling arbitrary duration and routes for simulations.
SLEDGE, a generative model-based simulator, leverages raster-to-vector autoencoders and transformer decoders to generate vector-based scene elements, providing enhanced controllability and enabling tasks like HD-map generation, agent inpainting, and spatial outpainting.
Long-duration simulations enabled by generative models are crucial for exposing failures in state-of-the-art planning algorithms, demonstrating the need for more diverse and controllable simulation environments beyond what traditional methods offer.
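
The takeaways about treating tasks as in-painting on a 'Scene Tensor' and applying hard constraints at inference time can be illustrated with a toy loop. This is a sketch under strong assumptions, not the Scene Diffuser method: the tensor is reduced to one position value per agent and timestep, `toy_denoise` is a hand-written smoothing step standing in for a learned denoiser, and the only constraint is a hypothetical per-step displacement limit. All names are illustrative.

```python
import random


def make_mask(num_agents, num_steps, task, history=3):
    """True marks entries that are given; False marks entries to generate."""
    if task == "behavior_prediction":  # past is known, future is generated
        return [[t < history for t in range(num_steps)] for _ in range(num_agents)]
    if task == "scene_generation":  # everything is generated
        return [[False] * num_steps for _ in range(num_agents)]
    raise ValueError(f"unknown task: {task}")


def toy_denoise(row):
    """Placeholder for a learned denoiser: pull values toward temporal neighbours."""
    out = []
    for t, x in enumerate(row):
        left = row[t - 1] if t > 0 else x
        right = row[t + 1] if t < len(row) - 1 else x
        out.append(0.5 * x + 0.25 * (left + right))
    return out


def clip_step(row, max_step):
    """Hard constraint: consecutive positions differ by at most max_step."""
    out = [row[0]]
    for x in row[1:]:
        delta = max(-max_step, min(max_step, x - out[-1]))
        out.append(out[-1] + delta)
    return out


def inpaint(observed, mask, num_iters=50, max_step=2.0, seed=0):
    rng = random.Random(seed)
    # Start from the observations where given and from noise elsewhere.
    scene = [
        [o if m else rng.gauss(0.0, 5.0) for o, m in zip(orow, mrow)]
        for orow, mrow in zip(observed, mask)
    ]
    for _ in range(num_iters):
        scene = [toy_denoise(row) for row in scene]
        # In-painting step: re-impose the known entries after every update.
        scene = [
            [o if m else x for o, m, x in zip(orow, mrow, srow)]
            for orow, mrow, srow in zip(observed, mask, scene)
        ]
        # Constraint step: project each trajectory back onto the feasible set.
        scene = [clip_step(row, max_step) for row in scene]
    return scene


# Two agents, six timesteps; only the first three steps are observed.
observed = [[0.0, 1.0, 2.0, 0.0, 0.0, 0.0], [10.0, 9.0, 8.0, 0.0, 0.0, 0.0]]
mask = make_mask(2, 6, "behavior_prediction", history=3)
scene = inpaint(observed, mask)
```

Switching the task only changes the mask, which is the point of the in-painting framing: the same generation loop serves behavior prediction and unconditional scene generation. The forward clipping pass leaves the observed history untouched here because the history is a prefix of each row and already satisfies the limit.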

Methods / Models / Datasets Mentioned

2D encoder
3D Asset Management
3D Gaussian Splatting (3DGS)
3D decoder
3D perception
Accelerated driving simulator
Action-sensitive theory of mind
AdaLN
Algolux
AutoBots
BERT
BLIP-2 Q-Former
Background Rendering
Bayesian perceptual filtering
Behavioral Cloning
BlockNeRF
CARLA
CLIP
COMPASS
ChatSim
Closed-loop simulation
Collision severity score
Composite Gaussian Splatting (3DGS)
Ctrl-Sim
D2RL
DETR
DINOv2
Depth estimation
DiT
Diffusion Models
DistillNeRF
Distributed RL
DriveLM-Data
DriveVLM
DriveVLM-Dual
DrivingGaussian
Dual System for Autonomous Driving
DyNFL
Dynamic Gaussian Graph
Euro NCAP scenarios
Evidence accumulation
Foreground Rendering
Foundation Models
GAIA
GET3D
Gato
GaussianPro
Ghost Gym
Global rendering
Graph Visual Question Answering
HDMapGen
Hybrid mechanistic/ML modeling
HyperNeRF
ISP
Image Diff.
Incremental static 3D Gaussians
Instant-NGP
JAX
Knowledge distillation
LIDARSim
LINGO-1
LINGO-2
LLM-Agents
LLMs
LLaMA
LM-Nav
LoFTR + MAGSAC
LoRA fine-tuning
Lidar prior
Lidar ray drop
LidarRF
Lift-Splat-Shoot
Long-term estimation of action values
MARS
MTR
McLight (Lighting Estimation)
McNeRF (Multi-Camera NeRF)
Meta-actions
Mip-NeRF360
Motion prediction
MotionDiffuser
MotionLM
Multi-Agent Collaboration Framework
Multi-camera
Multiverse Transformer
NADE
NFL2
NLFE2+
NLL
NSG
NeRF
Nerfacto
Nerfies
NeuRAD
NeuRas
Neural feature field
NeuroNCAP simulation engine
Nocturne
Novel view synthesis
Open X-Embodiment
Open-loop evaluation
OpenDV-2k
PDM-Closed
PPO
PRISM-1
Perceiver IO
Prompt tuning
Proposal sampling
RMSE
RT-1
RVAE
Reference trajectories
ResNet
Retinal sensory noise
Rolling shutter handling
Rule-based QA generation
SLEDGE
SOLD-Net
SUDS
Scene Diffuser
Sensor embedding
Single-image 3D reconstruction
System 1 brain
System 2 brain
TEDi
TORCS
TeraSim
Traditional self-driving pipeline
TrafficSim
Trajectory planning
UniAD
UniSim
Urban Radiance Fields
V-trace off-policy correction
VAD
VISTA
VQ-GAN
Vehicle Deleting
Vehicle Motion
ViT
Vicuna-7B
View Adjustment
Visual looming
Volume rendering
WOSAC metrics
Waabi World
Waymax
autoregressive transformer
diffusion video decoder
nuPlan
nuScenes

Topics

3D Reconstruction · 3D Scene Reconstruction · AD Testing · Agent Inpainting · Agent Tail Realism · Amortized Diffusion · Autonomous Driving · Autonomous Driving Simulation · Behavioral Phenomena · Cognitive Mechanisms · Controllability · Counterfactual Testing · Data-driven approaches · Diffusion Models · Digital Twins · Domain Gap · Edge Cases · Embodied AI · End-to-end AI · End-to-end AV Stacks · Foundation Models · Gaussian Splatting · Generative AI · Generative Models · HD-Map Generation · Hard Constraints · Human Agent Models · Human Behavior Modeling · Human-in-the-loop · Knowledge Distillation · LIDAR Simulation · LLM-Agents · LLM-based Scene Control · Language Models · Latent Diffusion Transformer · Lidar · Log Replay Simulators · Machine Learning · Motion Generation · Multi-object Tracking · Multimodality · Neural Radiance Fields · Neural Rendering · Reinforcement Learning · Safety Validation · Safety-Critical Scenarios · Scene Generation · Scene Reconstruction · Scene Tensor · Scene Understanding · Sensor Simulation · Simulation · Simulation Validity · Traffic Simulation · Vision-Language Models

Notes

Open for commentary — connections to other work, critiques, follow-up reading.
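
One connection worth spelling out: the HUGS talk summary mentions a unicycle model for object motion optimization. The sketch below shows the standard kinematic unicycle model integrated with forward Euler steps. How HUGS parameterizes or optimizes the model is not stated in this listing, so the function name, the integration scheme, and the example controls are assumptions.

```python
import math


def unicycle_rollout(x, y, heading, speeds, yaw_rates, dt):
    """Integrate the kinematic unicycle model with forward Euler steps.

    State: position (x, y) and heading in radians.
    Controls per step: forward speed v and yaw rate omega.
    """
    poses = [(x, y, heading)]
    for v, omega in zip(speeds, yaw_rates):
        # The vehicle can only move along its current heading.
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        heading += omega * dt
        poses.append((x, y, heading))
    return poses


# Ten steps of straight driving at 5 m/s with a 0.1 s step.
poses = unicycle_rollout(
    0.0, 0.0, 0.0, speeds=[5.0] * 10, yaw_rates=[0.0] * 10, dt=0.1
)
```

Because the model has no sideways velocity, fitting tracked object poses to it rules out physically implausible lateral jumps, which is one plausible reason to use it as a motion prior for noisy tracks.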