CVPR 2024 Workshop on Autonomous Driving

Event: CVPR 2024 Workshop · Duration: 495 min

Abstract

The workshop opens with a welcome and overview of the CVPR 2024 Workshop on Autonomous Driving, detailing its structure, keynote speakers, challenges, and paper track. It then transitions into the first keynote address by Bolei Zhou from UCLA, who discusses the critical need for an open-source simulation ecosystem to bridge the widening gap between industry and academia in autonomous driving research. He introduces MetaDriveVerse, a platform designed to integrate diverse real-world datasets and advanced simulators, enabling the generation and testing of safety-critical scenarios and fostering new research opportunities in areas like multi-agent learning and urban mobility.

The next segment features presentations on cutting-edge research and challenges in autonomous driving perception. Sanja Fidler from NVIDIA discusses the development of next-gen AVs powered by foundation models, emphasizing a full-stack approach from hardware to simulation. Following this, the Argoverse Competitions 2024 are introduced, detailing challenges in (Un)supervised Scene Flow, End-to-End Forecasting, Multi-Agent Motion Forecasting, and 4D Occupancy Forecasting. Speakers present the problem setups, evaluation metrics, and winning methodologies for each challenge, highlighting innovative architectures, training strategies, and the importance of robust simulation for advancing autonomous driving technology.

A further segment provides an overview of the Argoverse 2024 forecasting challenges, covering both multi-agent and single-agent tasks. It details the problem setups, evaluation metrics, and highlights the winning methods, Lite-QCNet and FutureNet-LOF, along with their architectural innovations. The segment then introduces the 4D Occupancy Forecasting challenge, explaining its self-supervised nature, evaluation methodology, and presenting the winning (UnO) and runner-up (NLK) solutions, emphasizing the role of BEV feature maps as a crucial intermediate representation. Finally, it outlines future research directions for forecasting challenges, including end-to-end vs. stage-wise methods, leveraging trajectory data, data augmentation, and improving runtime for on-car deployment.

The following segment features two talks on scaling autonomous driving. Congcong Li from Waymo discusses their approach to building robust and reliable self-driving systems. Nick Roy from Zoox then presents their architecture, highlighting early sensor fusion, learned depth completion, and strategies for handling out-of-distribution scenarios using foundation models and scenario generation. The talks emphasize the importance of data, computational efficiency, and structured representations in developing advanced autonomous driving technologies.

The final segment features two talks. The first, by Alex Kendall from Wayve, discusses the path to embodied AI, highlighting the importance of simulation, multimodality, and scale in developing autonomous driving systems. He introduces Wayve’s novel simulation and language-prompted driving models. The second talk, by Georgios Pavlakos from the University of Texas at Austin, focuses on perceiving humans in 4D from monocular video, showcasing advancements in 3D human pose estimation, tracking, and scene reconstruction, even in complex and interactive scenarios.

Speakers

Bolei Zhou — UCLA
Vincent Casser — Waymo
Alex Liniger — The AI Institute
Jose Alvarez — NVIDIA
Maying Shen — NVIDIA
Nigamaa Nayakanti — Google DeepMind
Jannik Zürn — Wayve
Dragomir Anguelov — Waymo
John Leonard — MIT
Luc Van Gool — ETH Zurich/KU Leuven/INSAIT
Sanja Fidler — NVIDIA
James H. Hays — Georgia Tech
Kyle Vedder — Argoverse
Neehar Peri — Argoverse
Tarasha Khurana — Argoverse
Quinlan Sykora — Waabi, University of Toronto
Ben Agro — Waabi, University of Toronto
Nick Roy — Zoox/MIT
Congcong Li — Waymo
Alex Kendall — Wayve
Georgios Pavlakos — University of Texas at Austin
Vickie Ye
Jitendra Malik
Angjoo Kanazawa

Talks (16)

00:00:00 — Unidentified: Welcome and Workshop Overview
- An overview of the CVPR 2024 Workshop on Autonomous Driving, including keynote speakers, challenges, paper track, organizers, and schedule.
00:06:34 — Unidentified: Introduction to Keynote Speaker Bolei Zhou
- Introduction of Bolei Zhou from UCLA, highlighting his research focus on embodied AI systems, interpretability, generalizability, and safe methods aligned with humans.
00:08:55 — Bolei Zhou: Building an Open-Source Simulation Ecosystem for AI and Mobility Research
- This talk addresses the growing gap between the autonomous driving industry and academia, proposing an open-source simulation ecosystem (MetaDriveVerse) that integrates real-world data and advanced simulators to foster research and development in autonomous driving.
02:23:59 — Sanja Fidler: Next-Gen AV with Foundation Models
- Sanja Fidler discusses NVIDIA’s full-stack approach to AV development, emphasizing the transition to foundation model-empowered AI stacks, leveraging generative AI and advanced simulation for robust and safe autonomous driving.
02:26:50 — James H. Hays: Argoverse Competitions 2024
- James H. Hays introduces the Argoverse Competitions 2024, detailing the dataset’s features, its community-driven nature, and outlining the session’s goals to showcase state-of-the-art methods and encourage participation.
02:27:47 — Kyle Vedder: (Un)supervised Scene Flow
- Kyle Vedder presents the (Un)supervised Scene Flow challenge, introducing the ‘Bucket Normalized EPE’ metric for 2024 to better evaluate performance on vulnerable road users, and highlights Flow4D and ICP-Flow as winners for supervised and unsupervised tracks, respectively.
02:29:39 — Neehar Peri: End-to-End Forecasting
- Neehar Peri discusses the End-to-End Forecasting challenge, covering problem setups for 3D object detection, multi-object tracking, and end-to-end forecasting, and presents Team Le3DE2E and Team Valeo4Cast as winners for different sub-challenges.
02:32:53 — Tarasha Khurana: 4D Occupancy Forecasting
- Tarasha Khurana presents the 4D Occupancy Forecasting challenge, outlining its problem setup, metrics, and announcing Team Le3DE2E as the winner for their 3D UNet architecture with multi-scale feature fusion.
02:33:36 — James H. Hays: Conclusion of Argoverse Competitions Session
- James H. Hays concludes the Argoverse Competitions session, thanking the speakers and encouraging continued community engagement and participation in future challenges.
02:45:01 — Unidentified: Safe Planning Requires Forecasting Joint States
- This talk provides an overview of the Argoverse 2024 forecasting challenges, detailing the problem setups, evaluation metrics, and highlighting the winning methods for both multi-agent and single-agent forecasting tasks.
02:57:02 — Quinlan Sykora: A 4D Occupancy Foundation Model (UnO)
- This talk details UnO, the winning method for 4D Occupancy Forecasting, covering its architecture, how it derives unsupervised occupancy labels from LiDAR, and its strong performance in both LiDAR and occupancy forecasting tasks.
03:05:22 — Tarasha Khurana: 4D Occupancy Forecasting Challenge Runner-Up: Team NLK
- This talk presents the runner-up method, NLK, for the 4D Occupancy Forecasting Challenge, detailing its UNet-based architecture and experimental results, showing its effectiveness in dense occupancy estimation.
04:07:31 — Congcong Li: Buckle Up for the Future: Scaling Autonomous Driving
- This talk discusses the challenges and approaches to scaling autonomous driving technology, focusing on Waymo’s strategies for achieving robust and reliable self-driving systems.
04:07:31 — Nick Roy: Zoox
- This talk presents Zoox’s approach to building scalable and robust autonomy architectures, emphasizing early sensor fusion, learned depth completion, and handling out-of-distribution scenarios through foundation models and scenario generation.
06:52:42 — Alex Kendall: The Road to Embodied AI
- Alex Kendall discusses the challenges and opportunities in building embodied AI systems, particularly for autonomous driving, emphasizing the need for robust simulation, multimodal understanding, and scalable engineering.
08:37:42 — Georgios Pavlakos: Perceiving Humans in 4D
- Georgios Pavlakos presents research on perceiving humans in 4D (3D space + time) from monocular video, focusing on robust 3D pose estimation and tracking in challenging real-world scenarios.
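
The scene flow talk listed above motivates a class-aware metric that better reflects performance on vulnerable road users. As a rough illustration of why a plain average endpoint error can hide small classes, the sketch below compares a global mean EPE with a per-class error normalized by ground-truth speed. It is a simplified stand-in under stated assumptions, not the official Bucket Normalized EPE definition: the function names, the single `min_speed` threshold, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def endpoint_error(pred, gt):
    """Euclidean distance between a predicted and a ground-truth flow vector."""
    return math.dist(pred, gt)

def mean_epe(points):
    """Plain average endpoint error over all points, ignoring class."""
    return sum(endpoint_error(p["pred"], p["gt"]) for p in points) / len(points)

def class_normalized_epe(points, min_speed=0.4):
    """Hypothetical per-class error for moving points.

    For each class, the summed endpoint error is divided by the summed
    ground-truth speed, so 1.0 means "as wrong as predicting no motion".
    Points slower than min_speed (an assumed threshold) are skipped.
    """
    err = defaultdict(float)
    speed = defaultdict(float)
    for p in points:
        s = math.hypot(*p["gt"])  # ground-truth flow magnitude
        if s < min_speed:
            continue
        err[p["cls"]] += endpoint_error(p["pred"], p["gt"])
        speed[p["cls"]] += s
    return {c: err[c] / speed[c] for c in err}

# Toy scene (made-up numbers): many well-predicted car points,
# and a few pedestrian points whose motion is missed entirely.
points = (
    [{"cls": "car", "gt": (1.0, 0.0), "pred": (0.95, 0.0)} for _ in range(98)]
    + [{"cls": "pedestrian", "gt": (1.0, 0.0), "pred": (0.0, 0.0)} for _ in range(2)]
)

overall = mean_epe(points)
per_class = class_normalized_epe(points)
print(f"mean EPE: {overall:.3f}")
for cls, value in sorted(per_class.items()):
    print(f"{cls}: normalized error {value:.2f}")
```

In this toy scene the global mean stays low (about 0.07) because car points dominate, while the pedestrian class has a normalized error of 1.0, meaning its motion was missed completely. That gap is the kind of failure a class-aware metric is meant to expose.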

Key Takeaways

The autonomous driving research community benefits from open-source data and simulation tools to bridge the gap between industry and academia.
MetaDriveVerse provides a comprehensive open-source ecosystem for data-driven simulation, scenario generation, and testing of autonomous driving systems.
Generative AI and adversarial training techniques can be leveraged to create diverse and safety-critical scenarios, improving the robustness of autonomous driving policies.
Future research in mobility extends beyond roads to public urban spaces with diverse mobile machines, requiring new simulation environments like MetaUrban for embodied AI research.
Foundation models and generative AI are becoming central to developing next-generation autonomous driving systems, enabling more robust and scalable solutions.
Advanced simulation tools, including neural simulators and hybrid rendering, are crucial for accelerating AV development, testing, and ensuring safety.
Evaluation metrics are evolving to better assess performance on challenging scenarios and vulnerable road users, moving beyond traditional metrics like Average Endpoint Error.
Innovative architectures and training strategies, such as early temporal fusion in Flow4D and pre-training on large-scale motion forecasting data for MTR, are driving significant performance improvements in perception and forecasting tasks.
Multi-world forecasting is crucial for safe planning in autonomous driving, requiring joint future state predictions for all actors.
Self-supervised 4D occupancy forecasting offers a promising alternative to traditional point cloud forecasting, reducing reliance on costly human annotations.
Winning methods in both multi-agent and 4D occupancy forecasting challenges leverage attention mechanisms and efficient architectural designs, with BEV feature maps emerging as a key intermediate representation.
Future research in forecasting challenges will focus on end-to-end methods, leveraging large-scale trajectory data, data augmentation, and optimizing runtime for real-world deployment.
Early sensor fusion, combining data from cameras, lidars, and radars, is crucial for building a unified and robust representation of the environment in autonomous driving.
Handling out-of-distribution (OOD) scenarios is a significant challenge, and leveraging foundation models with aligned features can improve OOD detection while maintaining computational efficiency.
Structured autonomy architectures that combine learned components with symbolic planning offer benefits in interpretability, verifiability, and flexibility, allowing for better human oversight and complex mission execution.
Scenario generation using diffusion models and token conditioning provides a powerful tool for testing and validating autonomous systems in diverse and targeted environments, reducing reliance on costly real-world data collection.
Embodied AI, particularly in autonomous driving, requires robust simulation environments that can handle dynamic, deformable scenes and provide controllable, data-driven, and scalable testing.
Multimodality, especially integrating language with vision and action, is crucial for building trust, improving explainability, and enabling more intelligent and safe autonomous systems.
Advancements in 4D human perception allow for accurate 3D pose estimation, tracking, and reconstruction of humans and their interactions within complex real-world environments from monocular video.
Foundation models are emerging as a powerful paradigm for embodied AI, but they face unique challenges related to data quantity and diversity, training compute for high-dimensional video, and the complexities of physical embodiment and safety validation.
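
The takeaway on multi-world forecasting can be made concrete with a small worked example. The sketch below contrasts a per-actor (marginal) score with a joint score that must pick one "world" for all actors at once. The Brier-style penalty `(1 - p)^2` added to the best mode's final displacement error follows the commonly used Argoverse-style form; the joint variant, the function names, and the toy trajectories are assumptions for illustration, not the official `minBrierFDE` or `minWorldBrierFDE` implementations.

```python
import math

def fde(pred_traj, gt_traj):
    """Final displacement error: distance between the last waypoints."""
    return math.dist(pred_traj[-1], gt_traj[-1])

def min_brier_fde(modes, probs, gt_traj):
    """Per-actor score: lowest-FDE mode plus (1 - p)^2 for that mode."""
    errors = [fde(m, gt_traj) for m in modes]
    best = min(range(len(modes)), key=errors.__getitem__)
    return errors[best] + (1.0 - probs[best]) ** 2

def min_world_brier_fde(worlds, probs, gt_trajs):
    """Joint score (assumed form): each world holds one trajectory per actor.

    A single world is chosen for all actors by its mean FDE, and the same
    (1 - p)^2 penalty is applied to that world's probability.
    """
    errors = [
        sum(fde(t, g) for t, g in zip(world, gt_trajs)) / len(gt_trajs)
        for world in worlds
    ]
    best = min(range(len(worlds)), key=errors.__getitem__)
    return errors[best] + (1.0 - probs[best]) ** 2

# Toy case with two actors; trajectories are reduced to start and end points.
gt_a = [(0.0, 0.0), (10.0, 0.0)]   # actor A actually drives 10 m along x
gt_b = [(0.0, 0.0), (0.0, 10.0)]   # actor B actually drives 10 m along y
stay = [(0.0, 0.0), (0.0, 0.0)]    # "does not move" hypothesis

# World 0 gets A right and B wrong; world 1 gets B right and A wrong.
worlds = [[gt_a, stay], [stay, gt_b]]
probs = [0.5, 0.5]

marginal_a = min_brier_fde([w[0] for w in worlds], probs, gt_a)
marginal_b = min_brier_fde([w[1] for w in worlds], probs, gt_b)
joint = min_world_brier_fde(worlds, probs, [gt_a, gt_b])
print(marginal_a, marginal_b, joint)
```

Each actor looks well predicted on its own (0.25 each, all of it from the probability penalty), yet no single world is right for both actors, so the joint score is 5.25. This is the gap between marginal and joint evaluation that the safe-planning argument rests on.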

Methods / Models / Datasets Mentioned

3D UNet
4D-Occ
AB3DMOT
AlphaStar
Argoverse 1 & 2
BERT
BEVGen
BLIP-2 Q-Former
BehaviorGPT
Bucket Normalized EPE
CLIP
CLS
COMPASS
CarRacing
CARLA
Closed-Loop Adversarial Training (CAT)
Composite Detection Score (CDS)
DINOv2 Base
DINOv2 Small
DLSS3.0
DROID-SLAM
Dota 2
DrivingDiffusion
Ego-MLP
Flow4D
FocalFormer3D
Forecasting AP (APf)
FutureNet-LOF
GAIA
GLAMR
GMM1 VehicleLightsTrucks
Gato
GenAI
Ghost Gym
HMR
HMR 2.0
Higher Order Tracking Accuracy (HOTA)
ICP-Flow
ImplicitO
IoU
JAX library
LLaMA
LM-Nav
Lingo-1
Lingo-2
Lite-QCNet
MOTR
MTR (Motion Transformer)
MagicDrive
MetaDrive
MetaUrban
MobileNetV2
MultiPhys
NLK
NVIDIA DRIVE Sim
NVIDIA Drive Thor
NVIDIA Omniverse
Nemotron-4 340B
Neural Reconstruction Engine
OLS
Open X-Embodiment
OpenPilot
PARA-Drive
PARE
PDM Score
PRISM-1
Picasso
Precision
PyMAF-X
QCNet
ROS system
RT-1
RT-2
Recall
SEPT++
SLAHMR
SMARTS
ScenarioNet
SeFlow
SimGen
TrackFlow
TrafficGen
Transfuser
UnO
UniAD
VILA
VL Classifier
VQ-GAN
Waymo Open Dataset
Wayve GAIA-1
WayveScenes101
fVDB
minBrierFDE
minWorldBrierFDE
nuPlan
nuScenes
Waabi UniSim

Topics

3D Human Pose Estimation · 4D Occupancy Forecasting · 4D Reconstruction · Argoverse Dataset · Autonomous Driving · Autonomous Driving Challenges · Autonomy Architectures · BEV Feature Maps · Benchmarking · Driving Simulators · Embodied AI · Evaluation Metrics · Forecasting Metrics · Foundation Models · Generative AI · Human-Scene Interaction · Learned Depth Completion · LiDAR Forecasting · Motion Forecasting · Multi-Object Tracking · Multi-World Forecasting · Multimodality · Object Detection · Open-Source Simulation · Out-of-Distribution Detection · Real-World Data · Reinforcement Learning · Safety-Critical Scenarios · Scaling · Scenario Generation · Scene Flow · Self-supervised Learning · Sensor Fusion · Simulation · Single-Agent Forecasting

Notes

Open for commentary — connections to other work, critiques, follow-up reading.