CVPR 2024 Workshop on Autonomous Driving
Event: CVPR 2024 Workshop · Duration: 495 min
Abstract
The workshop opens with a welcome and overview of the CVPR 2024 Workshop on Autonomous Driving, covering its structure, keynote speakers, challenges, and paper track. The first keynote, by Bolei Zhou from UCLA, argues for an open-source simulation ecosystem to bridge the widening gap between industry and academia in autonomous driving research. He introduces MetaDriveVerse, a platform designed to integrate diverse real-world datasets and advanced simulators, enabling the generation and testing of safety-critical scenarios and opening research opportunities in areas such as multi-agent learning and urban mobility.

Sanja Fidler from NVIDIA then discusses the development of next-generation AVs powered by foundation models, emphasizing a full-stack approach from hardware to simulation. Following this, the Argoverse Competitions 2024 are introduced, with challenges in (Un)supervised Scene Flow, End-to-End Forecasting, Multi-Agent Motion Forecasting, and 4D Occupancy Forecasting. Speakers present the problem setups, evaluation metrics, and winning methodologies for each challenge, highlighting innovative architectures, training strategies, and the importance of robust simulation for advancing autonomous driving technology.

A later session gives an overview of the Argoverse 2024 forecasting challenges, covering both multi-agent and single-agent tasks. It details the problem setups and evaluation metrics and highlights the winning methods, Lite-QCNet and FutureNet-LOF, along with their architectural innovations. It then introduces the 4D Occupancy Forecasting challenge, explaining its self-supervised nature and evaluation methodology and presenting the winning (UnO) and runner-up (NLK) solutions, with emphasis on BEV feature maps as a crucial intermediate representation. Finally, it outlines future research directions for forecasting challenges, including end-to-end vs. stage-wise methods, leveraging trajectory data, data augmentation, and improving runtime for on-car deployment.

Two talks address scaling autonomous driving. Congcong Li from Waymo discusses Waymo's approach to building robust and reliable self-driving systems. Nick Roy from Zoox then presents Zoox's architecture, highlighting early sensor fusion, learned depth completion, and strategies for handling out-of-distribution scenarios using foundation models and scenario generation. The talks emphasize the importance of data, computational efficiency, and structured representations in developing advanced autonomous driving technologies.

The final two talks cover embodied AI and human perception. Alex Kendall from Wayve discusses the path to embodied AI, highlighting the importance of simulation, multimodality, and scale in developing autonomous driving systems, and introduces Wayve’s novel simulation and language-prompted driving models. Georgios Pavlakos from the University of Texas at Austin focuses on perceiving humans in 4D from monocular video, showcasing advancements in 3D human pose estimation, tracking, and scene reconstruction, even in complex and interactive scenarios.
Speakers
- Bolei Zhou — UCLA
- Vincent Casser — Waymo
- Alex Liniger — The AI Institute
- Jose Alvarez — NVIDIA
- Maying Shen — NVIDIA
- Nigamaa Nayakanti — Google DeepMind
- Jannik Zürn — Wayve
- Dragomir Anguelov — Waymo
- John Leonard — MIT
- Luc Van Gool — ETH Zurich/KU Leuven/INSAIT
- Sanja Fidler — NVIDIA
- James H. Hays — Georgia Tech
- Kyle Vedder — Argoverse
- Neehar Peri — Argoverse
- Tarasha Khurana — Argoverse
- Quinlan Sykora — Waabi, University of Toronto
- Ben Agro — Waabi, University of Toronto
- Nick Roy — Zoox/MIT
- Congcong Li — Waymo
- Alex Kendall — Wayve
- Georgios Pavlakos — University of Texas at Austin
- Vickie Ye
- Jitendra Malik
- Angjoo Kanazawa
Talks (16)
- 00:00:00 — Unidentified: Welcome and Workshop Overview
- An overview of the CVPR 2024 Workshop on Autonomous Driving, including keynote speakers, challenges, paper track, organizers, and schedule.
- 00:06:34 — Unidentified: Introduction to Keynote Speaker Bolei Zhou
- Introduction of Bolei Zhou from UCLA, highlighting his research focus on embodied AI systems, interpretability, generalizability, and safe methods aligned with humans.
- 00:08:55 — Bolei Zhou: Building an Open-Source Simulation Ecosystem for AI and Mobility Research
- This talk addresses the growing gap between autonomous driving industry and academia, proposing an open-source simulation ecosystem (MetaDriveVerse) that integrates real-world data and advanced simulators to foster research and development in autonomous driving.
- 02:23:59 — Sanja Fidler: Next-Gen AV with Foundation Models
- Sanja Fidler discusses NVIDIA’s full-stack approach to AV development, emphasizing the transition to foundation model-empowered AI stacks, leveraging generative AI and advanced simulation for robust and safe autonomous driving.
- 02:26:50 — James H. Hays: Argoverse Competitions 2024
- James H. Hays introduces the Argoverse Competitions 2024, detailing the dataset’s features, its community-driven nature, and outlining the session’s goals to showcase state-of-the-art methods and encourage participation.
- 02:27:47 — Kyle Vedder: (Un)supervised Scene Flow
- Kyle Vedder presents the (Un)supervised Scene Flow challenge, introducing the ‘Bucket Normalized EPE’ metric for 2024 to better evaluate performance on vulnerable road users, and highlights Flow4D and ICP-Flow as winners for supervised and unsupervised tracks, respectively.
- 02:29:39 — Neehar Peri: End-to-End Forecasting
- Neehar Peri discusses the End-to-End Forecasting challenge, covering problem setups for 3D object detection, multi-object tracking, and end-to-end forecasting, and presents Team Le3DE2E and Team Valeo4Cast as winners for different sub-challenges.
- 02:32:53 — Tarasha Khurana: 4D Occupancy Forecasting
- Tarasha Khurana presents the 4D Occupancy Forecasting challenge, outlining its problem setup, metrics, and announcing Team Le3DE2E as the winner for their 3D UNet architecture with multi-scale feature fusion.
- 02:33:36 — James H. Hays: Conclusion of Argoverse Competitions Session
- James H. Hays concludes the Argoverse Competitions session, thanking the speakers and encouraging continued community engagement and participation in future challenges.
- 02:45:01 — Unknown: Safe Planning Requires Forecasting Joint States
- This talk provides an overview of the Argoverse 2024 forecasting challenges, detailing the problem setups, evaluation metrics, and highlighting the winning methods for both multi-agent and single-agent forecasting tasks.
- 02:57:02 — Quinlan Sykora: A 4D Occupancy Foundation Model (UnO)
- This talk details UnO, the winning method for 4D Occupancy Forecasting, covering its architecture, how it derives unsupervised occupancy labels from LiDAR, and its strong performance in both LiDAR and occupancy forecasting tasks.
- 03:05:22 — Tarasha Khurana: 4D Occupancy Forecasting Challenge Runner-Up: Team NLK
- This talk presents the runner-up method, NLK, for the 4D Occupancy Forecasting Challenge, detailing its UNet-based architecture and experimental results, showing its effectiveness in dense occupancy estimation.
- 04:07:31 — Congcong Li: Buckle Up for the Future: Scaling Autonomous Driving
- This talk discusses the challenges and approaches to scaling autonomous driving technology, focusing on Waymo’s strategies for achieving robust and reliable self-driving systems.
- 04:07:31 — Nick Roy: Zoox
- This talk presents Zoox’s approach to building scalable and robust autonomy architectures, emphasizing early sensor fusion, learned depth completion, and handling out-of-distribution scenarios through foundation models and scenario generation.
- 06:52:42 — Alex Kendall: The Road to Embodied AI
- Alex Kendall discusses the challenges and opportunities in building embodied AI systems, particularly for autonomous driving, emphasizing the need for robust simulation, multimodal understanding, and scalable engineering.
- 08:37:42 — Georgios Pavlakos: Perceiving Humans in 4D
- Georgios Pavlakos presents research on perceiving humans in 4D (3D space + time) from monocular video, focusing on robust 3D pose estimation and tracking in challenging real-world scenarios.
Key Takeaways
- The autonomous driving research community benefits from open-source data and simulation tools to bridge the gap between industry and academia.
- MetaDriveVerse provides a comprehensive open-source ecosystem for data-driven simulation, scenario generation, and testing of autonomous driving systems.
- Generative AI and adversarial training techniques can be leveraged to create diverse and safety-critical scenarios, improving the robustness of autonomous driving policies.
- Future research in mobility extends beyond roads to public urban spaces with diverse mobile machines, requiring new simulation environments like MetaUrban for embodied AI research.
- Foundation models and generative AI are becoming central to developing next-generation autonomous driving systems, enabling more robust and scalable solutions.
- Advanced simulation tools, including neural simulators and hybrid rendering, are crucial for accelerating AV development, testing, and ensuring safety.
- Evaluation metrics are evolving to better assess performance on challenging scenarios and vulnerable road users, moving beyond traditional metrics like Average Endpoint Error.
- Innovative architectures and training strategies, such as early temporal fusion in Flow4D and pre-training on large-scale motion forecasting data for MTR, are driving significant performance improvements in perception and forecasting tasks.
- Multi-world forecasting is crucial for safe planning in autonomous driving, requiring joint future state predictions for all actors.
- Self-supervised 4D occupancy forecasting offers a promising alternative to traditional point cloud forecasting, reducing reliance on costly human annotations.
- Winning methods in both multi-agent and 4D occupancy forecasting challenges leverage attention mechanisms and efficient architectural designs, with BEV feature maps emerging as a key intermediate representation.
- Future research in forecasting challenges will focus on end-to-end methods, leveraging large-scale trajectory data, data augmentation, and optimizing runtime for real-world deployment.
- Early sensor fusion, combining data from cameras, lidars, and radars, is crucial for building a unified and robust representation of the environment in autonomous driving.
- Handling out-of-distribution (OOD) scenarios is a significant challenge, and leveraging foundation models with aligned features can improve OOD detection while maintaining computational efficiency.
- Structured autonomy architectures that combine learned components with symbolic planning offer benefits in interpretability, verifiability, and flexibility, allowing for better human oversight and complex mission execution.
- Scenario generation using diffusion models and token conditioning provides a powerful tool for testing and validating autonomous systems in diverse and targeted environments, reducing reliance on costly real-world data collection.
- Embodied AI, particularly in autonomous driving, requires robust simulation environments that can handle dynamic, deformable scenes and provide controllable, data-driven, and scalable testing.
- Multimodality, especially integrating language with vision and action, is crucial for building trust, improving explainability, and enabling more intelligent and safe autonomous systems.
- Advancements in 4D human perception allow for accurate 3D pose estimation, tracking, and reconstruction of humans and their interactions within complex real-world environments from monocular video.
- Foundation models are emerging as a powerful paradigm for embodied AI, but they face unique challenges related to data quantity and diversity, training compute for high-dimensional video, and the complexities of physical embodiment and safety validation.
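The takeaway about metrics moving beyond Average Endpoint Error can be illustrated with a small sketch. Average EPE is the mean Euclidean distance between predicted and ground-truth flow vectors, so a scene dominated by static background points can score well even when a moving pedestrian is missed entirely. The bucketed variant below is only an assumption about the general idea behind a class- and speed-aware metric: the function names, the bucket edges, and the normalization by mean ground-truth speed are hypothetical and are not the challenge's official definition of Bucket Normalized EPE.

```python
import math
from collections import defaultdict


def endpoint_error(pred, gt):
    """Euclidean distance between one predicted and one ground-truth flow vector."""
    return math.dist(pred, gt)


def average_epe(preds, gts):
    """Plain Average Endpoint Error: mean per-point error over all points."""
    errors = [endpoint_error(p, g) for p, g in zip(preds, gts)]
    return sum(errors) / len(errors)


def bucketed_normalized_epe(preds, gts, classes, speed_edges=(0.5, 2.0)):
    """Hypothetical bucketed variant (illustrative, not the official metric).

    Points are grouped by (class, speed bucket), where the speed bucket is the
    number of edges in `speed_edges` that the ground-truth flow magnitude
    reaches. Each group's mean error is divided by its mean ground-truth speed;
    groups with zero mean speed report the unnormalized mean error.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # error sum, speed sum, count
    for p, g, c in zip(preds, gts, classes):
        speed = math.hypot(*g)
        bucket = sum(speed >= edge for edge in speed_edges)
        acc = sums[(c, bucket)]
        acc[0] += endpoint_error(p, g)
        acc[1] += speed
        acc[2] += 1

    result = {}
    for key, (err_sum, speed_sum, n) in sums.items():
        mean_err = err_sum / n
        mean_speed = speed_sum / n
        result[key] = mean_err / mean_speed if mean_speed > 0 else mean_err
    return result


# Toy scene: nine static background points predicted perfectly, and one
# pedestrian moving 1.0 unit per frame whose motion is predicted as zero.
preds = [(0.0, 0.0, 0.0)] * 10
gts = [(0.0, 0.0, 0.0)] * 9 + [(1.0, 0.0, 0.0)]
classes = ["background"] * 9 + ["pedestrian"]

overall = average_epe(preds, gts)
per_bucket = bucketed_normalized_epe(preds, gts, classes)
print(overall)
print(per_bucket)
```

In this toy scene the average error is small because the background dominates, while the pedestrian bucket reports that the whole motion was missed. That contrast is the motivation the takeaway describes.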
Methods / Models / Datasets Mentioned
3D UNet · 4D-Occ · AB3DMOT · AlphaStar · Argoverse 1 & 2 · BERT · BEVGen · BLIP-2 Q-Former · BehaviorGPT · Bucket Normalized EPE · CLIP · CLS · COMPASS · CarRacing · Carla · Closed-Loop Adversarial Training (CAT) · Composite Detection Score (CDS) · DINOv2 Base · DINOv2 Small · DLSS3.0 · DROID-SLAM · Dota2 · DrivingDiffusion · Ego-MLP · Flow4D · FocalFormer3D · Forecasting AP (APf) · FutureNet-LOF · GAIA · GLAMR · GMM1 VehicleLightsTrucks · Gato · GenAI · Ghost Gym · HMR · HMR 2.0 · Higher Order Tracking Accuracy (HOTA) · ICP-Flow · ImplicitO · IoU · JAX library · LLaMA · LM-Nav · Lingo-1 · Lingo-2 · Lite-QCNet · MOTR · MTR (Motion Transformer) · MagicDrive · MetaDrive · MetaUrban · MobileNetV2 · MultiPhys · NLK · NVIDIA DRIVE Sim · NVIDIA Drive Thor · NVIDIA Omniverse · Nemotron-4 340B · Neural Reconstruction Engine · OLS · Open X-Embodiment · OpenPilot · PARA-Drive · PARE · PDM Score · PRISM-1 · Picasso · Precision · PyMAF-X · QCNet · ROS system · RT-1 · RT-2 · Recall · SEPT++ · SLAHMR · SMARTS · ScenarioNet · SeFlow · SimGen · TrackFlow · TrafficGen · Transfuser · UnO · UniAD · VILA · VL Classifier · VQ-GAN · Waymo Open Dataset · Wayve GAIA-1 · WayveScenes101 · fVDB · minBrierFDE · minWorldBrierFDE · nuPlan · nuScenes · waabi UniSim
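Among the metrics listed above, minBrierFDE rewards forecasters that are both accurate and well calibrated. As a minimal sketch, assuming the Argoverse-style definition: take the candidate trajectory whose final point is closest to the ground-truth endpoint (minFDE), then add a Brier penalty of (1 - p)^2, where p is the probability the model assigned to that candidate. The function names and the toy trajectories are hypothetical, and the multi-world variant (minWorldBrierFDE) is not covered here.

```python
import math


def min_fde(forecasts, gt_endpoint):
    """Return (index, error) of the forecast whose final point is closest to the truth."""
    errors = [math.dist(traj[-1], gt_endpoint) for traj in forecasts]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


def brier_min_fde(forecasts, probs, gt_endpoint):
    """minFDE plus a Brier penalty on the probability given to the best forecast."""
    best, fde = min_fde(forecasts, gt_endpoint)
    return fde + (1.0 - probs[best]) ** 2


# Two candidate 2D trajectories for one agent (illustrative values only).
forecasts = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],  # continues straight
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # veers diagonally
]
gt_endpoint = (2.0, 0.5)

# Same trajectories, different confidence assignments.
confident = brier_min_fde(forecasts, [0.8, 0.2], gt_endpoint)
misplaced = brier_min_fde(forecasts, [0.2, 0.8], gt_endpoint)
print(confident, misplaced)
```

Both calls share the same best trajectory and the same displacement error. Only the probability placed on that trajectory changes, so the second call is penalized for putting most of its confidence on the worse candidate.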
Topics
3D Human Pose Estimation · 4D Occupancy Forecasting · 4D Reconstruction · Argoverse Dataset · Autonomous Driving · Autonomous Driving Challenges · Autonomy Architectures · BEV Feature Maps · Benchmarking · Driving Simulators · Embodied AI · Evaluation Metrics · Forecasting Metrics · Foundation Models · Generative AI · Human-Scene Interaction · Learned Depth Completion · LiDAR Forecasting · Motion Forecasting · Multi-Object Tracking · Multi-World Forecasting · Multimodality · Object Detection · Open-Source Simulation · Out-of-Distribution Detection · Real-World Data · Reinforcement Learning · Safety-Critical Scenarios · Scaling · Scenario Generation · Scene Flow · Self-supervised Learning · Sensor Fusion · Simulation · Single-Agent Forecasting
Notes
Open for commentary — connections to other work, critiques, follow-up reading.