CVPR 2024 Tutorial: End-to-End Autonomy: A New Era of Self-Driving

Event: CVPR 2024 Tutorial · Duration: 239 min

Abstract

This tutorial provides a comprehensive overview of end-to-end autonomy in self-driving vehicles, tracing its evolution from traditional modular systems to advanced AI-driven solutions. It delves into the core motivations behind this paradigm shift, highlighting the limitations of rule-based and data-driven approaches in handling the complexity and unpredictability of real-world driving scenarios. The tutorial showcases Wayve’s pioneering work in developing neural simulators like Ghost Gym and PRISM-1, which enable data-driven, scalable, and controllable simulation environments. It also explores the transformative potential of generative world models, such as GAIA-1 and VISTA, for data generation, understanding complex interactions, and robust policy learning. A significant portion is dedicated to the integration of Large Language Models (LLMs) in autonomous driving, emphasizing their role in enhancing explainability, reasoning, and multimodal perception through models like LINGO-1 and LINGO-2. The tutorial concludes by addressing critical challenges related to data scale, efficiency, safety, and regulatory ambiguities, while outlining future trends towards foundation models, zero-shot learning, and the ultimate goal of achieving human-level general intelligence in autonomous systems.

Speakers

Long Chen — Wayve
Jamie Shotton — Wayve
Hongyang Li — Shanghai AI Lab / University of Hong Kong
Nikhil Mohan — Wayve
Gianluca Corrado — Wayve
Oleg Sinavski — Wayve
Elahe Arani — Wayve

Talks (7)

00:00:00 — Long Chen: CVPR 2024 Tutorial: End-to-End Autonomy: A New Era of Self-Driving
- Introduction to the CVPR 2024 tutorial on End-to-End Autonomy, highlighting the shift towards end-to-end solutions in both industry and academia, with a detailed schedule of speakers and topics.
00:03:17 — Jamie Shotton: The Road to Embodied AI
- An overview of the accelerating progress in AI, emphasizing the shift towards embodied AI, particularly in autonomous driving, and introducing Wayve’s end-to-end approach to tackle the complexities of real-world driving.
00:51:00 — Hongyang Li: Could Foundation Models really resolve End-to-end Autonomy?
- An exploration of whether foundation models can truly resolve end-to-end autonomy, discussing the shift from traditional modular systems to end-to-end solutions, the role of world models, and the challenges of data scale, efficiency, and safety in autonomous driving.
01:25:08 — Nikhil Mohan: Towards a Neural Simulator: Offline evaluation of end-to-end autonomous vehicles
- A deep dive into the need for neural simulators for offline evaluation of end-to-end autonomous vehicles, highlighting the challenges of traditional AV stacks and introducing Wayve’s Ghost Gym and PRISM-1 as data-driven, scalable, and controllable simulation platforms.
02:05:07 — Gianluca Corrado: Learning Models of the World: Exploring Generative World Models in Autonomous Driving
- An exploration of generative world models in autonomous driving, tracing their evolution from early neural network-based approaches to modern transformer and diffusion models, and highlighting their potential for data generation, world understanding, and robust policy learning.
02:43:49 — Oleg Sinavski: Language Meet Driving: Empowering End-to-End Autonomous Driving with Large Language Models
- An exploration of how Large Language Models (LLMs) are empowering end-to-end autonomous driving, focusing on their role in explainability, reasoning, and multimodal integration, and discussing the challenges and future trends in developing robust, efficient, and trustworthy autonomous systems.
03:24:19 — Elahe Arani: Navigating the Future of End-to-End Autonomous Driving: Reflections and Future Directions
- A comprehensive overview of the challenges and future trends in end-to-end autonomous driving, emphasizing the need for robust benchmarking, advanced simulation, and efficient, interpretable, and safe systems, while highlighting the potential of foundation models and multimodal integration.

Key Takeaways

End-to-end autonomy is a paradigm shift in self-driving, moving away from traditional modular systems to integrated AI solutions that leverage raw sensor data for direct control.
Neural simulators like Ghost Gym and PRISM-1 are crucial for offline evaluation, enabling data-driven, scalable, and controllable testing of autonomous vehicles in complex, dynamic environments.
Generative world models (e.g., GAIA-1, VISTA) offer significant potential for data generation, understanding complex interactions, and robust policy learning, allowing for the simulation of diverse and challenging scenarios.
Large Language Models (LLMs) are increasingly integrated into autonomous driving systems to enhance explainability, reasoning, and multimodal perception, fostering trust and enabling more informed decision-making.
The future of end-to-end autonomous driving lies in the development of foundation models, multimodal integration, and efficient, interpretable, and safe systems that can adapt to novel situations and generalize across diverse environments.

Methods / Models / Datasets Mentioned

Ghost Gym
PRISM-1
UniAD
GAIA-1
VISTA
LINGO-1
LINGO-2
MCTS
GNN
GPT-3
GPT-3.5
GPT-4
GenAD
DriveGPT4
RAG-Driver
LMDrive
Nuro
Drive Anywhere
LangProp
LaMPilot
DOROTHIE
HiLM-D
RSSM
Dreamer v1
Dreamer v2
Dreamer v3
Phenaki
IRIS
Sora
V-JEPA
MILE
SEM2
Drive-WM
OccWorld
DriveWorld
DriveDreamer
TrafficBots
Panacea
SubjectDrive
LidarDM
Iso-Dream
UniWorld
MUVO
ViDAR
WoVoGen
Think2Drive
DriveAGI
DriveLM
OpenLane
ELM
DriveAdapter
MP3
CLIP
Q-Former
Flan-T5
LoRA
Llama
PID controller
Model Predictive Control
VQ-VAE
VQ-GAN
ViViT
COLMAP
Nerfstudio
NeRF
HyperNeRF
Nerfies
NSFF
iPhone
D-NeRF
DriveSim
CARLA
Waabi World
Waymo's Waymax
nuScenes
Waymo
Argoverse2
nuPlan
KITTI-360
Openpilot
CNN E2E
BDD-V
CILRS
Conditional IL
DArB
AGILEAD
SafeDAgger
Generalization
NMP
BDD-X
PlanT
Patch-wise Feature Extraction
Multimodal Foundation Model
Policy Network
Transformer Block
ST-Adapter
Visual Encoder
AvgPool
Linear
Large Language Model
Enumeration Module
Incorporation Module
HR Spatial Extractor
Cross-attention
MLP
Query Detection Module
Lingo-Judge
AgentsCoDriver
LLM-based Planner
Low-level Controller
Wayve's Vision Model
Vision Encoder
Visual Tokens
Prediction Headers
BEV Map
Traffic Light States
Waypoint
Target Point
Future Waypoints
Multi-view RGB & LiDAR
Navigation Instruction
Action Tokens
Instruction Following
Attention based chain-of-thought
CarLLaVA
WayveScenes101
OpenDriveLab
Tesla World Model
ADriver-I
Copilot4D
NeuRAD
Delphi

Topics

End-to-End Autonomy · Neural Simulators · Generative World Models · Large Language Models (LLMs) · Explainability · Reasoning · Multimodal Integration · Data Scale · Safety and Reliability · Foundation Models

Notes

Open for commentary — connections to other work, critiques, follow-up reading.