Foundation Models for Autonomous Systems Workshop

Event: CVPR 2024 Workshop · Duration: 561 min

Abstract

This segment introduces the CVPR 2024 workshop on “Foundation Models for Autonomous Systems,” highlighting the motivation, grand challenge statistics, and an overview of the schedule and speakers. It then features a detailed talk by Sergey Levine on “Robotic Foundation Models,” where he discusses the evolution of AI from task-specific models to large pre-trained foundation models and explores how this paradigm can be applied to robotics for navigation and manipulation tasks. The segment concludes with the introduction of Sherry Yang’s talk on “Foundation Models as Real-World Simulators.”

This segment features Alex Kendall, Co-founder & CEO of Wayve, discussing ‘The Road to Embodied AI’. He highlights the challenges and opportunities in developing embodied AI for autonomous driving, emphasizing Wayve’s end-to-end learning system. Key areas of focus include simulation for generating diverse data, multimodal training for improved understanding, and scaling data and compute to achieve generalizable and safe autonomous driving. The talk also covers Wayve’s open-source contributions like WayveScenes101 and their LINGO-2 model for language-prompted driving.

This video segment introduces two challenges related to autonomous driving: NAVSIM (Non-Reactive Autonomous Vehicle Simulation and Benchmarking) and Predictive World Model. It features presentations of the top solutions for each challenge. NVIDIA’s Hydra-MDP, the 1st place NAVSIM solution, utilizes multi-target Hydra-Distillation for end-to-end multimodal planning. Huawei Noah & CUHK-SZ’s D²-World, the 2nd place Predictive World Model solution, proposes an efficient world model through decoupled dynamic flow for 4D occupancy forecasting. The segment highlights the importance of robust metrics and efficient models for advancing autonomous driving technology.

This segment features presentations from the CVPR 2024 workshop, highlighting advancements in computer vision for autonomous systems and embodied AI. Talks cover topics such as 3D occupancy and flow prediction using adaptive view transformations, end-to-end autonomous driving with vision language models, and the development of visual foundation models. Key themes include self-supervised learning for robust 3D perception, synthetic data generation, object-centric representations, and interpretability of video transformers, showcasing cutting-edge research in these fields.

This segment features presentations from the top-performing teams in the OpenDriveLab challenges, including LGmap’s solution for vectorized HD map construction, InternVL for DriveLM’s winning approach, and NVIDIA’s multimodal LLM solution for the Driving with Language Challenge. It also covers the CARLA Autonomous Driving Challenge, detailing its environments, metrics, and top scorers. The segment concludes with a deep dive into foundation models in the automotive industry, showcasing self-supervised learning techniques for both image and LiDAR data.

This video segment features three distinct talks related to foundation models in autonomous systems. The first speaker concludes a discussion on ScaLR simplifications, their performance impact, and the influence of network scale and dataset diversity, followed by future perspectives and a Q&A. The second speaker, Ted Xiao from Google DeepMind, presents on ‘What’s Missing for Robotics-First Foundation Models?’, identifying critical gaps such as positive transfer from scale, steerability, and scalable evaluation in the current landscape of robotics foundation models. The final speaker, Li Chen from OpenDriveLab, introduces ‘Visual World Models as “Foundation” Models for Autonomous Systems’, showcasing the OpenDV-2K dataset and the GenAD video prediction model for various autonomous driving prediction tasks.

This segment concludes a presentation on visual world models (Vista, ViDAR, MPI) for autonomous driving and robotics, followed by a Q&A session. The core of the segment then transitions into a panel discussion titled ‘Challenges in Building Foundations Models for Embodied AI,’ moderated by Anthony Hu. The panel covers three main topics: technical challenges in building foundation models (data volume, diversity, quality, algorithmic advances), real-world challenges in embodied AI (specialization, latency, performance measurement), and integrating AI systems with humans (large-scale deployment, safety, trust). The discussion highlights the importance of diverse data, robust evaluation, and the need for collaboration to advance embodied AI.

Speakers

Hongyang Li — Shanghai AI Lab
Sergey Levine — UC Berkeley, Physical Intelligence
Sherry Yang — Google DeepMind, UC Berkeley
Alex Kendall — Wayve
NAVSIM Organizers — NAVSIM Organizers
Zhenxin Li — Team NVIDIA
Zetong Yang — OpenDriveLab
Haiming Zhang — Huawei Noah & CUHK-SZ Team
Dubing Chen — University of Macau, INCEPTIO
Wencheng Han — University of Macau, INCEPTIO
Jin Fang — University of Macau, INCEPTIO
Jianbing Shen — University of Macau, INCEPTIO
Panqu Wang — ZERON
Rareș Ambruș — Toyota Research Institute
Kuang Wu — Lange Technology
Sulei Nian — Lange Technology
Can Shen — Lange Technology
Chuan Yang — Lange Technology
Zhanbin Li — Lange Technology
Zhiqi Li — Nanjing University
Tong Lu — Nanjing University
Zhiqi Ding — NVIDIA
Matt — CARLA Simulator Development Team
Katrin Renz — University of Tübingen
Andrei Bursuc — valeo.ai
Ted Xiao — Google DeepMind
Li Chen — OpenDriveLab at Shanghai AI Lab
Anthony Hu — Wayve (Moderator)
Christos Sakaridis — ETH Zürich

Talks (22)

00:28:45 — Sergey Levine: Robotic Foundation Models
- Discusses the concept and challenges of building robotic foundation models, contrasting them with NLP models, and exploring approaches for navigation and manipulation.
01:20:07 — Alex Kendall: The Road to Embodied AI
- This talk introduces Wayve’s approach to building generalizable embodied AI for autonomous driving, focusing on simulation, multimodality, and scale, and highlighting their end-to-end learning system and safety by design.
02:40:19 — NAVSIM Organizers: NAVSIM: Non-Reactive Autonomous Vehicle Simulation and Benchmarking
- Introduction to the NAVSIM challenge, highlighting the inadequacy of simple displacement error metrics and introducing a new PDM score for robust evaluation of driving behavior, announcing NVIDIA and ZERON as leading teams.
02:41:07 — Zhenxin Li: Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation – 1st Place Solution of End-to-end Driving at Scale
- Presentation of NVIDIA’s 1st place solution, Hydra-MDP, for the NAVSIM challenge, detailing its multi-modal planning with multi-target learning paradigm, Transfuser-based perception, VAD v2 + Distance-based Imitation Loss trajectory decoder, and multi-target Hydra-Distillation approach.
02:42:50 — Zetong Yang: Predictive World Model
- Introduction to the Predictive World Model challenge, focusing on visual point cloud forecasting to enable intelligent agents to perceive, react, and predict, using Chamfer distance as a metric and emphasizing cross-modality learning and training efficiency.
02:43:35 — Haiming Zhang: D²-World: An Efficient World Model through Decoupled Dynamic Flow
- Presentation of D²-World, the 2nd place solution for the Predictive World Model challenge, which reformulates visual point cloud forecasting as a 4D occupancy forecasting task using a two-stage pipeline with BEVDET-Occ, SALT blocks, and flow-guided warping & refinement.
04:01:38 — Dubing Chen, Wencheng Han, Jin Fang, Jianbing Shen: AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction
- This talk introduces AdaOcc, a method that secured 2nd place in the CVPR 2024 Autonomous Grand Challenge for 3D Occupancy and Flow, utilizing a pipeline with image backbone, depth/context heads, view transformation, 3D U-Net, and task-specific heads.
04:01:53 — Panqu Wang: End-to-End Autonomous Driving Using Vision Language Model
- This presentation details ZERON’s 2nd place camera-only solution for end-to-end autonomous driving, leveraging a Vision Language Model and synthetic data for improved perception and robustness in traffic sign recognition.
04:26:30 — Rareș Ambruș: Visual Foundation Models for Embodied Applications
- This talk from Toyota Research Institute explores the development of visual foundation models for embodied AI, focusing on robust 3D perception, motion capture, object-centric representations, and interpretability of video transformers.
05:23:06 — Kuang Wu: LGmap: Local-to-Global Mapping Network for Online Long-Range Vectorized HD Map Construction
- This talk presents LGmap, a solution for vectorized HD map construction, detailing its pipeline, view transformation, temporal fusion strategies, and hierarchical pedestrian representation, along with ablation studies and challenge results.
05:36:36 — Zhiqi Li: InternVL for DriveLM: CVPR 2024 Autonomous Grand Challenge
- This presentation introduces InternVL for DriveLM, the winning solution for the DriveLM challenge, detailing its architecture based on InternVL-1.5, data processing for multi-view images, fine-tuning strategy, and evaluation results using GPT as a referee.
05:43:36 — Zhiqi Ding: Driving with Multimodal LLM
- This presentation details NVIDIA’s second-place solution for the Driving with Language Challenge, highlighting the use of multimodal LLMs like Eagle and OmniDrive for tasks such as important object identification, decision-making, motion prediction, and grounding, and showcasing their performance in various traffic scenarios.
05:55:26 — Katrin Renz: CarLLaVA: Vision language models for camera-only closed-loop driving
- This talk introduces CarLLaVA, a vision language model for camera-only closed-loop driving, emphasizing its architecture, training methodology, and performance in the CARLA challenge, highlighting the importance of generalization, reasoning, and explainability in autonomous driving.
06:05:16 — Andrei Bursuc: Foundation models in the automotive industry
- This presentation explores the application of foundation models in the automotive industry, detailing Valeo’s history in ADAS, their sensor suite, scalable system architecture, and research into self-supervised learning for image and LiDAR data, including methods like POP-3D, BEVContrast, OccFeat, and ScaLR.
06:43:58 — Ted Xiao: What’s Missing for Robotics-First Foundation Models?
- This talk discusses the current state of robotics foundation models, highlighting missing pieces like positive transfer from scale, steerability, and scalable evaluations, and proposes future directions for robotics research.
07:27:27 — Li Chen: Visual World Models as “Foundation” Models for Autonomous Systems
- This talk introduces the concept of visual world models as foundation models for autonomous systems, detailing the OpenDV-2K dataset, the GenAD video prediction model, and its application in zero-shot generalization, language-conditioned prediction, and action-conditioned prediction for driving scenarios.
08:00:47 — Hongyang Li: Tasks Action-conditioned Prediction (Simulation)
- Concluding remarks on Vista, ViDAR, and MPI models, their applications in driving and robotics, and future directions, emphasizing data scale, model capabilities, and application in policy prediction and reward learning.
08:23:49 — Anthony Hu: Q&A on Visual World Models
- Q&A session addressing data sources (OpenDV, YouTube), model training (LoRA), and challenges in applying visual world models to robotics, including the ‘chicken-and-egg’ problem of data scaling.
08:31:34 — Anthony Hu: Panel / Debate: Challenges in Building Foundations Models for Embodied AI - Topic #1: Technical challenges in building a foundation model
- Panel discussion on the technical challenges of building foundation models, focusing on data volume, diversity, quality, and algorithmic advances, with emphasis on the need for diverse, large-scale, and collaborative data collection.
09:00:37 — Anthony Hu: Panel / Debate: Challenges in Building Foundations Models for Embodied AI - Topic #2: Real-world challenges in embodied AI
- Panel discussion on real-world challenges in embodied AI, including specializing foundation models, real-time latency constraints, measuring performance, and the role of simulation and physical realism.
09:11:09 — Anthony Hu: Panel / Debate: Challenges in Building Foundations Models for Embodied AI - Topic #3: Integrating AI systems with humans in everyday life
- Panel discussion on integrating AI systems with humans, covering large-scale deployment of autonomous systems, safety-critical scenarios, and building public trust, emphasizing robust evaluation and communication.
00:56:28 — Sherry Yang: Foundation Models as Real-World Simulators
- Introduces the idea of using internet-scale data to learn real-world simulators (world models) for decision-making in robotics.
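
The Predictive World Model entry above names Chamfer distance as the metric for point cloud forecasting. As a rough illustration only, the sketch below computes one common symmetric form: the mean nearest-neighbour Euclidean distance from predicted points to ground-truth points, plus the same in the reverse direction. The exact variant used by the challenge (squared vs. unsquared distances, sum vs. mean) is not stated in this summary, so that choice, the function name `chamfer_distance`, and the toy point sets are all assumptions.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed variant: mean unsquared nearest-neighbour distance in each
    direction, summed. Brute force O(N*M), fine for small examples only.
    """
    if not pred or not target:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # For every point in src, take the distance to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(pred, target) + mean_nearest(target, pred)


# Toy point clouds (hypothetical data, not from the challenge).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

print(chamfer_distance(pred, pred))    # identical sets give zero
print(chamfer_distance(pred, target))  # one point displaced by 1 along z
```

Real forecasting benchmarks evaluate clouds with many thousands of points, where a spatial index or GPU batching would replace the brute-force nearest-neighbour search shown here.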

Key Takeaways

Foundation models, initially successful in NLP and vision, are being explored for autonomous systems to address data scarcity and improve generalization.
Large-scale, diverse datasets, including real-world and synthetic data, are crucial for training effective foundation models in robotics.
Cross-embodiment learning, where models are trained on data from various robot platforms and tasks, shows promise for improving generalization and performance.
The development of open-source robotic foundation models and standards (like IEEE P3474) is fostering collaboration and accelerating research in the field.
Embodied AI, particularly in autonomous driving, represents the next frontier in AI, requiring robust world models and decision-making algorithms.
Simulation is crucial for developing and validating embodied AI systems, enabling the generation of diverse synthetic data for training and testing edge cases.
Multimodal training, integrating vision, language, and action, is essential for building intelligent, explainable, and trustworthy autonomous systems.
Wayve’s end-to-end approach, leveraging foundation models and safety by design, aims to unlock safe and generalizable autonomous driving by learning from diverse data at scale.
Traditional displacement error metrics are insufficient for evaluating autonomous driving policies, necessitating more comprehensive simulation-based metrics like the PDM score.
Multi-target learning and distillation from rule-based teachers can significantly improve the performance and safety of end-to-end autonomous driving systems.
Predictive world models are crucial for enabling intelligent agents to understand and react to their environment, with 4D occupancy forecasting emerging as a key task.
Efficient architectures and training strategies are vital for handling the large datasets and complex tasks involved in developing robust autonomous driving systems.
Advanced 3D occupancy and flow prediction models are achieving high performance in autonomous driving challenges by integrating various perception and fusion techniques.
End-to-end autonomous driving solutions are evolving towards simpler, more generalizable vision language models, leveraging synthetic data for training and robustness.
Visual foundation models are being developed to address key challenges in embodied AI, focusing on robust 3D perception, motion understanding, and object-centric representations.
Self-supervised learning and novel neural field architectures are crucial for efficient and accurate 3D scene reconstruction, depth estimation, and object tracking across diverse domains.
The OpenDriveLab challenges foster innovation in autonomous driving, with solutions like LGmap demonstrating advancements in vectorized HD map construction and InternVL for DriveLM showcasing the power of vision language models.
Multimodal LLMs are proving highly effective in autonomous driving tasks, enabling advanced reasoning, decision-making, and explainability in complex traffic scenarios.
Self-supervised learning is a crucial paradigm for developing robust foundation models in the automotive industry, allowing for efficient pre-training on vast amounts of unlabeled data and adaptation to diverse sensor modalities and downstream tasks.
The CARLA Autonomous Driving Challenge provides a valuable platform for evaluating and accelerating progress in autonomous driving research, with top-performing teams leveraging sophisticated architectures and training strategies.
Simplifications in model architecture and loss functions can lead to improved performance and efficiency in multimodal learning for autonomous systems.
Achieving general-purpose robotics requires addressing missing pieces like positive transfer from scale, improved steerability and promptability, and scalable evaluation methodologies.
Future advancements in robotics foundation models will likely involve leveraging large-scale internet data, developing motion-centric representations, and improving data interoperability across different robot embodiments and tasks.
Visual world models, trained on massive datasets like OpenDV-2K, show promise in zero-shot generalization and language-conditioned prediction for autonomous driving, indicating a path towards more intelligent and reliable autonomous systems.
Large-scale, diverse, and high-quality datasets are crucial for training generalized visual world models, especially for autonomous driving and robotics, with collaborative data collection being a key challenge.
Self-supervised learning and efficient training strategies (like LoRA) enable models to learn robust representations and generalize across various tasks and conditions, even in zero-shot settings.
Visual world models can be effectively applied to complex tasks like action-conditioned prediction, motion planning, and reward learning, but real-world deployment in embodied AI faces challenges related to latency, performance measurement, and safety.
Building public trust and ensuring safety are paramount for integrating AI systems with humans in everyday life, requiring robust evaluation, clear communication, and addressing the unique challenges of human-AI interaction.
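
One takeaway above is that displacement error metrics are insufficient for judging driving policies. The sketch below shows the two standard displacement metrics, average displacement error (ADE) and final displacement error (FDE), and a toy case where two different plans receive identical scores. The trajectories and the function name `displacement_errors` are hypothetical, and this is not the PDM score itself, whose definition is not given in this summary.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two equally long lists of 2D waypoints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    # Per-waypoint Euclidean distance between prediction and ground truth.
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(errors) / len(errors)  # average over all waypoints
    fde = errors[-1]                 # error at the final waypoint
    return ade, fde


# Hypothetical straight-line expert trajectory and two plans that drift
# 0.5 m to opposite sides of it.
ground_truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
drift_left = [(1.0, 0.5), (2.0, 0.5), (3.0, 0.5)]
drift_right = [(1.0, -0.5), (2.0, -0.5), (3.0, -0.5)]

print(displacement_errors(drift_left, ground_truth))
print(displacement_errors(drift_right, ground_truth))
```

Both plans score the same, even though in a real scene one side might be a free lane and the other a curb or an oncoming vehicle. Simulation-based scores that check collisions and drivable-area compliance are meant to separate such cases.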

Methods / Models / Datasets Mentioned

3D FPN
3D Packing for Self-Supervised Monocular Depth Estimation
3DCNN
4D-Occ
ADriver-I
ALIGN
ActivityNet Captions
AdaOcc
AdamW optimizer
Any-point Trajectory Modeling
BERT
BEVContrast
BEVDET-Occ
BEVFormer
BLIP-2 Q-Former
BEVDet
Bridge Data
CAF
CARLA
CBGS (Class-balanced Grouping and Sampling)
CLIP
CLIP ViT
CLIP encoder
CLIP-RN
CLIP-Text
COMPASS
CenterPoint
Chamfer Distance
Class weighted CE loss
Common Crawl
Context Net
Convex NMF
CrossFormer
DCN3D
DINO (ViT-S)
DINO-RN
DINO-ViT
DINOv2 (ViT-B)
DINOv2 (ViT-L)
DMVFN
DeLiRA (Self-Supervised Depth, Light, and Radiance Fields)
Depth Net
Depth Semantic Fusion
Dgss-evlab
DINOv2
Distance-based Imitation Loss
DistilBERT
Drive-WM
DriveDreamer
DriveGAN
DriveSim
D²-World
EPIC-KITCHENS
Eagle (2D Multimodal LLM)
Ego4D
FBOC
FastRLAP
Forward View Transformation (FVT)
GAIA
GAIA-1
GNM
GPT-3.5
GPT-4
GUDA (Geometry-guided Unsupervised Domain Adaptation)
Gato
GenAD
Genie
Ghost Gym
GitHub
HTF (Hierarchical Temporal Fusion)
Habitat HM3D
History Fusion
Honk_4626_Team
Hydra-MDP
I2VGen-XL
IQL Training
Ins Det
Ins Seg
InternLM2-Chat-20B
InternViT-6B
InternVL-1.5
InternVideo
Kubric
LGmap
LINGO-1
LINGO-2
LLM
LLT
LLaMA
LLaVA-NeXT Vision Encoder
LM-Nav
LoRA fine-tuning
LSS
LTT
Language Table sim
Le-IDE3E
Lightwheel 3D Reconstruction
Llama 2
LoRA
Lovasz-softmax loss
MAE
MAE (ViT-B)
MLP
MPI
MachMap
MapTR
Mask-Based Loss
Matterport Room-to-Room scans
MinkNet 34
MinkUNet
MonoDepth Network
Multi-Frame Self-Supervised Depth with Transformers
Multi-target Hydra-Distillation
NAVSIM
NMS-Based Ensemble
NeRF 3D Reconstruction
Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion
OccFeat
Octo
OmniDrive (3D Multimodal LLM)
Open X-Embodiment
OpenDV
OpenDV-Youtube
OpenLane-V1
OpenLane-V2
OpenVLA
PDM Score
PDM-lite
POP-3D
PRISM-1
PaLI
PaLI-X
PaLM-E
PackNet-SfM
Predictive World Model
RLPD Training
ROAD (Recursive Octree Auto-Decoder)
RT-1
RT-1 Data
RT-2
RT-2-X
RT-Hierarchy
RT-Trajectory
RT-X
Ray Casting
RayIoU
RayL@1
RayL@2
RayL@4
ReFiNe (Recursive Field Networks for Cross-Modal Multi-Scene Representation)
ResNet
ResNet-50
ResNet-34
RoboTAP
Robust Semi-Supervised Monocular Depth Estimation with Reprojected Distances
S2Net
SDXL
SLIC (Simple Linear Iterative Clustering)
ST-P3
SVT
ScaLR
Segmentation Supervision
Self-Supervised Camera Self-Calibration from Video
Semantically-Guided Representation Learning for Self-Supervised Monocular Depth
SigLIP
Sora
Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion
Spatial-aware Local-temporal Attention Block (SALT)
Stack Overflow
StyleGAN2
SuperDepth
Swin Base
Swin-Transformer
TAG (Tracking at Any Granularity)
TCOW
TopoMLP
Track2Act
Transfuser
Two Stream Networks for Self-Supervised Ego-Motion Estimation
UniAD
UniSim
UniTraj
VAD v2
VLM
VLP
VQ-GAN
VQ-VAE
VTCD (Understanding Video Transformers via Universal Concept Discovery)
Valeo4Cast
ViDAR
ViNT
ViT
Vicuna-7B
VideoCrafter1
VideoMAE - SSv2
Vision Language Model (VLM)
Vista
Waabi World
WaffleIron-256
Waymo's Waymax
WayveScenes101
Wikipedia
WoVoGen
WolframAlpha
YOLO
YouTube
ZeroDepth (Towards Zero-Shot Scale-Aware Monocular Depth Estimation)
mAVE
mAVE@LQ
mAVE@Per-voxel
mAVE@TP
nuPlan
nuScenes

Topics

3D Perception · AI Alignment · Autonomous Driving · Autonomous Systems · Benchmarking · CARLA Challenge · Computer Vision · Cross-Embodiment Learning · Data Diversity · Data Scaling · Data Scarcity · Decision Making · Deep Learning · Embodied AI · End-to-End Learning · Flow Prediction · Foundation Models · HD Maps · Human-AI Interaction · Interpretability · Lane Detection · Multi-modal Planning · Multimodal LLMs · Multimodality · Neural Rendering · Object Tracking · Occupancy Forecasting · Occupancy and Flow Prediction · Predictive World Model · Real-world Challenges · Robotics · Safety by Design · Scalable Evaluation · Self-Driving Cars · Self-Supervised Learning · Sensor Fusion · Simulation · Steerability · Vehicle Simulation · Video Prediction Models · Vision Language Models · Visual Foundation Models · Visual World Models · World Models

Notes

Open for commentary — connections to other work, critiques, follow-up reading.