Foundation Models for Autonomous Systems Workshop

Event: CVPR 2024 Workshop · Duration: 561 min

Abstract

This segment introduces the CVPR 2024 workshop on “Foundation Models for Autonomous Systems,” highlighting the motivation, grand challenge statistics, and an overview of the schedule and speakers. It then features a detailed talk by Sergey Levine on “Robotic Foundation Models,” where he discusses the evolution of AI from task-specific models to large pre-trained foundation models and explores how this paradigm can be applied to robotics for navigation and manipulation tasks. The segment concludes with the introduction of Sherry Yang’s talk on “Foundation Models as Real-World Simulators.”

This segment features Alex Kendall, Co-founder & CEO of Wayve, discussing ‘The Road to Embodied AI’. He highlights the challenges and opportunities in developing embodied AI for autonomous driving, emphasizing Wayve’s end-to-end learning system. Key areas of focus include simulation for generating diverse data, multimodal training for improved understanding, and scaling data and compute to achieve generalizable and safe autonomous driving. The talk also covers Wayve’s open-source contributions like WayveScenes101 and their LINGO-2 model for language-prompted driving.

This video segment introduces two challenges related to autonomous driving: NAVSIM (Non-Reactive Autonomous Vehicle Simulation and Benchmarking) and Predictive World Model. It features presentations of the top solutions for each challenge. NVIDIA’s Hydra-MDP, the 1st place NAVSIM solution, utilizes multi-target Hydra-Distillation for end-to-end multimodal planning. Huawei Noah & CUHK-SZ’s D²-World, the 2nd place Predictive World Model solution, proposes an efficient world model through decoupled dynamic flow for 4D occupancy forecasting. The segment highlights the importance of robust metrics and efficient models for advancing autonomous driving technology.

This segment features presentations from the CVPR 2024 workshop, highlighting advancements in computer vision for autonomous systems and embodied AI. Talks cover topics such as 3D occupancy and flow prediction using adaptive view transformations, end-to-end autonomous driving with vision language models, and the development of visual foundation models. Key themes include self-supervised learning for robust 3D perception, synthetic data generation, object-centric representations, and interpretability of video transformers, showcasing cutting-edge research in these fields.

This segment features presentations from the top-performing teams in the OpenDriveLab challenges, including LGmap’s solution for vectorized HD map construction, InternVL for DriveLM’s winning approach, and NVIDIA’s multimodal LLM solution for the Driving with Language Challenge. It also covers the CARLA Autonomous Driving Challenge, detailing its environments, metrics, and top scorers. The segment concludes with a deep dive into foundation models in the automotive industry, showcasing self-supervised learning techniques for both image and LiDAR data.

This video segment features three distinct talks related to foundation models in autonomous systems. The first speaker concludes a discussion on ScaLR simplifications, their performance impact, and the influence of network scale and dataset diversity, followed by future perspectives and a Q&A. The second speaker, Ted Xiao from Google DeepMind, presents on ‘What’s Missing for Robotics-First Foundation Models?’, identifying critical gaps such as positive transfer from scale, steerability, and scalable evaluation in the current landscape of robotics foundation models. The final speaker, Li Chen from OpenDriveLab, introduces ‘Visual World Models as “Foundation” Models for Autonomous Systems’, showcasing the OpenDV-2K dataset and the GenAD video prediction model for various autonomous driving prediction tasks.

This segment concludes a presentation on visual world models (Vista, ViDAR, MPI) for autonomous driving and robotics, followed by a Q&A session. The core of the segment then transitions into a panel discussion titled ‘Challenges in Building Foundations Models for Embodied AI,’ moderated by Anthony Hu. The panel covers three main topics: technical challenges in building foundation models (data volume, diversity, quality, algorithmic advances), real-world challenges in embodied AI (specialization, latency, performance measurement), and integrating AI systems with humans (large-scale deployment, safety, trust). The discussion highlights the importance of diverse data, robust evaluation, and the need for collaboration to advance embodied AI.

Speakers

  • Hongyang Li — Shanghai AI Lab
  • Sergey Levine — UC Berkeley, Physical Intelligence
  • Sherry Yang — Google DeepMind, UC Berkeley
  • Alex Kendall — Wayve
  • NAVSIM Organizers — NAVSIM Organizers
  • Zhenxin Li — Team NVIDIA
  • Zetong Yang — OpenDriveLab
  • Haiming Zhang — Huawei Noah & CUHK-SZ Team
  • Dubing Chen — University of Macau, INCEPTIO
  • Wencheng Han — University of Macau, INCEPTIO
  • Jin Fang — University of Macau, INCEPTIO
  • Jianbing Shen — University of Macau, INCEPTIO
  • Panqu Wang — ZERON
  • Rareș Ambruș — Toyota Research Institute
  • Kuang Wu — Lange Technology
  • Sulei Nian — Lange Technology
  • Can Shen — Lange Technology
  • Chuan Yang — Lange Technology
  • Zhanbin Li — Lange Technology
  • Zhiqi Li — Nanjing University
  • Tong Lu — Nanjing University
  • Zhiqi Ding — NVIDIA
  • Matt — CARLA Simulator Development Team
  • Katrin Renz — University of Tübingen
  • Andrei Bursuc — valeo.ai
  • Ted Xiao — Google DeepMind
  • Li Chen — OpenDriveLab at Shanghai AI Lab
  • Anthony Hu — Wayve (Moderator)
  • Christos Sakaridis — ETH Zürich

Talks (22)

  • 00:28:45 Sergey Levine: Robotic Foundation Models
    • Discusses the concept and challenges of building robotic foundation models, contrasting them with NLP models, and exploring approaches for navigation and manipulation.
  • 01:20:07 Alex Kendall: The Road to Embodied AI
    • This talk introduces Wayve’s approach to building generalizable embodied AI for autonomous driving, focusing on simulation, multimodality, and scale, and highlighting their end-to-end learning system and safety by design.
  • 02:40:19 NAVSIM Organizers: NAVSIM: Non-Reactive Autonomous Vehicle Simulation and Benchmarking
    • Introduction to the NAVSIM challenge, highlighting the inadequacy of simple displacement error metrics and introducing a new PDM score for robust evaluation of driving behavior, announcing NVIDIA and ZERON as leading teams.
  • 02:41:07 Zhenxin Li: Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation – 1st Place Solution of End-to-end Driving at Scale
    • Presentation of NVIDIA’s 1st place solution, Hydra-MDP, for the NAVSIM challenge, detailing its multi-modal planning with multi-target learning paradigm, Transfuser-based perception, VAD v2 + Distance-based Imitation Loss trajectory decoder, and multi-target Hydra-Distillation approach.
  • 02:42:50 Zetong Yang: Predictive World Model
    • Introduction to the Predictive World Model challenge, focusing on visual point cloud forecasting to enable intelligent agents to perceive, react, and predict, using Chamfer distance as a metric and emphasizing cross-modality learning and training efficiency.
  • 02:43:35 Haiming Zhang: D²-World: An Efficient World Model through Decoupled Dynamic Flow
    • Presentation of D²-World, the 2nd place solution for the Predictive World Model challenge, which reformulates visual point cloud forecasting as a 4D occupancy forecasting task using a two-stage pipeline with BEVDET-Occ, SALT blocks, and flow-guided warping & refinement.
  • 04:01:38 Dubing Chen, Wencheng Han, Jin Fang, Jianbing Shen: AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction
    • This talk introduces AdaOcc, a method that secured 2nd place in the CVPR 2024 Autonomous Grand Challenge for 3D Occupancy and Flow, utilizing a pipeline with image backbone, depth/context heads, view transformation, 3D U-Net, and task-specific heads.
  • 04:01:53 Panqu Wang: End-to-End Autonomous Driving Using Vision Language Model
    • This presentation details ZERON’s 2nd place camera-only solution for end-to-end autonomous driving, leveraging a Vision Language Model and synthetic data for improved perception and robustness in traffic sign recognition.
  • 04:26:30 Rareș Ambruș: Visual Foundation Models for Embodied Applications
    • This talk from Toyota Research Institute explores the development of visual foundation models for embodied AI, focusing on robust 3D perception, motion capture, object-centric representations, and interpretability of video transformers.
  • 05:23:06 Kuang Wu: LGmap: Local-to-Global Mapping Network for Online Long-Range Vectorized HD Map Construction
    • This talk presents LGmap, a solution for vectorized HD map construction, detailing its pipeline, view transformation, temporal fusion strategies, and hierarchical pedestrian representation, along with ablation studies and challenge results.
  • 05:36:36 Zhiqi Li: InternVL for DriveLM: CVPR 2024 Autonomous Grand Challenge
    • This presentation introduces InternVL for DriveLM, the winning solution for the DriveLM challenge, detailing its architecture based on InternVL-1.5, data processing for multi-view images, fine-tuning strategy, and evaluation results using GPT as a referee.
  • 05:43:36 Zhiqi Ding: Driving with Multimodal LLM
    • This presentation details NVIDIA’s second-place solution for the Driving with Language Challenge, highlighting the use of multimodal LLMs like Eagle and OmniDrive for tasks such as important object identification, decision-making, motion prediction, and grounding, and showcasing their performance in various traffic scenarios.
  • 05:55:26 Katrin Renz: CarLLaVA: Vision language models for camera-only closed-loop driving
    • This talk introduces CarLLaVA, a vision language model for camera-only closed-loop driving, emphasizing its architecture, training methodology, and performance in the CARLA challenge, highlighting the importance of generalization, reasoning, and explainability in autonomous driving.
  • 06:05:16 Andrei Bursuc: Foundation models in the automotive industry
    • This presentation explores the application of foundation models in the automotive industry, detailing Valeo’s history in ADAS, their sensor suite, scalable system architecture, and research into self-supervised learning for image and LiDAR data, including methods like POP-3D, BEVContrast, OccFeat, and ScaLR.
  • 06:43:58 Ted Xiao: What’s Missing for Robotics-First Foundation Models?
    • This talk discusses the current state of robotics foundation models, highlighting missing pieces like positive transfer from scale, steerability, and scalable evaluations, and proposes future directions for robotics research.
  • 07:27:27 Li Chen: Visual World Models as “Foundation” Models for Autonomous Systems
    • This talk introduces the concept of visual world models as foundation models for autonomous systems, detailing the OpenDV-2K dataset, the GenAD video prediction model, and its application in zero-shot generalization, language-conditioned prediction, and action-conditioned prediction for driving scenarios.
  • 08:00:47 Hongyang Li: Tasks Action-conditioned Prediction (Simulation)
    • Concluding remarks on Vista, ViDAR, and MPI models, their applications in driving and robotics, and future directions, emphasizing data scale, model capabilities, and application in policy prediction and reward learning.
  • 08:23:49 Anthony Hu: Q&A on Visual World Models
    • Q&A session addressing data sources (OpenDV, YouTube), model training (LoRA), and challenges in applying visual world models to robotics, including the ‘chicken-and-egg’ problem of data scaling.
  • 08:31:34 Anthony Hu: Panel / Debate: Challenges in Building Foundations Models for Embodied AI - Topic #1: Technical challenges in building a foundation model
    • Panel discussion on the technical challenges of building foundation models, focusing on data volume, diversity, quality, and algorithmic advances, with emphasis on the need for diverse, large-scale, and collaborative data collection.
  • 09:00:37 Anthony Hu: Panel / Debate: Challenges in Building Foundations Models for Embodied AI - Topic #2: Real-world challenges in embodied AI
    • Panel discussion on real-world challenges in embodied AI, including specializing foundation models, real-time latency constraints, measuring performance, and the role of simulation and physical realism.
  • 09:11:09 Anthony Hu: Panel / Debate: Challenges in Building Foundations Models for Embodied AI - Topic #3: Integrating AI systems with humans in everyday life
    • Panel discussion on integrating AI systems with humans, covering large-scale deployment of autonomous systems, safety-critical scenarios, and building public trust, emphasizing robust evaluation and communication.
  • 00:56:28 Sherry Yang: Foundation Models as Real-World Simulators
    • Introduces the idea of using internet-scale data to learn real-world simulators (world models) for decision-making in robotics.
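
The Predictive World Model entry above names Chamfer distance as its point cloud forecasting metric. As a rough illustration only, the following NumPy sketch computes one common symmetric form: the mean nearest-neighbor Euclidean distance from each set to the other, summed over both directions. The function name `chamfer_distance` and the choice of unsquared, mean-reduced distances are assumptions made here; the challenge's exact variant is not stated in this summary.

```python
import numpy as np


def chamfer_distance(pred: np.ndarray, target: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3).

    Assumption: unsquared Euclidean distances, averaged per direction.
    Other variants use squared distances or sums instead of means.
    """
    # Pairwise differences via broadcasting, shape (N, M, 3).
    diff = pred[:, None, :] - target[None, :, :]
    # Pairwise Euclidean distances, shape (N, M).
    dists = np.linalg.norm(diff, axis=-1)
    # For each predicted point, distance to its nearest target point.
    pred_to_target = dists.min(axis=1).mean()
    # For each target point, distance to its nearest predicted point.
    target_to_pred = dists.min(axis=0).mean()
    return float(pred_to_target + target_to_pred)
```

A forecast that matches the ground-truth cloud exactly scores 0, and the score grows as predicted points drift away from any observed point or leave observed points uncovered. The dense (N, M) distance matrix is fine for small clouds; full LiDAR sweeps would typically need a spatial index or a GPU implementation.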

Key Takeaways

  • Foundation models, initially successful in NLP and vision, are being explored for autonomous systems to address data scarcity and improve generalization.
  • Large-scale, diverse datasets, including real-world and synthetic data, are crucial for training effective foundation models in robotics.
  • Cross-embodiment learning, where models are trained on data from various robot platforms and tasks, shows promise for improving generalization and performance.
  • The development of open-source robotic foundation models and standards (like IEEE P3474) is fostering collaboration and accelerating research in the field.
  • Embodied AI, particularly in autonomous driving, represents the next frontier in AI, requiring robust world models and decision-making algorithms.
  • Simulation is crucial for developing and validating embodied AI systems, enabling the generation of diverse synthetic data for training and testing edge cases.
  • Multimodal training, integrating vision, language, and action, is essential for building intelligent, explainable, and trustworthy autonomous systems.
  • Wayve’s end-to-end approach, leveraging foundation models and safety by design, aims to unlock safe and generalizable autonomous driving by learning from diverse data at scale.
  • Traditional displacement error metrics are insufficient for evaluating autonomous driving policies, necessitating more comprehensive simulation-based metrics like the PDM score.
  • Multi-target learning and distillation from rule-based teachers can significantly improve the performance and safety of end-to-end autonomous driving systems.
  • Predictive world models are crucial for enabling intelligent agents to understand and react to their environment, with 4D occupancy forecasting emerging as a key task.
  • Efficient architectures and training strategies are vital for handling the large datasets and complex tasks involved in developing robust autonomous driving systems.
  • Advanced 3D occupancy and flow prediction models are achieving high performance in autonomous driving challenges by integrating various perception and fusion techniques.
  • End-to-end autonomous driving solutions are evolving towards simpler, more generalizable vision language models, leveraging synthetic data for training and robustness.
  • Visual foundation models are being developed to address key challenges in embodied AI, focusing on robust 3D perception, motion understanding, and object-centric representations.
  • Self-supervised learning and novel neural field architectures are crucial for efficient and accurate 3D scene reconstruction, depth estimation, and object tracking across diverse domains.
  • The OpenDriveLab challenges foster innovation in autonomous driving, with solutions like LGmap demonstrating advancements in vectorized HD map construction and InternVL for DriveLM showcasing the power of vision language models.
  • Multimodal LLMs are proving highly effective in autonomous driving tasks, enabling advanced reasoning, decision-making, and explainability in complex traffic scenarios.
  • Self-supervised learning is a crucial paradigm for developing robust foundation models in the automotive industry, allowing for efficient pre-training on vast amounts of unlabeled data and adaptation to diverse sensor modalities and downstream tasks.
  • The CARLA Autonomous Driving Challenge provides a valuable platform for evaluating and accelerating progress in autonomous driving research, with top-performing teams leveraging sophisticated architectures and training strategies.
  • Simplifications in model architecture and loss functions can lead to improved performance and efficiency in multimodal learning for autonomous systems.
  • Achieving general-purpose robotics requires addressing missing pieces like positive transfer from scale, improved steerability and promptability, and scalable evaluation methodologies.
  • Future advancements in robotics foundation models will likely involve leveraging large-scale internet data, developing motion-centric representations, and improving data interoperability across different robot embodiments and tasks.
  • Visual world models, trained on massive datasets like OpenDV-2K, show promise in zero-shot generalization and language-conditioned prediction for autonomous driving, indicating a path towards more intelligent and reliable autonomous systems.
  • Large-scale, diverse, and high-quality datasets are crucial for training generalized visual world models, especially for autonomous driving and robotics, with collaborative data collection being a key challenge.
  • Self-supervised learning and efficient training strategies (like LoRA) enable models to learn robust representations and generalize across various tasks and conditions, even in zero-shot settings.
  • Visual world models can be effectively applied to complex tasks like action-conditioned prediction, motion planning, and reward learning, but real-world deployment in embodied AI faces challenges related to latency, performance measurement, and safety.
  • Building public trust and ensuring safety are paramount for integrating AI systems with humans in everyday life, requiring robust evaluation, clear communication, and addressing the unique challenges of human-AI interaction.
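
The takeaways above mention LoRA as an efficient training strategy. A minimal sketch of the underlying idea, assuming a plain NumPy linear layer: the pre-trained weight stays frozen, and only a low-rank pair of matrices is trained and added to it. The class name `LoRALinear`, its parameters, and the initialization scale are hypothetical choices for illustration, not taken from any talk or library.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pre-trained weight, kept frozen
        # Low-rank factors: A starts small and random, B starts at zero,
        # so the adapted layer initially matches the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Equivalent to x @ (W + scale * B @ A).T without forming the full update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self) -> int:
        # Only A and B would receive gradients during fine-tuning.
        return self.A.size + self.B.size
```

For an 8 x 16 weight and rank 2, the adapter holds 48 trainable values against 128 frozen ones; the gap widens sharply at the layer sizes used in large models, which is what makes this kind of fine-tuning cheap.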

Methods / Models / Datasets Mentioned

  • 3D FPN
  • 3D Packing for Self-Supervised Monocular Depth Estimation
  • 3DCNN
  • 4D-Occ
  • ADriver-I
  • ALIGN
  • ActivityNet Captions
  • AdaOcc
  • AdamW optimizer
  • Any-point Trajectory Modeling
  • BERT
  • BEVContrast
  • BEVDET-Occ
  • BEVFormer
  • BLIP-2 Q-Former
  • BEVDet
  • Bridge Data
  • CAF
  • CARLA
  • CBGS (Class-balanced Grouping and Sampling)
  • CLIP
  • CLIP ViT
  • CLIP encoder
  • CLIP-RN
  • CLIP-Text
  • COMPASS
  • CenterPoint
  • Chamfer Distance
  • Class weighted CE loss
  • Common Crawl
  • Context Net
  • Convex NMF
  • CrossFormer
  • DCN3D
  • DINO (ViT-S)
  • DINO-RN
  • DINO-ViT
  • DINOv2 (ViT-B)
  • DINOv2 (ViT-L)
  • DMVFN
  • DeLiRA (Self-Supervised Depth, Light, and Radiance Fields)
  • Depth Net
  • Depth Semantic Fusion
  • Dgss-evlab
  • DINOv2
  • Distance-based Imitation Loss
  • DistilBERT
  • Drive-WM
  • DriveDreamer
  • DriveGAN
  • DriveSim
  • D²-World
  • EPIC-KITCHENS
  • Eagle (2D Multimodal LLM)
  • Ego4D
  • FBOC
  • FastRLAP
  • Forward View Transformation (FVT)
  • GAIA
  • GAIA-1
  • GNM
  • GPT-3.5
  • GPT-4
  • GUDA (Geometry-guided Unsupervised Domain Adaptation)
  • Gato
  • GenAD
  • Genie
  • Ghost Gym
  • GitHub
  • HTF (Hierarchical Temporal Fusion)
  • Habitat HM3D
  • History Fusion
  • Honk_4626_Team
  • Hydra-MDP
  • I2VGen-XL
  • IQL Training
  • Ins Det
  • Ins Seg
  • InternLM2-Chat-20B
  • InternViT-6B
  • InternVL-1.5
  • InternVideo
  • Kubric
  • LGmap
  • LINGO-1
  • LINGO-2
  • LLM
  • LLT
  • LLaMA
  • LLaVA-NeXT Vision Encoder
  • LM-Nav
  • LoRA fine-tuning
  • LSS
  • LTT
  • Language Table sim
  • Le-IDE3E
  • Lightwheel 3D Reconstruction
  • Llama 2
  • LoRA
  • Lovasz-softmax loss
  • MAE
  • MAE (ViT-B)
  • MLP
  • MPI
  • MachMap
  • MapTR
  • Mask-Based Loss
  • Matterport Room-to-Room scans
  • MinkNet 34
  • MinkUNet
  • MonoDepth Network
  • Multi-Frame Self-Supervised Depth with Transformers
  • Multi-target Hydra-Distillation
  • NAVSIM
  • NMS-Based Ensemble
  • NeRF 3D Reconstruction
  • Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion
  • OccFeat
  • Octo
  • OmniDrive (3D Multimodal LLM)
  • Open X-Embodiment
  • OpenDV
  • OpenDV-Youtube
  • OpenLane-V1
  • OpenLane-V2
  • OpenVLA
  • PDM Score
  • PDM-lite
  • POP-3D
  • PRISM-1
  • PaLI
  • PaLI-X
  • PaLM-E
  • PackNet-SfM
  • Predictive World Model
  • RLPD Training
  • ROAD (Recursive Octree Auto-Decoder)
  • RT-1
  • RT-1 Data
  • RT-2
  • RT-2-X
  • RT-Hierarchy
  • RT-Trajectory
  • RT-X
  • Ray Casting
  • RayIoU
  • RayL@1
  • RayL@2
  • RayL@4
  • ReFiNe (Recursive Field Networks for Cross-Modal Multi-Scene Representation)
  • ResNet
  • ResNet-50
  • ResNet-34
  • RoboTAP
  • Robust Semi-Supervised Monocular Depth Estimation with Reprojected Distances
  • S2Net
  • SDXL
  • SLIC (Simple Linear Iterative Clustering)
  • ST-P3
  • SVT
  • ScaLR
  • Segmentation Supervision
  • Self-Supervised Camera Self-Calibration from Video
  • Semantically-Guided Representation Learning for Self-Supervised Monocular Depth
  • SigLIP
  • Sora
  • Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion
  • Spatial-aware Local-temporal Attention Block (SALT)
  • Stack Overflow
  • StyleGAN2
  • SuperDepth
  • Swin Base
  • Swin-Transformer
  • TAG (Tracking at Any Granularity)
  • TCOW
  • TopoMLP
  • Track2Act
  • Transfuser
  • Two Stream Networks for Self-Supervised Ego-Motion Estimation
  • UniAD
  • UniSim
  • UniTraj
  • VAD v2
  • VIDAR
  • VLM
  • VLP
  • VQ-GAN
  • VQ-VAE
  • VTCD (Understanding Video Transformers via Universal Concept Discovery)
  • Valeo4Cast
  • ViDAR
  • ViNT
  • ViT
  • Vicuna-7B
  • VideoCrafter1
  • VideoMAE - SSv2
  • Vision Language Model (VLM)
  • Vista
  • Waabi World
  • WaffleIron-256
  • Waymo's Waymax
  • WayveScenes101
  • Wikipedia
  • WoVoGen
  • WolframAlpha
  • YOLO
  • YouTube
  • ZeroDepth (Towards Zero-Shot Scale-Aware Monocular Depth Estimation)
  • mAVE
  • mAVE@LQ
  • mAVE@Per-voxel
  • mAVE@TP
  • nuPlan
  • nuScenes

Topics

3D Perception · AI Alignment · Autonomous Driving · Autonomous Systems · Benchmarking · CARLA Challenge · Computer Vision · Cross-Embodiment Learning · Data Diversity · Data Scaling · Data Scarcity · Decision Making · Deep Learning · Embodied AI · End-to-End Learning · Flow Prediction · Foundation Models · HD Maps · Human-AI Interaction · Interpretability · Lane Detection · Multi-modal Planning · Multimodal LLMs · Multimodality · Neural Rendering · Object Tracking · Occupancy Forecasting · Occupancy and Flow Prediction · Predictive World Model · Real-world Challenges · Robotics · Safety by Design · Scalable Evaluation · Self-Supervised Learning · Self-Driving Cars · Sensor Fusion · Simulation · Steerability · Vehicle Simulation · Video Prediction Models · Vision Language Models · Visual Foundation Models · Visual World Models · World Models


Notes

Open for commentary — connections to other work, critiques, follow-up reading.