23650 Vision and Language for Autonomous Driving and Robotics VLADR

Event: CVPR Workshop on Vision and Language for Autonomous Driving and Robotics, 2024 · Duration: 433 min

Abstract

The first segment covers large pre-trained models for robot learning. It starts with visual pre-training using masked autoencoders on diverse image datasets to improve sample efficiency and generalization across robot tasks. It then moves to sensorimotor pre-training, where a transformer learns a model of the physical world by predicting missing elements in multimodal robot trajectories, leading to emergent behaviors such as self-correction. It concludes with real-world humanoid locomotion achieved through reinforcement learning and a next-token prediction framework, demonstrating zero-shot transfer from simulation to complex real-world environments with robust, adaptive walking.

The next segment shows only placeholder screens with the speaker names ‘Paiva, Antonio’ and ‘liucheng’ against a black background. No presentation content, discussion, or visual aids are visible, suggesting a technical break or transition period.

The following segment explores Vision-Language Models (VLMs) and foundation models in robotics and autonomous driving, with a focus on generalizable motion-level intelligence. It covers RT-2 for generalization and semantic reasoning, chain-of-thought reasoning for complex tasks, and generating reward code for robot control. The discussion highlights techniques for mining existing signals from VLMs, improving their core spatial reasoning through data engineering, and enabling proactive self-improvement via human interaction. The segment also introduces end-to-end autonomous driving systems, showing how rules can emerge from data and how such systems handle complex and unusual driving scenarios.

The final segment features multiple presentations on autonomous driving and robotic manipulation. It begins with Wayve’s “Language in Driving” initiative, showcasing the LINGO-1 and LINGO-2 models for grounded video question answering and language-action alignment. The RoboEXP system is then introduced, demonstrating how foundation models can be used to build action-conditioned 3D scene graphs for interactive robotic exploration. The segment also covers DriveLM, an end-to-end autonomous driving system built on graph visual question answering, and concludes with a novel collision avoidance metric for 3D camera evaluation.
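
As a rough illustration of the masked-prediction setup mentioned in the abstract, the sketch below randomly hides a fraction of the tokens in a toy robot trajectory; a model would then be trained to reconstruct the hidden entries. The token layout, the `mask_tokens` helper, and the 0.75 mask ratio are assumptions made for illustration, not details taken from the talk.

```python
import random

MASK = None  # placeholder standing in for a learned mask token (assumption)

def mask_tokens(tokens, mask_ratio=0.75, seed=0):
    """Hide a random subset of tokens; return (masked sequence, hidden indices)."""
    rng = random.Random(seed)
    n_hidden = int(round(len(tokens) * mask_ratio))
    hidden = sorted(rng.sample(range(len(tokens)), n_hidden))
    hidden_set = set(hidden)
    masked = [MASK if i in hidden_set else t for i, t in enumerate(tokens)]
    return masked, hidden

# Toy trajectory interleaving image, proprioception, and action tokens per step.
trajectory = [f"{kind}_{t}" for t in range(4) for kind in ("img", "proprio", "act")]
masked, hidden = mask_tokens(trajectory)
```

The training target would be the original values at the `hidden` positions; the visible tokens are the model's only input.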

Speakers

  • Ilya Radkevich — University of California, Berkeley
  • Paiva, Antonio
  • liucheng
  • Fei Xia — Google DeepMind
  • Long Chen — Wayve
  • Yunzhu Li — University of Illinois
  • Chonghao Sima — Shanghai AI Lab, University of Hong Kong
  • Vage (Vahe) Taamazyan

Talks (10)

  • 01:27:07 Ilya Radkevich: Real-World Robot Learning with Masked Visual Pre-training
    • This talk introduces a method for visual pre-training in robotics using masked autoencoders on diverse image data, demonstrating improved sample efficiency and generalizability across various robot tasks.
  • 01:34:22 Ilya Radkevich: Real-World Humanoid Locomotion with Reinforcement Learning
    • This part showcases a causal transformer model for humanoid locomotion trained in massively parallel simulation, achieving zero-shot transfer to the real world with robust and adaptive walking behaviors.
  • 02:53:14 Paiva, Antonio: Speaker Introduction
    • A placeholder screen indicating the speaker ‘Paiva, Antonio’ is displayed.
  • 03:34:45 liucheng: Speaker Introduction
    • A placeholder screen indicating the speaker ‘liucheng’ is displayed.
  • 04:20:37 Fei Xia: Towards Generalizable Motion-Level Intelligence with Foundation Models
    • This talk explores using foundation models for motion-level intelligence in robotics, discussing generalization, chain-of-thought reasoning, and generating reward code, while introducing methods like PIVOT and SpatialVLM.
  • 05:06:07 Long Chen: Learning Principles of the World with Vision and Language for Autonomous Driving
    • This talk introduces Wayve’s end-to-end AI system for autonomous driving, emphasizing data-driven learning of driving rules and principles of the world through vision and language models.
  • 05:46:28 Long Chen: Language in Driving
    • This part of the talk discusses the need for language in autonomous driving, comparing rule-based and LLM-based approaches, and presenting Wayve’s LINGO-1 and LINGO-2 models for grounded video question answering and language-action alignment in real-world driving.
  • 07:46:28 Yunzhu Li: RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation
    • This talk introduces RoboEXP, a system that leverages foundation models to build action-conditioned 3D scene graphs for robotic manipulation, enabling robots to explore, understand, and memorize environments for complex downstream tasks.
  • 07:48:58 Chonghao Sima: Towards Language Model in End-to-end Autonomous Driving
    • This talk presents DriveLM, an end-to-end autonomous driving system that uses a graph visual question answering approach to leverage vision-language models for improved generalization and explainability in complex driving scenarios.
  • 07:49:51 Vage (Vahe) Taamazyan: Collision Avoidance Metric For 3D Camera Evaluation
    • This talk introduces a novel collision avoidance metric for evaluating 3D camera systems, focusing on detecting all possible collisions while avoiding false positives, and demonstrates its effectiveness compared to existing metrics.
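
The action-conditioned scene graph in the RoboEXP entry can be pictured as a graph in which actions link the object they manipulate to the objects they reveal. The sketch below is a toy data structure under that reading; the class, field, and action names are hypothetical and are not taken from RoboEXP itself.

```python
from dataclasses import dataclass, field

@dataclass
class ActionConditionedGraph:
    acts_on: dict = field(default_factory=dict)      # action -> object it manipulates
    revealed_by: dict = field(default_factory=dict)  # object -> action that exposes it

    def add_action(self, action, obj, reveals=()):
        """Record that `action` is applied to `obj` and exposes the `reveals` objects."""
        self.acts_on[action] = obj
        for hidden in reveals:
            self.revealed_by[hidden] = action

    def actions_to_reach(self, target):
        """Walk back from `target` to a visible object; return actions in execution order."""
        plan = []
        node = target
        while node in self.revealed_by:
            action = self.revealed_by[node]
            plan.append(action)
            node = self.acts_on[action]
        return plan[::-1]

graph = ActionConditionedGraph()
graph.add_action("open_cabinet", "cabinet", reveals=["box"])
graph.add_action("open_box", "box", reveals=["mug"])
plan = graph.actions_to_reach("mug")  # ["open_cabinet", "open_box"]
```

Storing the revealing action alongside each object is what lets a later task recover the sequence of interactions needed to reach something that is not currently visible.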

Key Takeaways

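  • Visual representations for robot learning can be pre-trained on internet images and videos.
  • Sensorimotor representations can be learned from robot trajectory data by predicting missing elements.
  • Humanoid locomotion can be learned with reinforcement learning in simulation and transferred zero-shot to the real world.
  • Humanoid control can be framed as next-token prediction.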
  • The segment features placeholder screens for two different speakers.
  • No presentation content or discussion is visible during this period.
  • Foundation models are increasingly being used for lower-level control in robotics, moving beyond high-level planning, and showing promising signs for generalizable motion-level intelligence.
  • Iterative visual prompting (PIVOT) can effectively elicit actionable knowledge from pre-trained VLMs for robot control tasks, even with weak signals, and scales with larger VLM models.
  • Improving the core spatial reasoning capabilities of VLMs through synthetic data generation and fine-tuning is crucial, as current VLMs struggle with precise spatial understanding.
  • End-to-end autonomous driving systems, like Wayve’s AV2.0, are demonstrating the emergence of complex driving rules directly from data, enabling robust performance in diverse and challenging real-world scenarios.
  • Language models can significantly enhance autonomous driving systems by providing better interpretability, handling long-tail scenarios through zero-shot reasoning, and improving scalability by leveraging internet-scale knowledge.
  • Wayve’s LINGO-1 and LINGO-2 models demonstrate the capability of grounded video question answering and language-action alignment, enabling autonomous vehicles to understand and communicate intentions, and even perform language-prompted driving in real-world scenarios.
  • RoboEXP utilizes action-conditioned 3D scene graphs and foundation models to allow robots to actively explore environments, identify objects requiring interaction, understand how to interact with them, and memorize information for complex downstream manipulation tasks.
  • A novel collision avoidance metric is proposed for 3D camera evaluation, which focuses on detecting all possible collisions while minimizing false positives, providing a more robust and interpretable measure of safety for autonomous systems.
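
The iterative visual prompting takeaway can be read as a sample, select, and refit loop. In the sketch below, `score_candidates` is a hypothetical stand-in for asking a VLM to choose among annotated candidates (here it simply ranks 2D points by distance to a hidden target), and the Gaussian refit is an assumption about how the proposal distribution might be narrowed. It is not the PIVOT implementation.

```python
import random
import statistics

def score_candidates(candidates, target=(0.6, -0.2)):
    """Hypothetical stand-in for a VLM picking among annotated candidates.
    Ranks 2D candidates by squared distance to a hidden target, best first."""
    return sorted(
        candidates,
        key=lambda c: (c[0] - target[0]) ** 2 + (c[1] - target[1]) ** 2,
    )

def iterative_prompting(n_iters=5, n_samples=16, n_keep=4, seed=0):
    """Sample candidates, keep the preferred ones, refit the sampler, repeat."""
    rng = random.Random(seed)
    mean, std = [0.0, 0.0], [1.0, 1.0]
    for _ in range(n_iters):
        candidates = [
            tuple(rng.gauss(m, s) for m, s in zip(mean, std))
            for _ in range(n_samples)
        ]
        best = score_candidates(candidates)[:n_keep]
        for d in range(2):
            vals = [c[d] for c in best]
            mean[d] = statistics.fmean(vals)
            std[d] = max(statistics.pstdev(vals), 1e-3)  # avoid a zero-width sampler
    return tuple(mean)

estimate = iterative_prompting()
```

In a real system the ranking step would come from the VLM's answer about marked candidates on the image, so only `score_candidates` would change; the outer loop stays the same.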

Methods / Models / Datasets Mentioned

  • ALOHA
  • BLIP-2
  • Behavior Cloning
  • CARLA
  • CLIP
  • Causal Transformer
  • Chamfer Distance
  • DALL-E
  • DPDist
  • Diffusion Policy
  • Diffusion video decoder
  • Digit robot
  • DriveLM
  • Earth Mover's Distance
  • Ego4D
  • F-score
  • Franka
  • GPT
  • GPT-4V
  • GPT-X
  • Gemini
  • Graph Visual Question Answering
  • Hausdorff Distance
  • ImageNet
  • InstructBLIP
  • L2R
  • LINGO-1
  • LINGO-2
  • LLaVA-1.5
  • LMPC
  • LingoQA
  • LoRA
  • MOKA
  • Masked Autoencoder
  • MoCap
  • PIVOT
  • PaLI
  • PaLM
  • PaLM-E
  • RAG
  • RT-2
  • Reinforcement Learning
  • RoboEXP
  • RoboPoint
  • SAM
  • Sliced Wasserstein Distance
  • SpatialVLM
  • Transformer
  • VQ-GAN
  • Waymo
  • YouTube videos
  • nuScenes
  • xArm
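
Several entries in the list above (Chamfer Distance, Hausdorff Distance) are point-set distances commonly used to compare an estimated point cloud against a reference. As a minimal sketch under one common convention (definitions vary, for example squared versus unsquared distances or sum versus mean), they can be computed as follows. The function names are illustrative.

```python
import math

def _nearest(p, points):
    """Distance from point p to its nearest neighbor in `points`."""
    return min(math.dist(p, q) for q in points)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance, summed over both directions."""
    return (sum(_nearest(p, b) for p in a) / len(a)
            + sum(_nearest(q, a) for q in b) / len(b))

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbor distance."""
    return max(max(_nearest(p, b) for p in a),
               max(_nearest(q, a) for q in b))

reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

cd = chamfer_distance(reference, estimated)    # about 0.667
hd = hausdorff_distance(reference, estimated)  # 2.0
```

In this toy case a single stray point sets the Hausdorff distance to 2.0 while the Chamfer distance averages it down to about 0.667, which shows how differently the two metrics weight an isolated error.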

Topics

Autonomous Driving · Chain-of-Thought Reasoning · Collision Avoidance Metrics · Common Sense Reasoning for Driving · Data Engineering for VLMs · Data-Driven Driving Systems · Emergence of Rules from Data · Emergent Behaviors · End-to-End Autonomous Driving · Foundation Models · Foundation Models in Robotics · Generalization in Robotics · Generative World Models · Human-Robot Interaction · Humanoid Locomotion · Imitation Learning · Interactive Exploration · Iterative Visual Prompting · Large Pre-trained Models · Masked Autoencoders · Motion-Level Intelligence · Next Token Prediction · Reward Code Generation · Robot Learning · Robotic Manipulation · Scene Graphs · Self-Improving LLMs/VLMs · Sensorimotor Pre-training · Simulation to Real (Sim2Real) · Spatial Reasoning · Transformers · Vision-Language Models (VLMs) · Visual Pre-training · Zero-Shot Transfer

