23650: Vision and Language for Autonomous Driving and Robotics (VLADR)

Event: CVPR Workshop on Vision and Language for Autonomous Driving and Robotics, 2024 · Duration: 433 min

Abstract

The opening segment covers large pre-trained models for robot learning. It starts with visual pre-training, using masked autoencoders on diverse image datasets to improve sample efficiency and generalization across robot tasks. It then moves to sensorimotor pre-training, where a transformer learns a physical world model by predicting missing elements in multimodal robot trajectories, leading to emergent behaviors such as self-correction. The segment closes with real-world humanoid locomotion achieved through reinforcement learning and a next-token prediction framework, demonstrating zero-shot transfer from simulation to complex real-world environments with robust, adaptive walking.

The next segment primarily displays placeholder screens with the speaker names ‘Paiva, Antonio’ and ‘liucheng’ against a black background. No presentation content, discussion, or visual aids appear, suggesting a technical break or transition period.

A further segment explores Visual Language Models (VLMs) and foundation models in robotics and autonomous driving, with a focus on generalizable motion-level intelligence. It covers RT-2 for generalization and semantic reasoning, chain-of-thought reasoning for complex tasks, and reward code generation for robot control. The discussion highlights techniques for mining existing signals from VLMs, improving their core spatial reasoning through data engineering, and enabling proactive self-improvement via human interaction. The segment also introduces end-to-end autonomous driving systems, showing how rules can emerge from data and how such systems handle complex and unusual driving scenarios.

The final segment features multiple presentations on autonomous driving and robotic manipulation. It begins with Wayve’s “Language in Driving” initiative, showcasing the LINGO-1 and LINGO-2 models for grounded video question answering and language-action alignment. Next, the RoboEXP system demonstrates how foundation models can build action-conditioned 3D scene graphs for interactive robotic exploration. The segment also covers DriveLM, an end-to-end autonomous driving system that uses graph visual question answering, and concludes with a novel collision avoidance metric for 3D camera evaluation.
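
The masked-prediction idea mentioned in the abstract (hide part of a sequence, then train a model to predict what was hidden) can be illustrated with a minimal sketch of the masking step alone. This is an assumption-laden toy, not the speakers' code: the function name `random_mask`, the token labels, and the mask ratio are all hypothetical, and no model is trained here.

```python
import random

def random_mask(tokens, mask_ratio=0.75, seed=0):
    """Split a token sequence into visible tokens and masked prediction targets.

    mask_ratio is an illustrative default, not a value taken from the talk.
    """
    rng = random.Random(seed)
    n_masked = int(len(tokens) * mask_ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_masked))
    # Visible tokens are what the model would see as input.
    visible = [(i, t) for i, t in enumerate(tokens) if i not in masked_idx]
    # Masked tokens are what the model would be trained to predict.
    targets = [(i, t) for i, t in enumerate(tokens) if i in masked_idx]
    return visible, targets

# Toy "trajectory" of interleaved image, proprioception, and action tokens.
trajectory = ["img0", "prop0", "act0", "img1", "prop1", "act1", "img2", "prop2"]
visible, targets = random_mask(trajectory, mask_ratio=0.5, seed=0)
```

In a real system the visible tokens would be fed to an encoder and the loss computed only on the masked positions; the sketch stops at producing the two index-tagged lists.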

Speakers

Ilya Radkevich — University of California, Berkeley
Paiva, Antonio
liucheng
Fei Xia — Google DeepMind
Long Chen — Wayve
Yunzhu Li — University of Illinois
Chonghao Sima — Shanghai AI Lab, University of Hong Kong
Vage (Vahe) Taamazyan

Talks (10)

01:27:07 — Ilya Radkevich: Real-World Robot Learning with Masked Visual Pre-training
- This talk introduces a method for visual pre-training in robotics using masked autoencoders on diverse image data, demonstrating improved sample efficiency and generalizability across various robot tasks.
01:34:22 — Ilya Radkevich: Real-World Humanoid Locomotion with Reinforcement Learning
- This part showcases a causal transformer model for humanoid locomotion trained in massively parallel simulation, achieving zero-shot transfer to the real world with robust and adaptive walking behaviors.
02:53:14 — Paiva, Antonio: Speaker Introduction
- A placeholder screen indicating the speaker ‘Paiva, Antonio’ is displayed.
03:34:45 — liucheng: Speaker Introduction
- A placeholder screen indicating the speaker ‘liucheng’ is displayed.
04:20:37 — Fei Xia: Towards Generalizable Motion-Level Intelligence with Foundation Models
- This talk explores using foundation models for motion-level intelligence in robotics, discussing generalization, chain-of-thought reasoning, and generating reward code, while introducing methods like PIVOT and SpatialVLM.
05:06:07 — Long Chen: Learning Principles of the World with Vision and Language for Autonomous Driving
- This talk introduces Wayve’s end-to-end AI system for autonomous driving, emphasizing data-driven learning of driving rules and principles of the world through vision and language models.
05:46:28 — Long Chen: Language in Driving
- This part of the talk discusses the need for language in autonomous driving, compares rule-based and LLM-based approaches, and presents Wayve’s LINGO-1 and LINGO-2 models for grounded video question answering and language-action alignment in real-world driving.
07:46:28 — Yunzhu Li: RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation
- This talk introduces RoboEXP, a system that leverages foundation models to build action-conditioned 3D scene graphs for robotic manipulation, enabling robots to explore, understand, and memorize environments for complex downstream tasks.
07:48:58 — Chonghao Sima: Towards Language Model in End-to-end Autonomous Driving
- This talk presents DriveLM, an end-to-end autonomous driving system that uses a graph visual question answering approach to leverage vision-language models for improved generalization and explainability in complex driving scenarios.
07:49:51 — Vage (Vahe) Taamazyan: Collision Avoidance Metric For 3D Camera Evaluation
- This talk introduces a novel collision avoidance metric for evaluating 3D camera systems, focusing on detecting all possible collisions while avoiding false positives, and demonstrates its effectiveness compared to existing metrics.
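
The last talk's goal, catching every real collision while raising no false alarms, can be read as a recall and precision trade-off. The sketch below illustrates that reading on a toy occupancy grid. It is not the metric proposed in the talk: the function `collision_scores`, the grid-cell representation, and the example data are all assumptions made for illustration.

```python
def collision_scores(path_cells, true_occupied, predicted_occupied):
    """Return (recall, precision) of collision detection along a path.

    path_cells: cells the robot would pass through.
    true_occupied: cells that really contain an obstacle.
    predicted_occupied: cells the 3D camera reports as occupied.
    """
    tp = fp = fn = 0
    for cell in path_cells:
        real = cell in true_occupied
        flagged = cell in predicted_occupied
        if real and flagged:
            tp += 1  # real collision, detected
        elif real:
            fn += 1  # real collision, missed
        elif flagged:
            fp += 1  # false alarm
    recall = tp / (tp + fn) if tp + fn else 1.0
    precision = tp / (tp + fp) if tp + fp else 1.0
    return recall, precision

# Hypothetical example: one collision detected, one missed, one false alarm.
path = [(0, 0), (1, 0), (2, 0), (3, 0)]
true_occupied = {(1, 0), (2, 0)}
predicted_occupied = {(1, 0), (3, 0)}
recall, precision = collision_scores(path, true_occupied, predicted_occupied)
```

Unlike point-cloud distances, which average error over the whole scene, a score of this kind only counts errors that change a collision decision on the path being evaluated.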

Key Takeaways

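Visual representations pre-trained with masked autoencoders on internet images and videos improve sample efficiency and generalization across robot tasks.
Sensorimotor representations can be learned from robot data by training a transformer to predict missing elements of multimodal robot trajectories.
Humanoid locomotion can be learned with reinforcement learning in simulation and transferred zero-shot to the real world.
Humanoid control can also be framed as next-token prediction.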
The segment features placeholder screens for two different speakers.
No presentation content or discussion is visible during this period.
Foundation models are increasingly being used for lower-level control in robotics, moving beyond high-level planning, and showing promising signs for generalizable motion-level intelligence.
Iterative visual prompting (PIVOT) can effectively elicit actionable knowledge from pre-trained VLMs for robot control tasks, even with weak signals, and its performance scales with larger VLMs.
Improving the core spatial reasoning capabilities of VLMs through synthetic data generation and fine-tuning is crucial, as current VLMs struggle with precise spatial understanding.
End-to-end autonomous driving systems, like Wayve’s AV2.0, are demonstrating the emergence of complex driving rules directly from data, enabling robust performance in diverse and challenging real-world scenarios.
Language models can significantly enhance autonomous driving systems by providing better interpretability, handling long-tail scenarios through zero-shot reasoning, and improving scalability by leveraging internet-scale knowledge.
Wayve’s LINGO-1 and LINGO-2 models demonstrate the capability of grounded video question answering and language-action alignment, enabling autonomous vehicles to understand and communicate intentions, and even perform language-prompted driving in real-world scenarios.
RoboEXP utilizes action-conditioned 3D scene graphs and foundation models to allow robots to actively explore environments, identify objects requiring interaction, understand how to interact with them, and memorize information for complex downstream manipulation tasks.
A novel collision avoidance metric is proposed for 3D camera evaluation, which focuses on detecting all possible collisions while minimizing false positives, providing a more robust and interpretable measure of safety for autonomous systems.
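
The iterative prompting takeaway above describes a propose, select, refine loop. A minimal coarse-to-fine sketch of such a loop follows. It is not PIVOT itself: the action is a single number rather than an image annotation, candidates are laid out on a deterministic grid rather than sampled, and `toy_selector` is a hypothetical stand-in for the VLM that would pick among candidates drawn on the image.

```python
def refine_by_selection(choose, center, radius, n_candidates=5, iterations=6):
    """Repeatedly propose candidates, let `choose` pick one, and narrow the search."""
    for _ in range(iterations):
        step = 2 * radius / (n_candidates - 1)
        # Evenly spaced candidates covering [center - radius, center + radius].
        candidates = [center - radius + i * step for i in range(n_candidates)]
        # The selector (a VLM in the real setting) picks the most promising one.
        center = choose(candidates)
        # Search more finely around the chosen candidate next round.
        radius /= 2
    return center

def toy_selector(candidates):
    # Stand-in for the VLM: prefers the candidate closest to a hidden target.
    hidden_target = 0.7
    return min(candidates, key=lambda a: abs(a - hidden_target))

best_action = refine_by_selection(toy_selector, center=0.0, radius=1.0)
```

Each round only asks the selector to compare a handful of options, which is why a weak but better-than-chance signal can still steer the search toward a precise answer.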

Methods / Models / Datasets Mentioned

ALOHA
BLIP-2
Behavior Cloning
CARLA
CLIP
Causal Transformer
Chamfer Distance
DALL-E
DPDist
Diffusion Policy
Diffusion video decoder
Digit robot
DriveLM
Earth Mover's Distance
Ego4D
F-score
Franka
GPT
GPT-4V
GPT-X
Gemini
Graph Visual Question Answering
Hausdorff Distance
ImageNet
InstructBLIP
L2R
LINGO-1
LINGO-2
LLaVA-1.5
LMPC
LingoQA
LoRA
MOKA
Masked Autoencoder
MoCap
PIVOT
PaLI
PaLM
PaLM-E
RAG
RT-2
Reinforcement Learning
RoboEXP
RoboPoint
SAM
Sliced Wasserstein Distance
SpatialVLM
Transformer
VQ-GAN
Waymo
YouTube videos
nuScenes
xArm

Topics

Autonomous Driving · Chain-of-Thought Reasoning · Collision Avoidance Metrics · Common Sense Reasoning for Driving · Data Engineering for VLMs · Data-Driven Driving Systems · Emergence of Rules from Data · Emergent Behaviors · End-to-End Autonomous Driving · End-to-end Driving · Foundation Models · Foundation Models in Robotics · Generalization in Robotics · Generative World Models · Human-Robot Interaction · Humanoid Locomotion · Imitation Learning · Interactive Exploration · Iterative Visual Prompting · Large Pre-trained Models · Masked Autoencoders · Motion-Level Intelligence · Next Token Prediction · Reward Code Generation · Robot Learning · Robotic Manipulation · Scene Graphs · Self-Improving LLMs/VLMs · Sensorimotor Pre-training · Simulation to Real (Sim2Real) · Spatial Reasoning · Transformers · Vision-Language Models (VLM) · Visual Language Models (VLMs) · Visual Pre-training · Zero-Shot Transfer
