Towards Building AGI in Autonomy and Robotics

Event: CVPR 2024 Workshop · Duration: 173 min

Abstract

The session opens with an introduction to the CVPR 2024 Workshop, which focuses on Artificial General Intelligence (AGI) and its application in embodied systems such as autonomous driving and robotics. Kristen Grauman then gives a keynote on the Ego(Exo)-4D dataset, detailing its multimodal, egocentric, and multi-view capture of everyday and skilled human activities, along with its privacy considerations and benchmark tasks. Next, Chelsea Finn discusses humanoids and robot generalists, showcasing advances in fine-grained manipulation, whole-body control, and the use of large-scale pre-trained vision-language-action models for efficient and generalizable robot learning.

The second half features two talks on foundation models for autonomous driving. Deva Ramanan introduces the idea of learning grounded foundational knowledge from massive 4D spatio-temporal sensor data using self-supervised 4D occupancy models. He highlights the difficulty of predicting LiDAR points directly and proposes differentiable rendering of 4D occupancy fields instead, an approach that can also improve existing annotations and enable cross-platform learning. Chonghao Sima and Kashyap Chitta then present DriveLM, an end-to-end autonomous driving system that uses Vision Language Models (VLMs) and Graph Visual Question Answering (GVQA) to address challenges in generalization, explainability, and interactivity. They detail the evaluation methodology for VQA and motion prediction in driving scenarios and introduce NAVSIM, a lightweight non-reactive simulator for benchmarking foundation models.

Speakers

Kristen Grauman — University of Texas at Austin, FAIR, Meta
Chelsea Finn — Stanford University
Deva Ramanan — Carnegie Mellon University
Chonghao Sima — OpenDriveLab, Shanghai AI Lab, Shanghai Jiao Tong University
Kashyap Chitta

Talks (4)

00:05:10 — Kristen Grauman: Ego(Exo)-4D: Everyday and Skilled Human Activity in First-Person Video
- This talk introduces the Ego(Exo)-4D dataset, which captures everyday and skilled human activities from first-person and multi-view perspectives, highlighting its diverse data, modalities, and benchmark tasks for egocentric visual perception research.
00:49:54 — Chelsea Finn: Humanoids and Robot Generalists
- This talk explores the development of generalist robot policies for humanoids, focusing on fine-grained manipulation, whole-body control, and leveraging large-scale pre-trained vision-language models for robust and data-efficient learning.
01:26:59 — Deva Ramanan: Learning to Plan in a Reactive World
- Discusses the need for grounded foundational knowledge in robotics, proposing 4D occupancy models learned via self-supervision from massive spatio-temporal sensor data, and how this can be applied to motion planning and improve existing ground truth annotations.
01:56:20 — Chonghao Sima and Kashyap Chitta: End-to-end Autonomous Driving At scale and with Language
- Introduces DriveLM, an end-to-end autonomous driving system that integrates Vision Language Models (VLMs) with Graph Visual Question Answering (GVQA) to enhance generalization, explainability, and interactivity in driving scenarios, and presents NAVSIM for lightweight benchmarking.

Key Takeaways

The Ego(Exo)-4D dataset provides a rich, multimodal resource for understanding human activity from both first-person and multi-view perspectives, enabling research in areas like augmented reality, robot learning, and cognitive science.
Generalist robot policies, particularly those leveraging pre-trained vision-language models like OpenVLA, show significant promise in achieving strong visual generalization and efficient fine-tuning for new tasks, even outperforming larger proprietary models in some cases.
Teleoperation, especially with advanced systems like shadowing policies for humanoids, offers a low-cost and intuitive method for collecting demonstration data and training robots for complex, full-body control tasks in the real world.
The field is moving towards developing robot foundation models that can control various robot embodiments and generalize across diverse real-world scenarios, with open-source initiatives like OpenVLA playing a crucial role in community-driven progress.
Grounded foundational knowledge from massive 4D spatio-temporal data is crucial for robust autonomous driving, with 4D occupancy models offering a promising self-supervised learning approach.
Differentiable rendering of 4D occupancy fields can improve noisy ground truth annotations and enable cross-platform learning, moving towards modeling the world rather than just sensor data.
End-to-end autonomous driving systems leveraging Vision Language Models (VLMs) and Graph Visual Question Answering (GVQA) can enhance generalization, explainability, and interactivity, addressing limitations of traditional modular approaches.
Lightweight, non-reactive simulators like NAVSIM simplify benchmarking of foundation models for driving, allowing for efficient evaluation of diverse ideas and promoting innovation in the field.
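
The differentiable-rendering takeaway above can be illustrated with a minimal sketch. It assumes the standard volume-rendering formulation (a ray terminates at the first occupied sample), which may differ from what the talk actually used; the function names `render_expected_depth` and `depth_l1_loss` are hypothetical. The sketch turns per-sample occupancy probabilities along one LiDAR ray into an expected depth, which can then be compared against the measured return. It uses NumPy for readability; the same operations in an autodiff framework would provide gradients with respect to the occupancy values.

```python
import numpy as np


def render_expected_depth(occupancy, depths):
    """Expected termination depth of one ray.

    occupancy: occupancy probabilities in [0, 1] at samples along the ray,
               ordered from near to far.
    depths:    distance of each sample from the sensor.
    """
    occupancy = np.asarray(occupancy, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # Probability that the ray passed through every earlier sample unblocked.
    free_before = np.concatenate(([1.0], np.cumprod(1.0 - occupancy)[:-1]))
    # Probability that the ray terminates exactly at each sample.
    hit_prob = free_before * occupancy
    # Assumption: a ray that never terminates is assigned the farthest depth.
    miss_prob = 1.0 - hit_prob.sum()
    return float(hit_prob @ depths + miss_prob * depths[-1])


def depth_l1_loss(occupancy, depths, measured_depth):
    """Absolute error between the rendered depth and a LiDAR return."""
    return abs(render_expected_depth(occupancy, depths) - measured_depth)


if __name__ == "__main__":
    # A ray with a single fully occupied sample at 3 m renders to 3.0.
    print(render_expected_depth([0.0, 0.0, 1.0, 0.0], [1.0, 2.0, 3.0, 4.0]))
```

Because the supervision signal is a rendered depth rather than a set of raw points, the learned quantity is the occupancy of the scene itself, which is what lets the same field be queried from a different sensor layout.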

Methods / Models / Datasets Mentioned

ACC
ALOHA
Aria glasses
AVA
Action De-Tokenizer
ActivityNet
BLEU
BSD
BehaviorNet
CIDEr
Caltech 101
Caltech 256
ChatGPT
DaVinci surgical robot
Depth Anything
Diffusion Policy
DINOv2
EPIC-Kitchens-100
EVA-02
Ego-Exo4D
Ego-Exo4D ego body pose
Ego-Exo4D ego-exo relation
Ego-Exo4D keystep recognition
Ego-Exo4D proficiency estimation
Ego4D
EvalAI
Gato
GoPro
HaMeR
Humanoid Shadowing Transformer (HST)
IDM
ImageNet
Kinetics
LabelMe
LaneGCN
Llama 2 7B
Llama Tokenizer
LoRA
MLP Projector
MS COCO
Mobile ALOHA
MuJoCo physics simulator
NAVSIM
NeRF
Octo
OpenScene
OpenVLA
PASCAL
PDM-C
Places
Prismatic 7B VLM
Pupil Labs
ROUGE_L
RT-1-X
RT-2
RT-2-X
Raster Model
S2Net
Sora
SPFNet
SUN
SigLIP
Tesla
Total-Recon
TransFuser
UrbanDriver
Visual Genome
Vuzix Blade
WHAM
Waymo
WeeView
ZShade
nuPlan

Topics

4D Occupancy Models · Artificial General Intelligence (AGI) · Autonomous Driving · Data-efficient fine-tuning · Differentiable Rendering · Egocentric vision · Embodied AI · First-person video datasets · Foundational Models · Graph Visual Question Answering (GVQA) · Humanoid robots · Model Predictive Control (MPC) · Multi-view human activity · Robot learning · Self-Supervision · Simulation and Benchmarking · Teleoperation · Vision Language Models (VLMs)

Notes

Open for commentary — connections to other work, critiques, follow-up reading.