23713 Towards Building AGI in Autonomy and Robotics
Event: CVPR 2024 Workshop · Duration: 173 min
Abstract
This session opens with an introduction to the CVPR 2024 workshop, which focuses on Artificial General Intelligence (AGI) and its application to embodied systems such as autonomous driving and robotics. Kristen Grauman then gives a keynote on the Ego(Exo)-4D dataset, detailing its multimodal, egocentric, and multi-view capture of everyday and skilled human activities, along with its privacy considerations and benchmark tasks. Next, Chelsea Finn discusses humanoids and robot generalists, showcasing advances in fine-grained manipulation, whole-body control, and the use of large-scale pre-trained vision-language-action models for efficient and generalizable robot learning.
The second half features two talks on foundation models for autonomous driving. Deva Ramanan introduces the idea of learning grounded foundational knowledge from massive 4D spatio-temporal sensor data using self-supervised 4D occupancy models. He highlights the difficulty of predicting LiDAR points directly and proposes differentiable rendering of 4D occupancy fields instead, which can improve existing annotations and enable cross-platform learning. Chonghao Sima and Kashyap Chitta then present DriveLM, an end-to-end autonomous driving system that uses Vision Language Models (VLMs) and Graph Visual Question Answering (GVQA) to address challenges in generalization, explainability, and interactivity. They detail the evaluation methodology for VQA and motion prediction in driving scenarios and introduce NAVSIM, a lightweight non-reactive simulator for benchmarking foundation models.
Speakers
- Kristen Grauman — University of Texas at Austin, FAIR, Meta
- Chelsea Finn — Stanford University
- Deva Ramanan — Carnegie Mellon University
- Chonghao Sima — OpenDriveLab, Shanghai AI Lab, Shanghai Jiao Tong University
- Kashyap Chitta
Talks (4)
- 00:05:10 — Kristen Grauman: Ego(Exo)-4D: Everyday and Skilled Human Activity in First-Person Video
- This talk introduces the Ego(Exo)-4D dataset, which captures everyday and skilled human activities from first-person and multi-view perspectives, highlighting its diverse data, modalities, and benchmark tasks for egocentric visual perception research.
- 00:49:54 — Chelsea Finn: Humanoids and Robot Generalists
- This talk explores the development of generalist robot policies for humanoids, focusing on fine-grained manipulation, whole-body control, and leveraging large-scale pre-trained vision-language models for robust and data-efficient learning.
- 01:26:59 — Deva Ramanan: Learning to Plan in a Reactive World
- This talk discusses the need for grounded foundational knowledge in robotics. It proposes 4D occupancy models learned via self-supervision from massive spatio-temporal sensor data, and shows how they can be applied to motion planning and used to improve existing ground-truth annotations.
- 01:56:20 — Chonghao Sima and Kashyap Chitta: End-to-end Autonomous Driving At scale and with Language
- This talk introduces DriveLM, an end-to-end autonomous driving system that integrates Vision Language Models (VLMs) with Graph Visual Question Answering (GVQA) to enhance generalization, explainability, and interactivity in driving scenarios, and presents NAVSIM for lightweight benchmarking.
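As a rough illustration of the self-supervision idea in Deva Ramanan's talk, the sketch below renders an expected depth along a single sensor ray from per-sample occupancy probabilities and compares it with a measured LiDAR range. It uses one common volume-rendering formulation and is an assumption for illustration, not necessarily the formulation used in the talk; the function name, the sample values, and the `max_depth` fallback are all hypothetical.

```python
def render_expected_depth(occupancy, depths, max_depth):
    """Expected termination depth of a ray through an occupancy field.

    occupancy: probabilities in [0, 1] at each sample along the ray (hypothetical input).
    depths:    distance from the sensor to each sample, in increasing order.
    max_depth: depth assigned when the ray passes through every sample unoccupied.
    """
    if len(occupancy) != len(depths):
        raise ValueError("occupancy and depths must have the same length")

    transmittance = 1.0  # probability the ray has not been stopped yet
    expected = 0.0
    for occ, depth in zip(occupancy, depths):
        weight = transmittance * occ  # probability the ray stops at this sample
        expected += weight * depth
        transmittance *= 1.0 - occ
    # Any remaining probability mass is assigned to the maximum range.
    return expected + transmittance * max_depth


# Toy ray: free space for two samples, then a solid surface at 3 m.
ray_occupancy = [0.0, 0.0, 1.0, 0.0]
ray_depths = [1.0, 2.0, 3.0, 4.0]
rendered = render_expected_depth(ray_occupancy, ray_depths, max_depth=5.0)

# Self-supervised signal: compare the rendered depth with a measured LiDAR range.
measured_range = 3.0
l1_loss = abs(rendered - measured_range)
```

Every step is a product or sum of the occupancy values, so the loss is differentiable with respect to them. This is what would let a forecasting model be trained from raw LiDAR sweeps without manual labels.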
Key Takeaways
- The Ego(Exo)-4D dataset provides a rich, multimodal resource for understanding human activity from both first-person and multi-view perspectives, enabling research in areas like augmented reality, robot learning, and cognitive science.
- Generalist robot policies built on pre-trained vision-language models, such as OpenVLA, show significant promise for strong visual generalization and efficient fine-tuning on new tasks, in some cases outperforming larger proprietary models such as RT-2-X.
- Teleoperation, especially with advanced systems like shadowing policies for humanoids, offers a low-cost and intuitive method for collecting demonstration data and training robots for complex, full-body control tasks in the real world.
- The field is moving towards developing robot foundation models that can control various robot embodiments and generalize across diverse real-world scenarios, with open-source initiatives like OpenVLA playing a crucial role in community-driven progress.
- Grounded foundational knowledge from massive 4D spatio-temporal data is crucial for robust autonomous driving, with 4D occupancy models offering a promising self-supervised learning approach.
- Differentiable rendering of 4D occupancy fields can improve noisy ground truth annotations and enable cross-platform learning, moving towards modeling the world rather than just sensor data.
- End-to-end autonomous driving systems leveraging Vision Language Models (VLMs) and Graph Visual Question Answering (GVQA) can enhance generalization, explainability, and interactivity, addressing limitations of traditional modular approaches.
- Lightweight, non-reactive simulators like NAVSIM simplify benchmarking of foundation models for driving, allowing for efficient evaluation of diverse ideas and promoting innovation in the field.
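To make "non-reactive" in the last takeaway concrete, here is a toy sketch: the ego vehicle's planned trajectory is played forward open-loop, while every other agent simply replays its logged positions and never responds to the ego. This is not NAVSIM's API or its scoring metric; the function name, the circular collision check, and the sample trajectories are assumptions made for illustration.

```python
import math


def first_collision_step(ego_plan, logged_agents, radius=1.0):
    """Return the first timestep at which the ego plan overlaps a replayed agent.

    ego_plan:      list of (x, y) ego positions, one per timestep (hypothetical planner output).
    logged_agents: list of agent tracks, each a list of (x, y) positions from a recorded log.
    radius:        each vehicle is approximated as a circle of this radius (an assumption).
    Returns None if no overlap occurs.
    """
    for step, (ego_x, ego_y) in enumerate(ego_plan):
        for track in logged_agents:
            if step >= len(track):
                continue  # this agent's log has ended
            agent_x, agent_y = track[step]
            # Agents are replayed from the log; they do not react to the ego.
            if math.hypot(ego_x - agent_x, ego_y - agent_y) < 2 * radius:
                return step
    return None


# One logged agent crossing the ego's lane from the side.
crossing_agent = [(6.0, 3.0), (6.0, 2.0), (6.0, 1.0), (6.0, 0.0)]

fast_plan = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0)]
slow_plan = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

fast_result = first_collision_step(fast_plan, [crossing_agent])
slow_result = first_collision_step(slow_plan, [crossing_agent])
```

Because the other agents are fixed by the log, each candidate plan can be scored in a single pass with no closed-loop simulation. That is the main reason this style of evaluation is cheap. The trade-off is that it cannot capture how other road users would have responded to the ego's behavior.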
Methods / Models / Datasets Mentioned
ACC · ALOHA · ARIA GLASSES · AVA · Action De-Tokenizer · ActivityNet · BLEU · BSD · BehaviorNet · CIDEr · Caltech 101 · Caltech 256 · ChatGPT · DaVinci surgical robot · Depth Anything · Diffusion Policy · DINOv2 · EPIC-Kitchens-100 · EVA-02 · Ego-Exo4D · Ego-Exo4D ego body pose · Ego-Exo4D ego-exo relation · Ego-Exo4D keystep recognition · Ego-Exo4D proficiency estimation · Ego4D · EvalAI · Gato · GoPro · HaMeR · Humanoid Shadowing Transformer (HST) · IDM · ImageNet · Kinetics · LabelMe · LaneGCN · Llama 2 7B · Llama Tokenizer · LoRA · MLP Projector · MS COCO · Mobile ALOHA · MuJoCo physics simulator · NAVSIM · NeRF · Octo · OpenScene · OpenVLA · PASCAL · PDM-C · Places · Prismatic 7B VLM · Pupil Labs · ROUGE_L · RT-1-X · RT-2 · RT-2-X · Raster Model · S2Net · SORA · SPFNet · SUN · SigLIP · Tesla · Total-Recon · TransFuser · UrbanDriver · Visual Genome · Vuzix Blade · WHAM · Waymo · WeeView · ZShade · nuPlan
Topics
4D Occupancy Models · Artificial General Intelligence (AGI) · Autonomous Driving · Data-efficient fine-tuning · Differentiable Rendering · Egocentric vision · Embodied AI · First-person video datasets · Foundational Models · Graph Visual Question Answering (GVQA) · Humanoid robots · Model Predictive Control (MPC) · Multi-view human activity · Robot learning · Self-Supervision · Simulation and Benchmarking · Teleoperation · Vision Language Models (VLMs)
Notes
Open for commentary — connections to other work, critiques, follow-up reading.