23713 Towards Building AGI in Autonomy and Robotics

Event: CVPR 2024 Workshop · Duration: 173 min

Abstract

The first part of the session covers the introduction to the CVPR 2024 Workshop, focusing on Artificial General Intelligence (AGI) and its application in embodied systems such as autonomous driving and robotics. It then features a keynote presentation by Kristen Grauman on the Ego(Exo)-4D dataset, detailing its multimodal, egocentric, and multi-view capture of everyday and skilled human activities, along with its privacy considerations and benchmark tasks. Following this, Chelsea Finn discusses humanoids and robot generalists, showcasing advances in fine-grained manipulation, whole-body control, and the use of large-scale pre-trained vision-language-action models for efficient and generalizable robot learning.

The second part features two talks on foundation models for autonomous driving. The first, by Deva Ramanan, introduces the idea of learning grounded foundational knowledge from massive 4D spatio-temporal sensor data using self-supervised 4D occupancy models. It highlights the challenges of predicting LiDAR points directly and proposes a solution based on differentiable rendering of 4D occupancy fields, which can improve existing annotations and enable cross-platform learning. The second, by Chonghao Sima and Kashyap Chitta, presents DriveLM, an end-to-end autonomous driving system that uses Vision Language Models (VLMs) and Graph Visual Question Answering (GVQA) to address challenges in generalization, explainability, and interactivity. It details the evaluation methodology for VQA and motion prediction in driving scenarios and introduces NAVSIM, a lightweight non-reactive simulator for benchmarking foundation models.

Speakers

  • Kristen Grauman — University of Texas at Austin, FAIR, Meta
  • Chelsea Finn — Stanford University
  • Deva Ramanan — Carnegie Mellon University
  • Chonghao Sima — OpenDriveLab, Shanghai AI Lab, Shanghai Jiao Tong University
  • Kashyap Chitta

Talks (4)

  • 00:05:10 Kristen Grauman: Ego(Exo)-4D: Everyday and Skilled Human Activity in First-Person Video
    • This talk introduces the Ego(Exo)-4D dataset, which captures everyday and skilled human activities from first-person and multi-view perspectives, highlighting its diverse data, modalities, and benchmark tasks for egocentric visual perception research.
  • 00:49:54 Chelsea Finn: Humanoids and Robot Generalists
    • This talk explores the development of generalist robot policies for humanoids, focusing on fine-grained manipulation, whole-body control, and leveraging large-scale pre-trained vision-language models for robust and data-efficient learning.
  • 01:26:59 Deva Ramanan: Learning to Plan in a Reactive World
    • This talk discusses the need for grounded foundational knowledge in robotics, proposes 4D occupancy models learned via self-supervision from massive spatio-temporal sensor data, and shows how they can be applied to motion planning and used to improve existing ground-truth annotations.
  • 01:56:20 Chonghao Sima and Kashyap Chitta: End-to-end Autonomous Driving At scale and with Language
    • This talk introduces DriveLM, an end-to-end autonomous driving system that integrates Vision Language Models (VLMs) with Graph Visual Question Answering (GVQA) to enhance generalization, explainability, and interactivity in driving scenarios, and presents NAVSIM for lightweight benchmarking.

Key Takeaways

  • The Ego(Exo)-4D dataset provides a rich, multimodal resource for understanding human activity from both first-person and multi-view perspectives, enabling research in areas like augmented reality, robot learning, and cognitive science.
  • Generalist robot policies, particularly those leveraging pre-trained vision-language models like OpenVLA, show significant promise in achieving strong visual generalization and efficient fine-tuning for new tasks, even outperforming larger proprietary models in some cases.
  • Teleoperation, especially with advanced systems like shadowing policies for humanoids, offers a low-cost and intuitive method for collecting demonstration data and training robots for complex, full-body control tasks in the real world.
  • The field is moving towards developing robot foundation models that can control various robot embodiments and generalize across diverse real-world scenarios, with open-source initiatives like OpenVLA playing a crucial role in community-driven progress.
  • Grounded foundational knowledge from massive 4D spatio-temporal data is crucial for robust autonomous driving, with 4D occupancy models offering a promising self-supervised learning approach.
  • Differentiable rendering of 4D occupancy fields can improve noisy ground truth annotations and enable cross-platform learning, moving towards modeling the world rather than just sensor data.
  • End-to-end autonomous driving systems leveraging Vision Language Models (VLMs) and Graph Visual Question Answering (GVQA) can enhance generalization, explainability, and interactivity, addressing limitations of traditional modular approaches.
  • Lightweight, non-reactive simulators like NAVSIM simplify benchmarking of foundation models for driving, allowing for efficient evaluation of diverse ideas and promoting innovation in the field.
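
To make the "differentiable rendering of 4D occupancy fields" takeaway concrete, here is a minimal sketch of how a predicted occupancy field can be turned into a LiDAR-style depth along one ray. It is an illustration under stated assumptions, not the method presented in the talk: the function name `expected_ray_depth`, the uniform sample spacing `step`, and the choice to assign any leftover probability to the far end of the ray are all hypothetical. The depth is a sum of products of the occupancy values, so in an autodiff framework gradients from a depth loss could flow back to the occupancy predictions.

```python
def expected_ray_depth(occupancy, step=1.0):
    """Expected termination depth of a ray through sampled occupancy.

    occupancy[i] is the (assumed) probability that the i-th sample along
    the ray is occupied. The ray stops at sample i with probability
    occupancy[i] * prod_{j < i} (1 - occupancy[j]).
    """
    transmittance = 1.0  # probability the ray has reached the current sample
    depth = 0.0
    for i, p in enumerate(occupancy):
        if not 0.0 <= p <= 1.0:
            raise ValueError("occupancy values must lie in [0, 1]")
        sample_depth = (i + 0.5) * step  # centre of the i-th sample
        depth += transmittance * p * sample_depth
        transmittance *= 1.0 - p
    # Assumption: a ray that never stops is assigned the maximum range.
    depth += transmittance * len(occupancy) * step
    return depth


# A fully occupied third cell stops the ray at that cell's centre.
print(expected_ray_depth([0.0, 0.0, 1.0, 0.0]))
```

Comparing such a rendered depth against the measured LiDAR return gives a self-supervised training signal that does not require labels, which is the general idea the takeaway describes.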

Methods / Models / Datasets Mentioned

  • ACC
  • ALOHA
  • Aria glasses
  • AVA
  • Action De-Tokenizer
  • ActivityNet
  • BLEU
  • BSD
  • BehaviorNet
  • CIDEr
  • Caltech 101
  • Caltech 256
  • ChatGPT
  • da Vinci surgical robot
  • Depth Anything
  • Diffusion Policy
  • DINOv2
  • EPIC-Kitchens-100
  • EVA-02
  • Ego-Exo4D
  • Ego-Exo4D ego body pose
  • Ego-Exo4D ego-exo relation
  • Ego-Exo4D keystep recognition
  • Ego-Exo4D proficiency estimation
  • Ego4D
  • EvalAI
  • Gato
  • GoPro
  • HaMeR
  • Humanoid Shadowing Transformer (HST)
  • IDM
  • ImageNet
  • Kinetics
  • LabelMe
  • LaneGCN
  • Llama 2 7B
  • Llama Tokenizer
  • LoRA
  • MLP Projector
  • MS COCO
  • Mobile ALOHA
  • MuJoCo physics simulator
  • NAVSIM
  • NeRF
  • Octo
  • OpenScene
  • OpenVLA
  • PASCAL
  • PDM-C
  • Places
  • Prismatic 7B VLM
  • Pupil Labs
  • ROUGE_L
  • RT-1-X
  • RT-2
  • RT-2-X
  • Raster Model
  • S2Net
  • Sora
  • SPFNet
  • SUN
  • SigLIP
  • Tesla
  • Total-Recon
  • TransFuser
  • UrbanDriver
  • Visual Genome
  • Vuzix Blade
  • WHAM
  • Waymo
  • WeeView
  • ZShade
  • nuPlan

Topics

4D Occupancy Models · Artificial General Intelligence (AGI) · Autonomous Driving · Data-efficient fine-tuning · Differentiable Rendering · Egocentric vision · Embodied AI · First-person video datasets · Foundation Models · Graph Visual Question Answering (GVQA) · Humanoid robots · Model Predictive Control (MPC) · Multi-view human activity · Robot learning · Self-Supervision · Simulation and Benchmarking · Teleoperation · Vision Language Models (VLMs)


Notes

Open for commentary — connections to other work, critiques, follow-up reading.