The 5th Annual Embodied AI Workshop

Event: CVPR 2024, Seattle · Duration: 416 min

Abstract

This segment introduces the 5th Annual Embodied AI Workshop at CVPR 2024, detailing its organizers, scientific advisory board, and challenge organizers. It then presents three challenges: MultiON, HAZARD, and PRS. The MultiON challenge focuses on multi-object navigation using natural language instructions in simulated environments. The HAZARD challenge addresses embodied decision-making in dynamically changing disaster environments such as fire, flood, and wind. The PRS challenge introduces human-centered in-building delivery tasks for robots, emphasizing human-robot interaction and complex indoor navigation. The segment concludes with a keynote on the importance of scaling environments, data, and models for generalizable robots, introducing Holodeck and Objaverse as tools for generating diverse and realistic simulation data.

The next segment features a panel discussion on embodied AI, covering the evolution of robotics, the role of human-like form factors, and the challenges of integrating perception and control. Following the panel, Brian Ichter presents his work on “Foundation Models for Robotics and Robotics for Foundation Models.” He discusses the rapid advancements in Foundation Models (FMs) and their current limitations in real-world embodied interaction. Ichter introduces methods like PIVOT for visual iterative optimization in robotic control and Chain of Code for reasoning and code generation, demonstrating how these approaches can bridge the gap between FMs and robotics. He emphasizes that robotics can also contribute to improving FMs by providing rich, interactive, and spatially aware data, addressing current weaknesses in spatial reasoning and embodied understanding.

A further segment explores the current state and future challenges of embodied AI, focusing on the transition from internet-scale AI to contextual AI. Speakers discuss the limitations of current models, the need for egocentric data from wearables, and the immense data volume and privacy concerns associated with raw sensor streams. Various research projects and challenges are presented, including Project Aria, ManiSkill ViTac, and HomeRobot OVMM, which aim to develop predictive models of reality, enable robust robotic manipulation through tactile sensing, and facilitate open-vocabulary mobile manipulation in diverse environments. The discussions highlight the critical role of simulation, the complexities of sim-to-real transfer, and the importance of safety, affordability, and open-source development for deploying robots in real-world settings.

Eric Jang discusses the challenges of scaling robotics models and data. He highlights that, unlike in NLP and vision, scaling up robotics models does not always lead to better performance, attributing this to issues like irreproducible real-world evaluation, data freshness, architectural bottlenecks, and the lack of sufficiently complex tasks. Jang emphasizes the need for more diverse and challenging tasks to truly push the boundaries of robotics AI.

The final segment features a panel discussion on the future of embodied AI, focusing on the role of planning, simulation, and real-world data. The panelists debate the necessity of explicit planning in complex, unstructured environments versus the effectiveness of end-to-end learning and greedy policies. They also discuss the challenges of data collection, the importance of high-quality data, and the potential for large language models to bridge the gap between human interaction and robotic capabilities. The conversation highlights the need for robust evaluation metrics and the ethical considerations of deploying robots in real-world settings.

Speakers

Anthony Francis — Logical Robotics
Sonia Raychaudhuri — SFU
Francesco Taioli — IntelliGO Labs
Qinhong Zhou — UMass Amherst
Hao Dong — Peking University
Ani Kembhavi — Allen Institute for AI (AI2)
Ade Famoti — Microsoft Research
Ashley — Microsoft Research
Olivia Norton — Sanctuary AI
Stevie — Microsoft
Brian Ichter — Physical Intelligence
Richard Newcombe — Meta
Stone Tao — UC San Diego
Xiaofeng Gao
William Smith
Chris Paxton — Google
Luca Weihs — Allen Institute for AI
Eric Jang
Speaker 2
Speaker 3
Speaker 4

Talks (15)

00:00:00 — Anthony Francis: The Fifth Annual Embodied AI Workshop
- Anthony Francis introduces the 5th Annual Embodied AI Workshop at CVPR 2024, Seattle, highlighting its organizers, scientific advisory board, challenge organizers, and the workshop’s evolution over five years.
00:10:09 — Sonia Raychaudhuri: 4th Multi Object Navigation (MultiON) Challenge
- Sonia Raychaudhuri introduces the 4th MultiON Challenge, focusing on navigation to an ordered sequence of objects described by natural language instructions within simulated environments using the Habitat Synthetic Scenes Dataset (HSSD).
00:15:34 — Francesco Taioli: Vision-Language Foundation Models for Open-Set Object Navigation
- Francesco Taioli presents the winning solution for the MultiON challenge, detailing a method that combines label extraction using spaCy and LLMs with map querying via CLIP embeddings for open-set object navigation.
00:19:39 — Qinhong Zhou: HAZARD Embodied Decision Making in Dynamically Changing Environments
- Qinhong Zhou introduces the HAZARD challenge, which focuses on embodied decision-making in dynamically changing environments like fire, flood, and wind disasters, using the ThreeDWorld simulator.
00:27:24 — Hao Dong: PRS: Human-Centered In-Building Delivery Challenge
- Hao Dong introduces the PRS challenge, focusing on human-centered in-building delivery tasks for robots, where robots must reason about human locations, understand natural language instructions, and navigate complex indoor environments.
01:08:28 — Ani Kembhavi: The Blueprint for Truly Generalizable Robots: Scale, Scale and Scale
- Ani Kembhavi discusses the importance of scaling environments, data, and models to achieve truly generalizable robots, highlighting the limitations of current simulation and the need for high-quality, diverse data.
01:23:44 — Ade Famoti, Ashley, Olivia Norton, Stevie: Towards Seamless Integration of Perception and Action
- Panelists discuss the evolution of embodied AI, the role of human-like form factors, and the challenges of integrating perception and control in robotics.
01:26:27 — Brian Ichter: Foundation Models for Robotics and Robotics for Foundation Models
- Brian Ichter explores how foundation models can be applied to robotics and how robotics can contribute to improving foundation models, highlighting methods like PIVOT and Chain of Code.
02:46:15 — Richard Newcombe: From Internet Scale AI to Contextual AI
- Discusses the limitations of internet-scale AI lacking real-world context, introduces Project Aria for egocentric data collection, highlights data volume and privacy challenges, and proposes predictive models of reality from functional primitives for contextual AI.
03:04:35 — Stone Tao: ManiSkill ViTac Challenge
- Introduces the ManiSkill ViTac Challenge for vision-based tactile robotic manipulation, detailing the real-world setup, custom FEM+IPC simulator, efficient tactile signal representation, and accurate Sim2Real transfer, while also announcing future challenges.
03:21:50 — Xiaofeng Gao: Tasks and Demonstrations
- Presents a framework for language-grounded robotic manipulation tasks in household environments, using language instructions for goal states, expert demonstrations for data augmentation, and evaluating a 7-DoF Franka Emika Panda robot on 800 test cases.
03:34:15 — William Smith: HomeRobot Open Vocabulary Mobile Manipulation (OVMM) Challenge @ EAI CVPR 2024
- Introduces the HomeRobot OVMM Challenge, focusing on controlling a Stretch robot to move open-vocabulary objects between receptacles in unseen environments, with evaluation via simulation and zero-shot real robot testing based on overall success, partial success, and steps.
03:46:15 — Chris Paxton
- Discusses the challenges of creating diverse and realistic simulation environments for robotics, emphasizing the need for better sim-to-real transfer methods and the importance of safety and ease of teleoperation for real-world robots.
04:00:15 — Luca Weihs
- Explores future directions in embodied AI, focusing on open-set generation of objects/environments using LLMs, fine-tuning paradigms, and the importance of safety, affordability, and open-source development for home robots.
04:09:22 — Eric Jang: Voice Commands & Chaining Tasks 1X AI Update
- Eric Jang discusses the challenges of scaling robotics models and data, highlighting issues like irreproducible real-world evaluation, data freshness, architectural bottlenecks, and the need for more diverse tasks to drive progress in robotics AI.
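
The map-querying step mentioned in Francesco Taioli's talk (matching a language label against CLIP embeddings stored in a map) can be pictured with a minimal sketch. It is not the team's implementation: the vectors below are small hand-made stand-ins for real CLIP features, and `query_map`, the grid layout, and the similarity threshold are hypothetical names and values chosen only for illustration.

```python
import numpy as np

def query_map(map_embeddings, text_embedding, threshold=0.5):
    """Return (row, col) of the best-matching map cell, or None.

    map_embeddings: array of shape (H, W, D), one feature vector per cell
                    (stand-ins for image features stored while exploring).
    text_embedding: array of shape (D,), stand-in for a text feature.
    threshold:      hypothetical minimum cosine similarity to accept a match.
    """
    # Normalize so that the dot product equals cosine similarity.
    cells = map_embeddings / (
        np.linalg.norm(map_embeddings, axis=-1, keepdims=True) + 1e-8
    )
    text = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    similarity = cells @ text  # shape (H, W)
    best = np.unravel_index(np.argmax(similarity), similarity.shape)
    if similarity[best] < threshold:
        return None  # nothing in the map resembles the goal yet
    return tuple(int(i) for i in best)

# Toy 2x2 map with 3-dimensional features (assumed values, not CLIP output).
toy_map = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.7, 0.7, 0.0]],
])
goal = np.array([0.0, 0.9, 0.1])  # pretend embedding of a goal label
goal_cell = query_map(toy_map, goal)
unseen_cell = query_map(toy_map, np.array([-1.0, -1.0, -1.0]))
```

In this toy setup the goal vector is closest to the cell at row 0, column 1, while the second query falls below the threshold and returns `None`, which an agent could treat as a signal to keep exploring.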

Key Takeaways

Embodied AI research is rapidly advancing, with a focus on agents that can perceive, act, reason, and converse, moving beyond simple robotics and computer vision tasks.
The workshop highlights the increasing complexity of Embodied AI, enabled by advancements in labeled datasets, MDPs, MPC, foundation models, and effective simulators, with challenges playing a crucial role in driving research.
New challenges like MultiON, HAZARD, and PRS are pushing the boundaries of robot capabilities in areas such as multi-object navigation, decision-making in dynamic disaster environments, and human-centered in-building delivery.
Achieving truly generalizable robots requires massive scaling of environments, data, and models, with procedural generation in simulators like Holodeck and large object datasets like Objaverse being key to overcoming real-world data collection limitations.
Foundation Models offer powerful reasoning and data processing capabilities that can be leveraged for complex robotic tasks.
Robotics provides a crucial feedback loop for FMs, offering real-world interaction data and embodied experience to overcome current limitations.
Hybrid approaches combining FMs with robotic control strategies (like PIVOT) and code generation (like Chain of Code) show promising results for robust and generalizable robotic behaviors.
Scaling these integrated approaches has the potential to bring general-purpose AI into the physical world, but requires addressing challenges in data acquisition, model architecture, and computational efficiency.
Current internet-scale AI lacks real-world context, necessitating egocentric data from wearables and the development of ‘always-on contextual AI’ for practical applications.
Collecting and processing raw egocentric sensor data presents significant challenges in terms of data volume (petabytes), computational requirements, and privacy, leading to a focus on predictive models of reality built from compressed functional primitives.
Simulation plays a crucial role in advancing embodied AI, with efforts focused on creating diverse, realistic, and open-vocabulary environments, but bridging the sim-to-real gap remains a major hurdle.
Future directions emphasize open-source development, community involvement, and the creation of affordable, safe, and easily teleoperable robots for home environments, with a strong focus on fine-tuning and robust data collection strategies.
Scaling laws observed in NLP and vision do not consistently apply to robotics, where increasing model parameters or data size does not always guarantee improved performance.
Challenges in robotics scaling include the irreproducibility of real-world evaluations, data freshness issues, architectural bottlenecks that restrict parameter efficiency, and the potential for current tasks to be insufficiently complex to leverage larger models.
There’s a need for better methods to count and evaluate data in robotics, as current approaches may overcount data or use loss functions (like MSE) that limit predictions to a single mode, hindering the benefits of scaling.
Future progress in robotics AI requires addressing these fundamental issues, potentially through new evaluation frameworks, more robust architectures, and the development of more diverse and challenging tasks that truly push the limits of current models.
The debate between explicit planning and end-to-end learning in embodied AI is ongoing, with arguments for both approaches depending on the complexity and structure of the task.
High-quality, diverse, and large-scale real-world data is crucial for advancing embodied AI, and simulation plays a vital role in generating synthetic data and evaluating policies.
The integration of large language models (LLMs) with robotic systems offers new possibilities for human-robot interaction, enabling robots to understand and execute complex, high-level instructions.
Safety and ethical considerations are paramount for deploying robots in real-world environments, requiring robust evaluation, alignment techniques, and the ability for robots to ‘say no’ to unsafe commands.
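
The takeaway about MSE limiting predictions to a single mode can be checked with a small numerical example. This is an illustrative sketch with made-up data, not an experiment from the talk: it assumes that, for one and the same observation, demonstrations steer left (action -1.0) half the time and right (action +1.0) the other half, and that the policy outputs a single constant action.

```python
import numpy as np

# Hypothetical demonstrations: for one observation, experts steer
# left (-1.0) or right (+1.0) equally often.
actions = np.array([-1.0, 1.0, -1.0, 1.0, -1.0, 1.0])

def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return float(np.mean((targets - prediction) ** 2))

# Scan candidate constant predictions and keep the one with the lowest MSE.
candidates = np.linspace(-1.5, 1.5, 301)
losses = np.array([mse(c, actions) for c in candidates])
best_prediction = float(candidates[np.argmin(losses)])

print(f"MSE-optimal prediction: {best_prediction:.2f}")
print(
    "MSE at -1, 0, +1:",
    mse(-1.0, actions), mse(0.0, actions), mse(1.0, actions),
)
```

The minimizer is the mean of the targets, which here lies midway between the two demonstrated behaviors and matches neither of them. Under these assumptions, adding more bimodal data does not help a regressor trained this way, which is one reading of why the choice of loss can blunt the benefit of scaling.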

Methods / Models / Datasets Mentioned

1M DoH
AFRLed style API
AI2-THOR
ARNOLD
BIG-Bench Hard
Bard
CLIP
Chain of Code (CoC)
Chain of Thought
ChatGPT
Code-as-Policies
ConVOI
Coordinate Canonicalization
DROID dataset
DINOv2
EFM3D
Ego4D
FEM
Foundation Models
GENIE
GLM-4
GOAT
GPT-4
GPT-4V
Gaussian Splatting
Gemini
Grounding DINO
Grounding SAM
H1
HAZARD
HMD2
HSSD
Habitat
Holodeck
HomeRobot OVMM
IEDL
IPC
ImageNet
Isaac Lab
Isaac Sim
Kinetics
LLM
LMM
LMulator
Llama-style LLM
MCTS
MDPs
MJX
MOPA
MPC
ManiSkill
ManiSkill 3
ManiSkill ViTac Challenge
Mask2Former
MaskGIT
Metric Depth Estimation
Motion Planner
MultiON
Nymeria
OVMM
OXE dataset
Objaverse
Octo
Octo 55B
Octo 93M
OpenVLA 55B
OpenVLA 7B
PDA (Object Deliver)
PIRLNav
PIVOT
PRM+RL
PRS Challenge
PartNet-Mobility
PerAct
Phone2Proc
PhysX (GPU)
PointNet
PoliFormer
Project Aria
Python
Q&A Synthesis
RT-1-X
RT-2-X
Ray-Ban Meta
RefCOCO
ReplicaCAD
RoboNet
Robotics Transformer 1 (RT-1)
SAPIEN
SLAM
SPOC (Shortest Path Oracle Agent)
Segmentation
Semantic Filtering
ShapeNet
Stretch robot
TSDF Fusion
ThreeDWorld
V-JEPA
VLFM
VLM
VLMaps
VQA model
Vader
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
gpt-3.5-turbo
gpt-4
spaCy
text-davinci-003

Topics

Architectural bottlenecks · Benchmarking · CVPR 2024 · Code Generation · Contextual AI · Data Privacy · Data freshness · Data quality · Dynamic Environments · Egocentric Data · Embodied AI · Foundation Models · Generalizable Robots · Human Motion Prediction · Human-Robot Interaction · Imitation learning · Iterative Optimization · Large Language Models · Multi-Object Navigation · Open Vocabulary · Planning · Real-world data · Real-world evaluation · Robot safety · Robotic Manipulation · Robotics · Robotics Challenges · Robotics Control · Scaling laws · Sim-to-Real Transfer · Simulation · Spatial Reasoning · Task complexity · Visual Language Models (VLMs) · Wearables

Notes

Open for commentary — connections to other work, critiques, follow-up reading.