Generalization via Scaling Robotics
Event: CVPR 2025 Workshop on X · Duration: 0 min
Abstract
The presentation addresses why robotics has been harder to scale than language and vision, fields that have benefited immensely from large-scale data. Robotic interaction data is scarce because hardware is expensive and collection is slow, since it happens in real time. The talk examines three main approaches to the problem (self-supervised robots, teleoperation, and Sim2Real) and outlines the limitations of each. A core focus is leveraging passive data, such as web videos and images, to reduce the need for active interaction data. Two main strategies are proposed: pre-training visual representations using human affordances, and learning world models and factorized policies from passive data. Both aim to enable robust generalization across robotic tasks and environments.
Speakers
- Abhinav Gupta — Carnegie Mellon University
Talks (1)
- 00:09:53 — Abhinav Gupta: Generalization via Scaling Robotics
- This talk explores the challenges of achieving generalization in robotics through scaling, contrasting it with advancements in language and vision models, and proposes methods using passive data and factorized policies to improve robot learning and generalization.
Key Takeaways
- Image diversity is crucial for effective visual pre-training in robotics, with general datasets like ImageNet sometimes outperforming specialized human video datasets when using standard self-supervised learning.
- Self-supervised algorithms, when applied directly to in-the-wild human videos (e.g., Ego4D), may not effectively learn robotic representations without additional guidance.
- Incorporating human affordances (contact points, hand poses, active objects) during pre-training significantly boosts the performance of visual representations for robotic tasks, even for models initially trained on general image data.
- Factorized policies, which separate visual interaction planning from robot execution, can enable strong generalization across different levels of unseen objects and activities in both table-top and in-the-wild scenarios.
- Leveraging zero-shot video generation models to create visual interaction plans from language instructions allows for chaining multiple sub-tasks to achieve complex, long-horizon robotic activities.
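The last two takeaways (factorized policies and chaining sub-tasks) can be illustrated with a small sketch. This is not the speaker's implementation: every name below (`VisualPlan`, `run_long_horizon`, the `toy_*` functions) is hypothetical, observations and actions are plain lists of floats standing in for images and robot commands, and the "planner" is a toy function rather than a video generation model. The only point is the structure: an embodiment-agnostic stage that turns an instruction into a visual plan, a robot-specific stage that turns the plan into actions, and a loop that re-plans each sub-task from the latest observation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: a real system would use images and robot commands.
Observation = List[float]
Action = List[float]


@dataclass
class VisualPlan:
    """A visual interaction plan: predicted future frames for one sub-task."""
    instruction: str
    frames: List[Observation]


# Stage 1 (embodiment-agnostic): observation + instruction -> visual plan.
Planner = Callable[[Observation, str], VisualPlan]
# Stage 2 (robot-specific): observation + visual plan -> robot actions.
Executor = Callable[[Observation, VisualPlan], List[Action]]
# Environment transition used to advance the observation.
StepFn = Callable[[Observation, Action], Observation]


def run_long_horizon(
    obs: Observation,
    instructions: Sequence[str],
    planner: Planner,
    executor: Executor,
    step: StepFn,
) -> Tuple[Observation, List[VisualPlan]]:
    """Chain sub-tasks: plan each one from the latest observation, then execute."""
    plans: List[VisualPlan] = []
    for instruction in instructions:
        plan = planner(obs, instruction)
        # Open-loop for simplicity: actions are computed once per sub-task.
        for action in executor(obs, plan):
            obs = step(obs, action)
        plans.append(plan)
    return obs, plans


def toy_planner(obs: Observation, instruction: str) -> VisualPlan:
    """Toy 'video generation': move the scene toward a target in 3 frames."""
    shift = len(instruction.split())  # arbitrary toy target, one unit per word
    target = [x + shift for x in obs]
    frames = [[o + (t - o) * k / 3 for o, t in zip(obs, target)] for k in (1, 2, 3)]
    return VisualPlan(instruction, frames)


def toy_executor(obs: Observation, plan: VisualPlan) -> List[Action]:
    """Toy executor: one delta action per pair of consecutive frames."""
    actions: List[Action] = []
    prev = obs
    for frame in plan.frames:
        actions.append([f - p for f, p in zip(frame, prev)])
        prev = frame
    return actions


def toy_step(obs: Observation, action: Action) -> Observation:
    """Toy environment: the action is added directly to the observation."""
    return [o + a for o, a in zip(obs, action)]


if __name__ == "__main__":
    final_obs, plans = run_long_horizon(
        [0.0, 0.0], ["open drawer", "pick up cup"],
        toy_planner, toy_executor, toy_step,
    )
    print([p.instruction for p in plans], final_obs)
```

In this arrangement only the executor depends on the robot, so the planner could in principle be trained on, or replaced by a model trained on, passive human video. That separation is the design choice the takeaways describe. How the actual systems in the talk represent plans and actions is not specified here.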
Methods / Models / Datasets Mentioned
Ego4D · ImageNet · Kinetics · 100 DoH · RoboNet · Masked Autoencoder (MAE) · VC-1 · MVP · PVR · R3M · CLIP · DINO · HRP (Human affordances for Robotic Pre-training) · ResNet-101 · ViT-B Encoder · Diffusion Model · Transformer · VideoPoet · RT-1 · Gen2Act
Topics
Robotics Generalization · Scaling Robotics · Passive Data · Pre-training Visual Representations · Human Affordances · World Models · Factorized Policy · Visual Interaction Plans · Sim2Real · Teleoperation · Long-Horizon Tasks · Robot Manipulation
Notes
Open for commentary — connections to other work, critiques, follow-up reading.