Generalization via Scaling Robotics

Event: CVPR 2025 Workshop on X

Abstract

The presentation addresses why robotics has been harder to scale than language and vision, fields that have benefited immensely from large-scale data. Robotic interaction data is scarce because hardware is expensive and collection is limited to real-time speed. The talk investigates three main approaches (self-supervised robots, teleoperation, and Sim2Real) and outlines their limitations. A core focus is leveraging passive data (web videos and images) to reduce the need for active interaction data. Two main strategies are proposed: pre-training visual representations using human affordances, and learning world models or factorized policies from passive data, both aimed at robust generalization across robotic tasks and environments.

Speakers

Abhinav Gupta — Carnegie Mellon University

Talks (1)

00:09:53 — Abhinav Gupta: Generalization via Scaling Robotics
- This talk explores the challenges of achieving generalization in robotics through scaling, contrasting it with advancements in language and vision models, and proposes methods using passive data and factorized policies to improve robot learning and generalization.

Key Takeaways

Image diversity is crucial for effective visual pre-training in robotics, with general datasets like ImageNet sometimes outperforming specialized human video datasets when using standard self-supervised learning.
Self-supervised algorithms, when applied directly to in-the-wild human videos (e.g., Ego4D), may not effectively learn robotic representations without additional guidance.
Incorporating human affordances (contact points, hand poses, active objects) during pre-training significantly boosts the performance of visual representations for robotic tasks, even for models initially trained on general image data.
Factorized policies, which separate visual interaction planning from robot execution, can enable strong generalization across different levels of unseen objects and activities in both table-top and in-the-wild scenarios.
Leveraging zero-shot video generation models to create visual interaction plans from language instructions allows for chaining multiple sub-tasks to achieve complex, long-horizon robotic activities.
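The last two takeaways describe a two-stage structure that can be sketched in code. What follows is a minimal illustration, not the speaker's implementation: every name (InteractionPlan, toy_planner, toy_executor, run_long_horizon) is hypothetical, frames and actions are plain strings standing in for images and motor commands, and the planner is a placeholder where a video generation model would sit. It also assumes the final planned frame can stand in for the scene after execution, whereas a real system would read a new camera image.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-ins: a real system would use image arrays and motor commands.
Frame = str
Action = str


@dataclass
class InteractionPlan:
    """Robot-agnostic description of how the scene should change."""
    instruction: str
    frames: List[Frame]


# Stage 1 maps (observation, language instruction) to a visual plan.
Planner = Callable[[Frame, str], InteractionPlan]
# Stage 2 maps (observation, visual plan) to robot-specific actions.
Executor = Callable[[Frame, InteractionPlan], List[Action]]


def toy_planner(observation: Frame, instruction: str) -> InteractionPlan:
    """Placeholder for a video generation model; emits three labeled frames."""
    frames = [f"{instruction}[{t}]" for t in range(3)]
    return InteractionPlan(instruction=instruction, frames=frames)


def toy_executor(observation: Frame, plan: InteractionPlan) -> List[Action]:
    """Placeholder for a learned policy; emits one action per planned frame."""
    return [f"track({frame})" for frame in plan.frames]


def run_long_horizon(
    initial_observation: Frame,
    instructions: Sequence[str],
    planner: Planner,
    executor: Executor,
) -> List[Action]:
    """Chain sub-tasks: plan visually, execute, then plan the next sub-task."""
    observation = initial_observation
    actions: List[Action] = []
    for instruction in instructions:
        plan = planner(observation, instruction)
        actions.extend(executor(observation, plan))
        # Assumption: the last planned frame approximates the post-execution scene.
        observation = plan.frames[-1]
    return actions


if __name__ == "__main__":
    for action in run_long_horizon(
        "scene0", ["open drawer", "place cup"], toy_planner, toy_executor
    ):
        print(action)
```

Because the planner never sees robot-specific details, it could in principle be trained on or prompted with passive human video alone, while only the executor needs robot interaction data. That separation is the design choice the sketch is meant to highlight.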

Methods / Models / Datasets Mentioned

Ego4D
ImageNet
Kinetics
100DOH (100 Days of Hands)
RoboNet
Masked Auto-Encoder (MAE)
VC-1
MVP
PVR
R3M
CLIP
DINO
HRP (Human Affordances for Robotic Pre-training)
ResNet-101
ViT-B Encoder
Diffusion Model
Transformer
VideoPoet
RT-1
Gen2Act

Topics

Robotics Generalization · Scaling Robotics · Passive Data · Pre-training Visual Representations · Human Affordances · World Models · Factorized Policy · Visual Interaction Plans · Sim2Real · Teleoperation · Long-Horizon Tasks · Robot Manipulation

Notes

Open for commentary — connections to other work, critiques, follow-up reading.