How to Train Your Humanoid: From Human Mesh Recovery to VideoMimic

Event: Global 3D Human Poses Workshop, CVPR 2025 · Duration: 25 min

Abstract

This presentation traces the path from human mesh recovery (HMR) to humanoid robot control, emphasizing the critical role of video data. It covers the evolution of HMR techniques, including robust single-view body and hand mesh recovery, and introduces HSfM for joint 3D reconstruction of people, places, and cameras from multi-view and monocular video. The core contribution, VideoMimic, uses these reconstructions to train generalist humanoid policies that perform complex actions and interact contextually with diverse real-world environments, demonstrating zero-shot sim-to-real transfer. The talk also highlights the value of tools such as Viser for visualization and debugging in robotics research.

Speakers

Angjoo Kanazawa — University of California, Berkeley

Talks (1)

00:00:00 — Angjoo Kanazawa: How to Train Your Humanoid: From Human Mesh Recovery to VideoMimic
- An overview of advancements in human mesh recovery and its application to training humanoid robots for contextual control using video data.

Key Takeaways

Accurate 3D human mesh recovery from video is a foundational step for training humanoid robots to mimic complex human behaviors.
Jointly optimizing for people, places, and cameras in 3D reconstruction, especially with human-centric scaling, provides crucial metric-scale and contextual information necessary for robust robot interaction.
Contextual control policies, trained through a multi-stage process involving mocap pre-training, geometry-aware tracking, and distillation, enable robots to adapt to unseen environments and tasks without explicit joint-level instructions.
The VideoMimic framework demonstrates the feasibility of zero-shot sim-to-real transfer for humanoid robots, allowing them to perform diverse skills in real-world settings based on learned policies from video demonstrations.
Despite significant progress, challenges remain in monocular 4D human reconstruction, including camera drift sensitivity, low-texture environments, meshing artifacts, and the inherent lossiness of each stage in the real-to-sim-to-real pipeline.
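
The human-centric scaling idea in the second takeaway can be illustrated with a small sketch. Structure-from-motion style reconstructions are typically defined only up to an unknown global scale, while a recovered human body carries an approximate metric size. The example below is not the HSfM optimization; it is a simplified stand-in that assumes the same person's 3D joints are available both in meters (from a body model) and in the scene's arbitrary units, and it estimates the scale as a median ratio of bone lengths. The function name `estimate_metric_scale`, the toy joint coordinates, and the bone list are all hypothetical.

```python
import numpy as np


def estimate_metric_scale(joints_metric, joints_scene, bones):
    """Return the factor that maps scene units to meters.

    joints_metric: (J, 3) joints of a person in meters (assumed given).
    joints_scene:  (J, 3) the same joints in the up-to-scale scene frame.
    bones:         list of (i, j) joint index pairs to compare.

    Bone lengths are invariant to rotation and translation, so the two
    joint sets do not need to be expressed in the same coordinate frame.
    """
    joints_metric = np.asarray(joints_metric, dtype=float)
    joints_scene = np.asarray(joints_scene, dtype=float)

    ratios = []
    for i, j in bones:
        len_metric = np.linalg.norm(joints_metric[i] - joints_metric[j])
        len_scene = np.linalg.norm(joints_scene[i] - joints_scene[j])
        if len_scene > 1e-8:  # skip degenerate (zero-length) bones
            ratios.append(len_metric / len_scene)

    if not ratios:
        raise ValueError("no usable bones to estimate scale from")

    # The median tolerates a few badly estimated joints.
    return float(np.median(ratios))


# Toy data: four joints in meters (made-up numbers, not real body-model output).
joints_metric = np.array(
    [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 1.0, 0.0], [0.3, 1.0, 0.0]]
)
bones = [(0, 1), (1, 2), (2, 3)]

# The same joints as seen in the scene frame: rotated 90 degrees about z,
# shrunk by a factor of 2.5, and shifted.
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
joints_scene = (joints_metric @ rot_z.T) / 2.5 + np.array([1.0, -2.0, 0.5])

scale = estimate_metric_scale(joints_metric, joints_scene, bones)

# Any scene geometry in the same frame can be brought to meters the same way.
scene_points_metric = scale * joints_scene
```

On the toy data the recovered scale matches the factor of 2.5 used to shrink the joints. A one-off ratio like this only conveys the intuition; the takeaway describes scale being resolved inside a joint optimization over people, scene, and cameras, which can also absorb errors in the human estimate.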

Methods / Models / Datasets Mentioned

SMPLify 2016
Human Mesh Recovery (CVPR 2018)
Human Mesh Recovery 2.0 (ICCV 2023)
HaMeR (CVPR 2024)
Tesla 2025
Ma et al., Learning Coordinated Badminton Skills for Legged Manipulators
He et al., Learning Human-to-Humanoid Real-Time Whole-body Teleoperation, 2024
Rudin et al., Advanced Skills by Learning Locomotion and Local Navigation End-to-End, 2023
SFV (Skills from Videos, SIGGRAPH Asia 2018)
SLAHMR (Ye et al. CVPR 2023)
HSfM
DUSt3R
TRAM
BSTRO
ViTPose
MegaSAM
PyRoki
NKSR (Huang et al. CVPR '23)
GeoCalib (Veicht et al. ECCV '24)
DeepMimic
Nerfstudio
Viser

Topics

Humanoid robotics · Human mesh recovery · Video-based learning · Reinforcement learning · Contextual control · Sim-to-real transfer · 3D reconstruction · Motion tracking · Scene understanding

Notes

Open for commentary — connections to other work, critiques, follow-up reading.