How to Train Your Humanoid: From Human Mesh Recovery to VideoMimic
Event: Global 3D Human Poses Workshop, CVPR 2025 · Duration: 25 min
Abstract
This presentation traces the path from human mesh recovery (HMR) to humanoid robot control, emphasizing the role of video data. It reviews the evolution of HMR techniques, including robust single-view body and hand mesh recovery, and introduces HSfM for joint 3D reconstruction of people, places, and cameras from multi-view and monocular video. The core contribution, VideoMimic, uses these reconstructions to train generalist humanoid policies that perform complex actions and interact contextually with diverse real-world environments, demonstrating zero-shot sim-to-real transfer. The talk also highlights the value of tools such as Viser for visualization and debugging in robotics research.
Speakers
- Angjoo Kanazawa — University of California, Berkeley
Talks (1)
- 00:00:00 — Angjoo Kanazawa: How to Train Your Humanoid: From Human Mesh Recovery to VideoMimic
- An overview of advancements in human mesh recovery and its application to training humanoid robots for contextual control using video data.
Key Takeaways
- Accurate 3D human mesh recovery from video is a foundational step for training humanoid robots to mimic complex human behaviors.
- Jointly optimizing for people, places, and cameras in 3D reconstruction, especially with human-centric scaling, provides crucial metric-scale and contextual information necessary for robust robot interaction.
- Contextual control policies, trained through a multi-stage process involving mocap pre-training, geometry-aware tracking, and distillation, enable robots to adapt to unseen environments and tasks without explicit joint-level instructions.
- The VideoMimic framework demonstrates the feasibility of zero-shot sim-to-real transfer for humanoid robots, allowing them to perform diverse skills in real-world settings based on learned policies from video demonstrations.
- Despite significant progress, challenges remain in monocular 4D human reconstruction, including camera drift sensitivity, low-texture environments, meshing artifacts, and the inherent lossiness of each stage in the real-to-sim-to-real pipeline.
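The human-centric scaling mentioned in the takeaways can be illustrated with a minimal sketch. It is not the HSfM optimization itself, which jointly refines people, scene, and cameras. It only shows the underlying idea: a body model with metric proportions can fix the unknown scale of a reconstruction. The bone list, the function names, and the median-of-ratios estimator are all assumptions made for this example.

```python
import math
from statistics import median

# Hypothetical skeleton for illustration: pairs of joint indices forming bones.
BONES = [(0, 1), (1, 2), (2, 3)]


def bone_lengths(joints, bones=BONES):
    """Return the Euclidean length of each bone for a list of (x, y, z) joints."""
    return [math.dist(joints[a], joints[b]) for a, b in bones]


def metric_scale(scene_joints, body_joints, bones=BONES):
    """Estimate the factor that maps scene units to metres.

    scene_joints: joints located in the up-to-scale scene reconstruction.
    body_joints:  the same joints from a metric body model, in metres.
    The median of per-bone ratios is used so that one badly placed joint
    does not dominate the estimate (an assumption, not the talk's method).
    """
    scene = bone_lengths(scene_joints, bones)
    body = bone_lengths(body_joints, bones)
    ratios = [b / s for s, b in zip(scene, body) if s > 1e-9]
    if not ratios:
        raise ValueError("no usable bones to estimate scale")
    return median(ratios)


def rescale(points, scale):
    """Multiply scene points by the scale so distances are in metres."""
    return [tuple(scale * c for c in p) for p in points]


if __name__ == "__main__":
    # Toy data: the scene reconstruction is half the metric size.
    body = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 1.0, 0.0), (0.0, 1.75, 0.0)]
    scene = [(0.0, 0.0, 0.0), (0.0, 0.25, 0.0), (0.0, 0.5, 0.0), (0.0, 0.875, 0.0)]
    s = metric_scale(scene, body)
    print("estimated scale:", s)
    print("rescaled scene joints:", rescale(scene, s))
```

In a real pipeline the scale would be one variable among many in a joint optimization. The closed-form ratio here is only meant to make concrete why a metric human prior is useful for a robot that must act at true scale.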
Methods / Models / Datasets Mentioned
SMPLify 2016 · Human Mesh Recovery (CVPR 2018) · Human Mesh Recovery 2.0 (ICCV 2023) · HaMeR (CVPR 2024) · Tesla 2025 · Ma et al., Learning Coordinated Badminton Skills for Legged Manipulators · He et al., Learning Human-to-Humanoid Real-Time Whole-body Teleoperation, 2024 · Rudin et al., Advanced skills by learning locomotion and local navigation end-to-end, 2023 · SFV (Skill-from-Video, SIGGRAPH Asia 2018) · SLAHMR (Ye et al. CVPR 2023) · HSfM · DUSt3R · TRAM · BSTRO · ViTPose · MegaSAM · PyRoki · NKSR (Huang et al. CVPR '23) · GeoCalib (Veicht et al. ECCV '24) · DeepMimic · Nerfstudio · Viser
Topics
Humanoid robotics · Human mesh recovery · Video-based learning · Reinforcement learning · Contextual control · Sim-to-real transfer · 3D reconstruction · Motion tracking · Scene understanding
Notes
Open for commentary — connections to other work, critiques, follow-up reading.