Estimating human motion in world coordinates

Event: CVPR Workshops, June 2025 · Duration: 35 min

Abstract

The talk addresses the problem of estimating human motion in a global (world) coordinate system and reviews the limitations of traditional approaches that rely on single images, cropped images, or local coordinates. A significant portion covers synthetic data, introducing BEDLAM and its successor, BEDLAM2.0, which offer greater diversity in scenes, camera motions, body shapes, and clothing. The speaker also presents CameraHMR and PromptHMR, methods designed to improve camera modeling and to use scene context for more accurate world-coordinate pose estimation. The talk closes with future directions, including the integration of biomechanically accurate models and the pursuit of real-time performance.

Speakers

  • Michael J. Black — Max Planck Institute for Intelligent Systems

Talks (1)

  • 00:00 — Michael J. Black: Estimating human motion in world coordinates
    • This talk addresses the fundamental challenges of estimating 3D human motion in world coordinates from video, focusing on advancements in synthetic data generation and camera-aware pose estimation methods.

Key Takeaways

  • Current 3D human pose and shape estimation methods are often limited in three ways: they operate on single, cropped images; they assume unrealistic camera models; and they estimate pose in local rather than world coordinates.
  • Synthetic datasets like BEDLAM2.0 are critical for training robust models. They provide diverse 3D scenes, varied camera motions (including zoom and shake), a wide range of body shapes (varying BMI), and realistic clothing, hair, and shoes, all with perfect ground truth.
  • Accurate camera modeling, particularly addressing varying focal lengths and foreshortening, is crucial for improving the precision of 3D human pose estimation from monocular images.
  • Prompt-based Human Mesh Recovery (PromptHMR) leverages multimodal inputs like bounding boxes, masks, and text descriptions to provide side information, enabling more accurate and robust estimation of multiple people in complex scenes and world coordinates.
  • Future advancements in human motion estimation will focus on end-to-end training of human and camera motion, better body-scene contact estimation, incorporating biomechanical models (like SKEL), and exploiting richer semantic information from the scene to achieve Vicon-like accuracy from monocular video.
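
The takeaways on focal length, foreshortening, and world coordinates all rest on standard pinhole-camera geometry. The sketch below illustrates that geometry only; it is not taken from the talk and is not how CameraHMR or PromptHMR are implemented. The function names, the two-point "person", and all numeric values are illustrative assumptions.

```python
import numpy as np


def project_pinhole(points_cam, focal_px):
    """Project Nx3 camera-frame points (metres) to pixel offsets from the
    principal point, assuming an ideal pinhole camera with no distortion."""
    points_cam = np.asarray(points_cam, dtype=float)
    return focal_px * points_cam[:, :2] / points_cam[:, 2:3]


def camera_to_world(points_cam, R_wc, t_wc):
    """Map camera-frame points to world coordinates: X_w = R_wc @ X_c + t_wc."""
    points_cam = np.asarray(points_cam, dtype=float)
    R_wc = np.asarray(R_wc, dtype=float)
    return points_cam @ R_wc.T + np.asarray(t_wc, dtype=float)


def torso_and_hand(depth_m):
    """Hypothetical two-point person: a torso point at depth_m and a hand
    reaching 0.5 m toward the camera, both 0.3 m off the optical axis."""
    return np.array([[0.3, 0.0, depth_m],
                     [0.3, 0.0, depth_m - 0.5]])


# Two setups chosen so the torso lands on the same pixel column:
# a close camera with a short focal length, and a distant one with a long lens.
near = project_pinhole(torso_and_hand(1.5), focal_px=500.0)
far = project_pinhole(torso_and_hand(6.0), focal_px=2000.0)

# Foreshortening: how much larger the hand's offset is than the torso's.
near_ratio = near[1, 0] / near[0, 0]
far_ratio = far[1, 0] / far[0, 0]
print("near camera hand/torso ratio:", round(near_ratio, 3))
print("far camera hand/torso ratio: ", round(far_ratio, 3))

# Camera-frame estimates become world-frame ones only once the camera pose is
# known. Assumed pose: camera at (2, 0, 0) in the world, yawed 90 degrees
# about the y axis so that its optical axis points along world +x.
R_wc = np.array([[0.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0],
                 [-1.0, 0.0, 0.0]])
t_wc = np.array([2.0, 0.0, 0.0])
world = camera_to_world(torso_and_hand(1.5), R_wc, t_wc)
print("torso in world coordinates:", world[0])
```

In this toy setup the torso projects to the same pixel offset in both images, yet the hand appears 1.5 times further from the image centre in the close-up and only about 1.09 times in the long-lens shot. A method that assumes one fixed focal length for every cropped image cannot tell these cases apart, which is one way to read the takeaway on camera modeling. The second function shows why a moving camera matters: if `R_wc` and `t_wc` change per frame and are not estimated, camera-frame poses cannot be placed in a common world frame.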

Methods / Models / Datasets Mentioned

  • HMR
  • BEDLAM
  • SMPL-X
  • SynBody
  • PACE
  • EgoGen
  • WHAC
  • HumanVid
  • BEDLAM2.0
  • AMASS
  • MOYO
  • BEAT2
  • ARCTIC
  • Unreal IK
  • CameraHMR
  • SMPLify
  • DenseKP
  • PromptHMR
  • ViT
  • Multi-HMR
  • TRAM
  • GVHMR
  • MoCapade3.0
  • SKEL
  • AddBiomechanics
  • SAM2
  • LLaVA
  • SPEC
  • CLIFF
  • HMR2.0a
  • TokenHMR
  • ReFit
  • BEDLAM-CLIFF
  • BEV
  • BUDDI

Topics

Human motion estimation · 3D human pose and shape · World coordinates · Synthetic data generation · Camera modeling · Foreshortening · Biomechanics · Vision foundation models · Prompt-based HMR · Foot-ground contact


Notes

Open for commentary — connections to other work, critiques, follow-up reading.