Towards the 3D Human Foundation Agent
Event: CVPR Workshops, June 2024 · Duration: 216 min
Abstract
This workshop explores the concept of a 3D Human Foundation Agent (HFA), aiming to make computers more human-like by integrating perception, reasoning, embodiment, and behavior. Speakers discuss the critical need for high-quality, diverse, and dynamic human motion data, highlighting challenges in data collection, particularly for embodied intelligence and physical interactions. The session also introduces novel approaches like ChatPose, which leverages large language models to understand and generate 3D human poses from natural language, demonstrating progress in reasoning about human behavior and generalizing to complex scenarios. The discussions emphasize the importance of synthetic data, global coordinate estimation, and efficient computational methods to advance the field.
Speakers
- Michael J. Black — Max Planck Institute for Intelligent Systems & Meshcapade
- Karen Liu — Stanford University
- Siwei Zhang — Meta Reality Labs
Talks (3)
- 00:00:00 — Michael J. Black: Towards the 3D Human Foundation Agent
- Introduces the concept of a Human Foundation Agent (HFA) and its four key components (perception, reasoning, embodiment, and behavior), and argues that computers need to become more human-like.
- 01:08:00 — Karen Liu: Data Collection for Embodied Intelligence
- Discusses the challenges and approaches to collecting high-quality, diverse, and dynamic human motion data for embodied intelligence, highlighting the DexCap system and the need for contextual and physical interaction data.
- 02:22:00 — Siwei Zhang: ChatPose: Chatting about 3D Human Pose
- Introduces ChatPose, a large language model that understands and generates 3D human poses from natural language descriptions and images, and demonstrates that it can reason about poses and generalize to cases of extreme occlusion.
Methods / Models / Datasets Mentioned
- SMPL (Loper, Mahmood, Romero, Pons-Moll, Black, 2015)
- NeRFs
- Gaussian Splatting
- HUGS: Human Gaussian Splats, Kocabas et al., CVPR 2024
- HAAR
- TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation, Dwivedi et al., CVPR 2024
- VQ-VAE
- WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion, Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black, CVPR 2024
- DPVO (+ HMR2.0) [6, 45]
- GLAMR [54]
- TRACE [43]
- SLAHMR [52]
- DPVO [45]
- WHAM (w/ DPVO [45])
- WHAM (w/ DROID [44])
- WHAM (w/ GT gyro)
- SLAHMR [4]
- PACE [5]
- EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling, Haiyang Liu et al., CVPR 2024
- AMUSE: Emotional Speech-Driven 3D Body Animation via Disentangled Latent Diffusion, Kiran Chhatre et al., CVPR 2024
- PoseScript: 3D Human Poses from Natural Language, Delmas et al., ECCV 2022
- BEDLAM
- HMR
- HMR 2.0
- GPT-4o
- GPT-4
- DALL-E
- LLaVA
- LLaVA*-S
- LLaVA-P
- GPT4-S
- GPT4-P
- ChatPose
- DexCap
- NVIDIA CUDA
- ImageNet
- EGO-EXO4D
- HARMONY-4D
- Olympics
- Mask2Former
- Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild, Lingni Ma et al., 2024
- HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Devices, arXiv soon
- WANDR: Intention-guided Human Motion Generation, Markos Diomataris et al.
- SCULPT: Shape-Conditioned Unpaired Learning of Pose dependent Clothed and Textured Human Meshes, Sanyal et al.
- HIT: Estimating Internal Human Implicit Tissues from the Body Surface, Keller et al.
- VAREN: Very Accurate and Realistic Equine Network, Zuffi et al.
- Generative Proxemics: A Prior for 3D Social Interaction from Images, Müller et al.
Topics
Human Foundation Agent (HFA) · 3D human pose estimation · Embodied intelligence · Motion synthesis · Data collection · Perception · Reasoning · Embodiment · Behavior · Synthetic data · Global coordinates · Large Language Models (LLMs) · Human-computer interaction
Notes
Open for commentary — connections to other work, critiques, follow-up reading.