Towards the 3D Human Foundation Agent

Event: CVPR Workshops, June 2024 · Duration: 216 min

Abstract

This workshop explores the concept of a 3D Human Foundation Agent (HFA), aiming to make computers more human-like by integrating perception, reasoning, embodiment, and behavior. Speakers discuss the critical need for high-quality, diverse, and dynamic human motion data, highlighting challenges in data collection, particularly for embodied intelligence and physical interactions. The session also introduces novel approaches like ChatPose, which leverages large language models to understand and generate 3D human poses from natural language, demonstrating progress in reasoning about human behavior and generalizing to complex scenarios. The discussions emphasize the importance of synthetic data, global coordinate estimation, and efficient computational methods to advance the field.

Speakers

Michael J. Black — Max Planck Institute for Intelligent Systems & Meshcapade
Karen Liu — Stanford University
Siwei Zhang — Meta Reality Labs

Talks (3)

00:00:00 — Michael J. Black: Towards the 3D Human Foundation Agent
- Introduces the concept of a Human Foundation Agent (HFA) and its four key components: perception, reasoning, embodiment, and behavior, emphasizing the need to make computers more human-like.
01:08:00 — Karen Liu: Data Collection for Embodied Intelligence
- Discusses the challenges and approaches to collecting high-quality, diverse, and dynamic human motion data for embodied intelligence, highlighting the DexCap system and the need for contextual and physical interaction data.
02:22:00 — Siwei Zhang: ChatPose: Chatting about 3D Human Pose
- Introduces ChatPose, a large language model capable of understanding and generating 3D human poses from natural language descriptions and images, demonstrating its ability to reason about poses and generalize to extreme occlusion cases.

Methods / Models / Datasets Mentioned

SMPL, Loper, Mahmood, Romero, Pons-Moll, Black, 2015
NeRFs
Gaussian Splatting
HUGS: Human Gaussian Splats, Kocabas et al., CVPR 2024
HAAR
TokenHMR
TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation, Dwivedi et al. CVPR 2024
VQ-VAE
WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion, Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black, CVPR 2024
DPVO (+ HMR2.0) [6, 45]
GLAMR [54]
TRACE [43]
SLAHMR [52]
DPVO [45]
WHAM (w/ DPVO [45])
WHAM (w/ DROID [44])
WHAM (w/ GT gyro)
SLAHMR [4]
PACE [5]
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling, Haiyang Liu, et al. CVPR 2024
EMAGE
Emotional Speech-Driven 3D Body Animation via Disentangled Latent Diffusion, Kiran Chhatre, et al. CVPR 2024
AMUSE
PoseScript: 3D Human Poses from Natural Language, Delmas et al. ECCV 2022
BEDLAM
HMR
GPT-4o
GPT-4
DALL-E
HMR 2.0
LLaVA
LLaVA*-S
LLaVA-P
GPT4-S
GPT4-P
ChatPose
DexCap
NVIDIA CUDA
ImageNet
Ego-Exo4D
Harmony4D
SMPLOlympics
SMPL
Mask2Former
Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild, Lingni Ma et al., 2024
HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Devices, arXiv soon
WANDR: Intention-guided Human Motion Generation, Markos Diomataris, et al.
SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes, Sanyal et al.
HIT: Estimating Internal Human Implicit Tissues from the Body Surface, Keller, et al.
VAREN: Very Accurate and Realistic Equine Network, Zuffi, et al.
Generative Proxemics: A Prior for 3D Social Interaction from Images, Müller, et al.

Topics

Human Foundation Agent (HFA) · 3D human pose estimation · Embodied intelligence · Motion synthesis · Data collection · Perception · Reasoning · Embodiment · Behavior · Synthetic data · Global coordinates · Large Language Models (LLMs) · Human-computer interaction

Notes

Open for commentary — connections to other work, critiques, follow-up reading.