Towards the 3D Human Foundation Agent

Event: CVPR Workshops, June 2024 · Duration: 216 min

Abstract

This workshop explores the concept of a 3D Human Foundation Agent (HFA), aiming to make computers more human-like by integrating perception, reasoning, embodiment, and behavior. Speakers discuss the critical need for high-quality, diverse, and dynamic human motion data, highlighting challenges in data collection, particularly for embodied intelligence and physical interactions. The session also introduces novel approaches like ChatPose, which leverages large language models to understand and generate 3D human poses from natural language, demonstrating progress in reasoning about human behavior and generalizing to complex scenarios. The discussions emphasize the importance of synthetic data, global coordinate estimation, and efficient computational methods to advance the field.

Speakers

  • Michael J. Black — Max Planck Institute for Intelligent Systems & Meshcapade
  • Karen Liu — Stanford University
  • Siwei Zhang — Meta Reality Labs

Talks (3)

  • 00:00:00 — Michael J. Black: Towards the 3D Human Foundation Agent
    • Introduces the concept of a Human Foundation Agent (HFA) and its four key components: perception, reasoning, embodiment, and behavior, emphasizing the need for making computers more human-like.
  • 01:08:00 — Karen Liu: Data Collection for Embodied Intelligence
    • Discusses the challenges and approaches to collecting high-quality, diverse, and dynamic human motion data for embodied intelligence, highlighting the DexCap system and the need for contextual and physical interaction data.
  • 02:22:00 — Siwei Zhang: ChatPose: Chatting about 3D Human Pose
    • Introduces ChatPose, a large language model capable of understanding and generating 3D human poses from natural language descriptions and images, demonstrating its ability to reason about poses and generalize to extreme occlusion cases.

Methods / Models / Datasets Mentioned

  • SMPL, Loper, Mahmood, Romero, Pons-Moll, Black, 2015
  • NeRFs
  • Gaussian Splatting
  • HUGS: Human Gaussian Splats, Kocabas, et al., CVPR'24
  • HAAR
  • TokenHMR
  • TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation, Dwivedi et al. CVPR 2024
  • VQ-VAE
  • WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion, Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black, CVPR 2024
  • DPVO (+ HMR2.0) [6, 45]
  • GLAMR [54]
  • TRACE [43]
  • SLAHMR [52]
  • DPVO [45]
  • WHAM (w/ DPVO [45])
  • WHAM (w/ DROID [44])
  • WHAM (w/ GT gyro)
  • SLAHMR [4]
  • PACE [5]
  • EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling, Haiyang Liu, et al. CVPR 2024
  • EMAGE
  • Emotional Speech-Driven 3D Body Animation via Disentangled Latent Diffusion, Kiran Chhatre, et al. CVPR 2024
  • AMUSE
  • PoseScript: 3D Human Poses from Natural Language, Delmas et al. ECCV 2022
  • BEDLAM
  • HMR
  • GPT-4o
  • GPT-4
  • DALL-E
  • HMR 2.0
  • LLaVA
  • LLaVA*-S
  • LLaVA-P
  • GPT4-S
  • GPT4-P
  • ChatPose
  • DexCap
  • NVIDIA CUDA
  • ImageNet
  • Ego-Exo4D
  • Harmony4D
  • SMPLOlympics
  • Mask2Former
  • Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild, Lingni Ma et al., 2024
  • HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Devices, arXiv soon
  • WANDR: Intention-guided Human Motion Generation, Markos Diomataris, et al.
  • SCULPT: Shape-Conditioned Unpaired Learning of Pose dependent Clothed and Textured Human Meshes, Sanyal et al.
  • HIT: Estimating Internal Human Implicit Tissues from the Body Surface, Keller, et al.
  • VAREN: Very Accurate and Realistic Equine Network, Zuffi, et al.
  • Generative Proxemics: A Prior for 3D Social Interaction from Images, Müller, et al.

Topics

Human Foundation Agent (HFA) · 3D human pose estimation · Embodied intelligence · Motion synthesis · Data collection · Perception · Reasoning · Embodiment · Behavior · Synthetic data · Global coordinates · Large Language Models (LLMs) · Human-computer interaction
