Understanding 3D Humans in Contextual 3D Spaces

Event: CVPR 2025 Workshop · Duration: 25 min

Abstract

This talk covers how to understand 3D human motion within contextual 3D spaces. The speaker, Hanbyul Joo, argues that capturing people together with their surroundings is needed to interpret both human-object and human-human interactions. The presentation has two main thrusts. The first is motion capture: MocapEveryone, a lightweight wearable system for long-term, in-the-wild capture, and ParaHome, a room-scale system for accurate capture of body, finger, and articulated-object motion. The second is DAVID, a method that learns dynamic 3D Human-Object Interactions (HOI) from synthetic imagery produced by pre-trained video diffusion models. Using generative models in this way is meant to work around the limits of real-world data collection and to enable the synthesis of diverse, controllable human behaviors in complex environments.

Speakers

Hanbyul Joo — Seoul National University

Talks (34)

00:00:00 — Hanbyul Joo: Understanding 3D Humans in Contextual 3D Spaces
- Introduction to the talk, emphasizing the importance of understanding 3D humans in context for both human-object and human-human interactions.
00:00:28 — Hanbyul Joo: Why Humans Move the Way We Do
- Discusses two primary reasons for human movement: accomplishing physical goals (human-object interaction) and transmitting communicative signals (human-human interaction).
00:01:13 — Hanbyul Joo: Why Global 3D Human?
- Explains that capturing 3D humans in global locations with their environment and other people allows for a better understanding of human behavior in context.
00:01:49 — Hanbyul Joo: Proxemics: The Study of Human Use of Space in Social Interaction
- Introduces the concept of proxemics and how humans use body proximity to convey signals in interpersonal communication, illustrating with examples of social distances.
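
As a rough illustration of the proxemics idea above, the sketch below maps an interpersonal distance to one of Hall's four zones. The thresholds (0.45 m, 1.2 m, 3.6 m) are the commonly cited values from the proxemics literature, not numbers taken from the talk, and the function name `proxemic_zone` is hypothetical.

```python
import math

# Commonly cited upper bounds (in metres) for Hall's proxemic zones.
# Assumption for illustration only: the talk summary does not state thresholds.
ZONE_UPPER_BOUNDS = [(0.45, "intimate"), (1.2, "personal"), (3.6, "social")]


def proxemic_zone(distance_m: float) -> str:
    """Return the proxemic zone name for a distance between two people."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    for upper, name in ZONE_UPPER_BOUNDS:
        if distance_m <= upper:
            return name
    return "public"


# Two people on a floor plane, positions in metres.
person_a = (0.0, 0.0)
person_b = (0.6, 0.8)
distance = math.dist(person_a, person_b)
print(proxemic_zone(distance))
```

With global 3D positions for each person, the same lookup could be applied per frame to see how a group's spacing changes over a conversation.
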
00:03:01 — Hanbyul Joo: The Panoptic Studio
- Describes the Panoptic Studio, a multi-camera system used to capture social interactions, and an experiment called the ‘Haggling Game’ to study social behaviors.
00:04:20 — Hanbyul Joo: Social Formation Prediction
- Demonstrates predicting the desired location of a target person in a social interaction using the locations and orientations of other people, revealing specific social patterns.
00:06:02 — Hanbyul Joo: Understanding Human-Human Interactions from in-the-wild images
- Highlights the new research opportunity to measure human behavior from in-the-wild images and videos, leveraging advancements in 3D human reconstruction.
00:06:57 — Hanbyul Joo: How To Capture Everyday 3D Motions
- Introduces MocapEveryone, a lightweight wearable mocap system using a head-mounted camera and smartwatches for long-term, in-the-wild motion capture.
00:07:39 — Hanbyul Joo: Mocap Everyone Everywhere: Challenges
- Discusses challenges in MocapEveryone, including different modalities, sparsity and ambiguity of sensor inputs, and robustness in outdoor/large-scale scenes.
00:08:00 — Hanbyul Joo: Multimodal Sensor Alignment
- Explains the pipeline for aligning video and IMU signals, using SLAM for world coordinates and point clouds, and tracking floor levels to estimate human height.
00:09:09 — Hanbyul Joo: Mocap Everywhere, Everything, Everyone
- Shows examples of MocapEveryone capturing full-body motion in outdoor scenes and multi-person social interactions by aligning multiple wearable systems.
00:10:10 — Hanbyul Joo: How To Capture Everyday 3D Motions (ParaHome)
- Introduces ParaHome, a system for capturing accurate 3D motions of body, finger, and articulated objects in a room-scale environment for human-object interactions.
00:10:17 — Hanbyul Joo: ParaHome: Parameterizing Everyday Home Activities Towards 3D Generative Modeling of Human-Object Interactions
- Details the ParaHome system, which captures dexterous hand manipulation, body motion, and object dynamics in natural room settings, including sequential manipulation scenes and text descriptions.
00:11:22 — Hanbyul Joo: ParaHome System and Dataset
- Describes the ParaHome dataset, which includes 486 minutes of diverse sequences from 38 subjects and 5476 text descriptions, along with 3D scanned and articulated object models.
00:12:50 — Hanbyul Joo: Affordance During HOI
- Illustrates how the ParaHome system can estimate pseudo-contact maps and distances between human hands and objects, providing insights into affordance.
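
One simple way to picture a pseudo-contact map is as a nearest-distance test between object surface points and hand points. The sketch below is an assumption about how such a map could be computed, not the ParaHome implementation; the function name `pseudo_contact_map` and the 2 cm threshold are hypothetical.

```python
import numpy as np


def pseudo_contact_map(object_points, hand_points, threshold=0.02):
    """For each object point, return its distance to the nearest hand point
    and a boolean flag marking it as 'in contact' when that distance is
    below `threshold` (metres, hypothetical value).

    object_points: (N, 3) array, hand_points: (M, 3) array, same coordinate frame.
    """
    # Pairwise differences via broadcasting: shape (N, M, 3).
    diff = object_points[:, None, :] - hand_points[None, :, :]
    pairwise = np.linalg.norm(diff, axis=-1)  # (N, M) distances
    min_dist = pairwise.min(axis=1)           # nearest hand point per object point
    return min_dist, min_dist < threshold


# Toy data: one object point 1 cm from the hand, one a metre away.
obj = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
hand = np.array([[0.0, 0.0, 0.01]])
min_dist, in_contact = pseudo_contact_map(obj, hand)
```

Accumulating the boolean flags over many frames would give a per-point contact frequency, which is one way to read affordance off captured interaction data.
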
00:13:55 — Hanbyul Joo: Synthesizing Body Motion for Desired Object Manipulation
- Explores the application of ParaHome data to synthesize human body motion given a sequence of object states, demonstrating the ability to generate desired human-object interactions.
00:14:38 — Hanbyul Joo: Learning Dynamic 3D HOI from Images?
- Poses the challenge of learning dynamic 3D HOI directly from in-the-wild images and videos, which are often noisy, biased, and lack diversity.
00:15:03 — Hanbyul Joo: DAVID: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models
- Introduces DAVID, a method that leverages pre-trained video diffusion models to synthesize images and videos of humans interacting with specific objects, allowing for better control over viewpoints and scenarios.
00:16:09 — Hanbyul Joo: Why Learning 3D HOI From Synthetic Imagery?
- Justifies the use of synthetic imagery for learning 3D HOI due to the challenges of real-world data, such as noise, clutter, biased viewpoints, and limited diversity.
00:16:18 — Hanbyul Joo: Can We Learn 3D HOI from Internet Images?
- Compares the limitations of internet image search results for HOI (noise, clutter, bias) with the controllability offered by diffusion models for generating diverse synthetic data.
00:16:57 — Hanbyul Joo: Learning from Pre-Trained Diffusion Model
- Shows how pre-trained diffusion models can generate photo-realistic images with better controllability, allowing for diverse viewpoints and scenarios for HOI learning.
00:18:00 — Hanbyul Joo: How To Capture Everyday 3D Motions (Summary of Lab’s Efforts)
- Summarizes the lab’s efforts in 3D spatial layout (CHORUS), comprehensive affordance (ComA), and dynamic affordance (DAVID) for capturing and understanding everyday 3D motions.
00:18:21 — Hanbyul Joo: DAVID: Modeling Dynamic Affordance (Two Step Pipeline)
- Outlines the two-step pipeline of DAVID: first, generating 4D HOI samples from a 3D object, and second, learning dynamic affordance from these samples.
00:19:09 — Hanbyul Joo: 4D HOI Sample Generation
- Details the process of generating 4D HOI samples by rendering 3D objects from chosen viewpoints, extracting Canny edges, and synthesizing 2D HOI videos using ControlNet and image-to-video diffusion.
00:19:45 — Hanbyul Joo: Obtaining Human and Object Motions
- Explains how to obtain world-grounded human and object motions from the synthesized 2D HOI videos using HMR and PnP with 2D object correspondences.
00:21:01 — Hanbyul Joo: Resolving Depth Ambiguity
- Describes how to resolve depth ambiguity for the object by using estimated metric depth and contact cues to optimize the object’s depth and scale.
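
The depth ambiguity mentioned above follows from pinhole projection: scaling an object and its distance from the camera by the same factor leaves its image unchanged. The sketch below demonstrates this and then picks the factor from a single contact cue. It is a simplified closed-form stand-in for the optimization described in the talk; the function names and the single-contact-point assumption are hypothetical.

```python
import numpy as np


def project(points_cam, focal=1000.0):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel offsets
    from the principal point."""
    return focal * points_cam[:, :2] / points_cam[:, 2:3]


def rescale_along_rays(points_cam, s):
    """Scale the object's size and depth together by s.
    Every point slides along its own camera ray, so its projection is unchanged."""
    return s * points_cam


def scale_from_contact(contact_point_cam, hand_depth):
    """Choose s so that the object's contact point lands at the hand's metric depth.
    Assumes one known contact point (hypothetical simplification)."""
    return hand_depth / contact_point_cam[2]


# Object points from a 2D-3D fit with an arbitrary depth/scale guess.
obj = np.array([[0.1, 0.0, 2.0], [0.3, 0.1, 2.2], [-0.1, 0.2, 1.9]])
hand_depth = 3.0  # metric depth of the touching hand, e.g. from a depth estimator

s = scale_from_contact(obj[0], hand_depth)
obj_fixed = rescale_along_rays(obj, s)
```

Because the projection is identical before and after rescaling, the 2D fit alone cannot choose `s`; the metric depth and contact cue supply the missing constraint.
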
00:21:47 — Hanbyul Joo: Obtaining Human and Object Motions (Final Alignment)
- Illustrates the final alignment of human, object, and camera motions in a common world coordinate system after resolving depth ambiguities.
00:22:19 — Hanbyul Joo: Results: Dynamic Affordance for Various Objects
- Presents results of the DAVID pipeline, showing generated 2D HOI videos and corresponding 4D HOI samples for various objects like barbells, guitars, carts, and scooters.
00:22:58 — Hanbyul Joo: Learning Dynamic Affordance: Learning Human Motion Patterns with LoRA Module
- Explains how a LoRA module is used to fine-tune a pre-trained motion diffusion model to learn object-specific HOI motion patterns, enabling diverse motion sampling.
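
For readers unfamiliar with LoRA, the sketch below shows the general low-rank adaptation idea on a single linear layer: the pre-trained weight stays frozen and only two small matrices are trained. It uses plain NumPy and is not the DAVID code; the class name `LoRALinear` and its parameters are hypothetical.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x @ (W + (alpha / rank) * B @ A).T with W frozen."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                          # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))  # trainable, small random init
        self.B = np.zeros((out_dim, rank))            # trainable, zero init
        self.scale = alpha / rank

    def delta(self):
        """Low-rank update added to the frozen weight; its rank is at most `rank`."""
        return self.scale * self.B @ self.A

    def __call__(self, x):
        return x @ (self.weight + self.delta()).T


W = np.arange(12, dtype=float).reshape(3, 4)
layer = LoRALinear(W, rank=2)
x = np.ones((5, 4))
y_before = layer(x)  # B is zero, so this equals the frozen layer's output
```

Zero-initializing `B` means the adapted model starts out identical to the pre-trained one, and each object can get its own small `(A, B)` pair while sharing the same base weights.
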
00:23:42 — Hanbyul Joo: Learning Dynamic Affordance: Learning Human Guided Object Pose Patterns
- Describes the training and inference process for learning dynamic affordance, incorporating smoothness and contact guidance for object pose patterns.
00:23:51 — Hanbyul Joo: Learning Dynamic Affordance: Synthesizing Diverse HOI Motions
- Demonstrates the synthesis of diverse HOI motions, including complex interactions with multiple objects, by fusing different LoRA modules for motion interpolations.
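
A common way to combine LoRA modules is to add a weighted sum of their low-rank updates to the shared base weight. The sketch below shows that arithmetic only; whether DAVID fuses modules exactly this way is an assumption, and `fuse_lora` is a hypothetical helper.

```python
import numpy as np


def fuse_lora(base_weight, deltas, weights):
    """Return base_weight + sum_i weights[i] * deltas[i].

    deltas: list of weight updates, each with the same shape as base_weight
            (for LoRA, each would be a scaled product B @ A).
    weights: list of blending coefficients, one per update.
    """
    if len(deltas) != len(weights):
        raise ValueError("need one blending weight per LoRA update")
    fused = np.array(base_weight, dtype=float)  # copy so the base stays untouched
    for delta, w in zip(deltas, weights):
        fused += w * delta
    return fused


base = np.zeros((2, 2))
delta_a = np.ones((2, 2))        # stand-in for the update learned for object A
delta_b = 2.0 * np.ones((2, 2))  # stand-in for the update learned for object B

halfway = fuse_lora(base, [delta_a, delta_b], [0.5, 0.5])
only_a = fuse_lora(base, [delta_a, delta_b], [1.0, 0.0])
```

Sweeping the blending weights between `[1, 0]` and `[0, 1]` is one way to interpolate between two object-specific motion patterns.
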
00:24:46 — Hanbyul Joo: Why Global 3D Human? (Conclusion)
- Recaps the talk, reiterating the importance of understanding 3D humans in context for both human-object and human-human interactions.
00:24:53 — Hanbyul Joo: References
- Lists the academic papers and projects discussed during the presentation.
00:24:58 — Hanbyul Joo: Collaborators
- Acknowledges the team members and collaborators involved in the presented research.

Key Takeaways

Capturing 3D humans in global coordinates, together with their environment and other people, makes it possible to interpret human-object and human-human interactions in context.
New motion capture systems, MocapEveryone (wearable) and ParaHome (multi-camera, room-scale), are being developed to collect diverse, high-resolution 3D human motion data in more natural environments.
Generative models, particularly pre-trained diffusion models, can synthesize diverse and controllable human-object interaction imagery, which helps work around the noise, clutter, and viewpoint bias of real-world data.
The DAVID pipeline demonstrates a two-step process to model dynamic affordance by generating 4D HOI samples from 3D objects using diffusion models and then learning specific human motion patterns with LoRA modules.

Methods / Models / Datasets Mentioned

Panoptic Studio
Haggling Game
MocapEveryone
ParaHome
World-Grounded HMR
ControlNet
Video Diffusion (Kling AI)
LoRA module
PnP (Perspective-n-Point) algorithm
DROID-SLAM
SMPL (human body model)
CHORUS
ComA
DAVID

Topics

3D Human Understanding · Human-Object Interaction (HOI) · Human-Human Interaction · Motion Capture · Proxemics · Social Behavior · Synthetic Data Generation · Diffusion Models · Dynamic Affordance · Wearable Sensors
