3D scene understanding for interactive agents

Event: CVPR Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics 2025 · Duration: 28 min

Abstract

The presentation examines the role of 3D scene understanding in developing interactive AI agents and building realistic simulation environments. Agents need to understand object properties, room layouts, and the interactions a 3D space affords. The talk showcases recent research on generating compositional 3D scenes from 2D images and text, and on detecting and understanding articulated objects so that agents can act on them.

Speakers

Angel Xuan Chang — Simon Fraser University, Amii

Talks (1)

00:00:00 — Angel Xuan Chang: 3D scene understanding for interactive agents
- This talk explores the challenges and recent advancements in 3D scene understanding, focusing on its application for interactive agents and the creation of new environments for simulation.

Key Takeaways

3D scene understanding is crucial for both creating new simulation environments and enabling intelligent agent actions.
Recent advancements leverage multi-modal learning (text, image, shape) to improve 3D shape retrieval and scene generation from 2D inputs.
DuoduoCLIP demonstrates efficient 3D understanding by using multi-view images and pre-aligned text-image encoders, achieving state-of-the-art retrieval performance.
Understanding articulated objects and their motion parameters is vital for agents to interact realistically with environments, with models like OPD and OPDMulti showing promise.
Future work extends multi-modal representation learning beyond traditional 3D scenes to biological data, such as DNA barcodes for taxonomic classification, showing how broadly the approach applies.
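
The retrieval idea in the DuoduoCLIP takeaway (embed several rendered views of a shape, embed a text query in the same space, rank shapes by similarity) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the embeddings are hand-written 3-dimensional stand-ins rather than outputs of a real encoder, the function names are hypothetical, and mean pooling over views is a simple placeholder, not necessarily how DuoduoCLIP combines views.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pool_views(view_embeddings):
    """Combine per-view embeddings into one shape embedding.

    Assumption: plain mean pooling, used here only as a placeholder.
    """
    dim = len(view_embeddings[0])
    count = len(view_embeddings)
    mean = [sum(view[i] for view in view_embeddings) / count for i in range(dim)]
    return normalize(mean)


def retrieve(text_embedding, shapes):
    """Rank shapes by cosine similarity between the query and pooled views."""
    query = normalize(text_embedding)
    scored = []
    for name, views in shapes.items():
        shape_vec = pool_views(views)
        score = sum(q * s for q, s in zip(query, shape_vec))
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: two "views" per shape, already in a shared space.
shapes = {
    "chair": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "table": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
    "lamp": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]],
}
query = [1.0, 0.1, 0.0]  # stand-in for the text embedding of "a chair"

ranking = retrieve(query, shapes)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With real encoders, the toy vectors would be replaced by image embeddings of rendered views and a text embedding from an aligned text encoder; the ranking step stays the same.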

Methods / Models / Datasets Mentioned

ScanRefer
Scan2Cap
3DVQA
nlsiam
Diorama
DuoduoCLIP
MultiScan
OPD
OPDMulti
OWLv2
SAM
CLIP
TriCoLo
ULIP
OpenShape
Uni3D
CLIBD
OPDFormer
SAPIEN: PartNet-Mobility
Motion Annotation Programs
S2O
SINGAPO

Topics

3D scene understanding · Interactive agents · Embodied AI · Multi-modal representation learning · Vision-language tasks in 3D · Generative models for 3D scenes · Articulated objects · Robotics

Notes

Open for commentary — connections to other work, critiques, follow-up reading.