2nd Workshop on Compositional 3D Vision (C3DV) and 3DCoMPaT challenge

Event: CVPR 2024 · Duration: 515 min

Abstract

This workshop segment features two main talks. The first, by Andrea Vedaldi, delves into the integration of machine learning and 3D vision, covering topics from 2D feature fusion for 3D reconstruction and handling dynamic scenes to language-controlled 3D editing and generation. The second talk, by Katerina Fragkiadaki, explores compositionality through three lenses: memory indexing for parsing novel 3D scenes, generative feedback for lifting monocular videos to 4D representations, and the unification of 2D and 3D perception using a single foundational model. Both talks highlight advancements in creating structured, compositional, and dynamic 3D AI systems.

This segment features a talk by Minhyuk Sung from KAIST on “Learning the Compositionality of 3D Objects: Splitting, Relating, and Combining Parts” at the 2nd Workshop on Compositional 3D Vision at CVPR 2024. The presentation delves into how 3D objects can be understood and manipulated by decomposing them into fundamental components, establishing relationships between these parts, and then combining them to form new structures. Drawing parallels with linguistic compositionality, the speaker discusses various methods for “splitting” objects into regions, geometric primitives, bounding primitives, keypoints, and functional representations, as well as techniques for “relating” components through unsupervised learning of covariability and intuitive meta-handles. Finally, the talk covers approaches for “combining” parts through retrieval, assembly, and diffusion models for 3D shape generation and manipulation, including applications in NeRF and SVG editing.

This video segment features a presentation by Srinath Sridhar from Brown University, delivered at the C3DV Workshop CVPR 2024. The talk introduces the concept of compositional 3D understanding and editing, emphasizing a major representational shift in 3D vision. It delves into how radiance fields, powered by MLPs, grids, and primitives, are used to encode appearance and shape from RGB images. The presentation also showcases the influence of generative AI, such as Stable Diffusion and DALL-E, in creating and manipulating 3D content, demonstrating interactive editing capabilities within complex 3D scenes.

This segment features five distinct talks on various topics in computer vision and graphics. The first talk introduces a 3D shape editing method using coupled neural shape optimization. The second presents a zero-shot approach for generating 3D human-scene interactions from text. The third discusses an unsupervised framework for discovering 3D prototypes in aerial scans. The fourth talk details a unified graph-diffusion model for 2D and 3D reassembly tasks. Finally, the fifth talk explores single mesh diffusion models with field latents for texture generation.

This segment introduces two challenges related to 3D shape understanding. The first, 3DCoMPaT++ Challenge, presents a large-scale dataset designed to foster compositional understanding of 3D shapes through detailed part-material annotations and stylized variants, benchmarking tasks like part segmentation and compositional recognition. The second, Visual Shape Inference Challenge, focuses on inferring programmatic representations of 3D objects from visual inputs, evaluating solutions based on reconstruction accuracy and program conciseness using the 3DCoMPaT++ dataset.

This segment features two invited talks on 3D vision and interaction. The first talk by Dr. Xiaojuan Qi introduces a comprehensive framework for simulating interactive 3D environments from videos, detailing advancements in depth estimation, implicit surface reconstruction, object decomposition, and dynamic scene modeling. The second talk by Angela Dai focuses on 3D perception, reconstruction, and interaction, presenting methods for leveraging synthetic priors for robust 3D understanding, generating novel 3D meshes, and enabling zero-shot 3D interactions through knowledge distillation. Both talks highlight current challenges, future directions, and the practical applications of their research in areas like virtual reality and content creation, followed by a Q&A session.

Speakers

Andrea Vedaldi — University of Oxford
Katerina Fragkiadaki — Carnegie Mellon University
Minhyuk Sung — KAIST
Srinath Sridhar — Brown University
Jingyu Hu — The Chinese University of Hong Kong
Ka-Hei Hui — The Chinese University of Hong Kong
Zhengzhe Liu — The Chinese University of Hong Kong
Hao (Richard) Zhang — Simon Fraser University
Chi-Wing Fu — The Chinese University of Hong Kong
Lei Li — Technical University of Munich
Angela Dai — Technical University of Munich
Romain Loiseau — LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, France
Elliot Vincent — Univ Gustave Eiffel, IGN, ENSG, LASTIG, France
Mathieu Aubry — INRIA Paris, France
Loic Landrieu — Univ Gustave Eiffel, IGN, ENSG, LASTIG, France
Gianluca Scarpellini — IIT PAVIS, Istituto Italiano di Tecnologia (IIT)
Stefano Fiorini — IIT PAVIS, Istituto Italiano di Tecnologia (IIT)
Francesco Giuliari — IIT PAVIS, Istituto Italiano di Tecnologia (IIT)
Pietro Morerio — Università degli Studi di Genova
Alessio Del Bue — IIT PAVIS, Istituto Italiano di Tecnologia (IIT)
Thomas W. Mitchel — Google Research
Carlos Esteves — Google Research
Ameesh Makadia — Google Research
Xiang Li — Vision-CAIR, KAUST
Habib Slim — Vision-CAIR, KAUST
Mohamed Elhoseiny — Vision-CAIR, KAUST
Aditya Ganeshan
R. Kenny Jones
Daniel Ritchie
Dr. Xiaojuan Qi — University of Hong Kong
Alexey Bokhovkin — Technical University of Munich
David Rozenberszki — Technical University of Munich
Chandan Yeshwanth — Technical University of Munich
Yueh-Cheng Liu — Technical University of Munich
Quan Meng — Technical University of Munich
Daoyi Gao — Technical University of Munich
Christian Diller — Technical University of Munich

Talks (13)

00:02:09 — Andrea Vedaldi: Structure (and composition) in 3D AI
- This talk explores the synergy between machine learning and 3D vision, covering topics from 2D feature fusion for 3D reconstruction and handling dynamic scenes to language-controlled 3D editing and generation. The speaker highlights the benefits of latent space editing for speed and generalizability.
01:24:24 — Katerina Fragkiadaki: Compositionality through memory indexing, generative feedback, and 2D/3D unification
- This talk presents three approaches to achieving compositionality in 3D AI: memory indexing for parsing novel scenes, generative feedback for lifting monocular videos to 4D, and the unification of 2D and 3D perception using a single foundational model that handles both 2D images and 3D data. It emphasizes learning from limited data and leveraging existing knowledge for robust perception and generation.
01:26:00 — Minhyuk Sung: Learning the Compositionality of 3D Objects: Splitting, Relating, and Combining Parts
- This talk explores learning the compositionality of 3D objects by breaking them into parts (splitting), understanding their interrelationships (relating), and reassembling them (combining), drawing inspiration from linguistic compositionality.
02:55:55 — Srinath Sridhar: Compositional 3D Understanding and Editing
- This talk explores the evolving landscape of 3D vision, highlighting the shift towards neural radiance fields for compositional understanding and editing of 3D scenes, and the integration of generative AI.
04:17:38 — Jingyu Hu: CNS-Edit: 3D Shape Editing via Coupled Neural Shape Optimization
- This talk introduces CNS-Edit, a novel 3D shape editing method that leverages coupled neural shape optimization for high-quality and flexible shape manipulation.
04:25:20 — Lei Li: GenZI: Zero-Shot 3D Human-Scene Interaction Generation
- This talk presents GenZI, a zero-shot approach for generating realistic 3D human-scene interactions from text descriptions, without relying on captured 3D interaction data.
04:33:30 — Romain Loiseau: Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans
- This talk introduces the Learnable Earth Parser, an unsupervised framework for discovering 3D prototypes in aerial LiDAR scans, enabling semantic and instance segmentation without extensive annotations.
04:43:29 — Stefano Fiorini: DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly
- This talk presents DiffAssemble, a unified graph-diffusion model for 2D and 3D reassembly tasks, demonstrating state-of-the-art performance in reconstructing objects from fragmented pieces.
04:56:53 — Thomas W. Mitchel: Single Mesh Diffusion Models with Field Latents for Texture Generation
- This talk introduces a novel approach for high-resolution texture synthesis on 3D meshes using latent diffusion models that operate directly on surface features, enabling generative transfer across different geometries.
05:43:38 — Xiang Li: 3DCoMPaT++ Challenge
- This talk introduces the 3DCoMPaT++ dataset and challenge, which aims to foster compositional understanding of 3D shapes by providing a large-scale dataset with detailed part-material annotations, stylized shapes, and 2D/3D data for various recognition and generation tasks.
05:47:35 — Aditya Ganeshan: Visual Shape Inference Challenge
- This talk presents the Visual Shape Inference Challenge 2024, focusing on inferring programmatic representations of 3D objects from visual input, emphasizing accuracy and parsimony in the inferred programs.
07:20:16 — Dr. Xiaojuan Qi: Learning to Simulate the 3D Visual World from Videos
- This talk presents a multi-step framework for converting videos into interactive 3D environments, covering depth estimation from monocular/stereo images, neural implicit surface reconstruction, interactive object decomposition, and dynamic scene reconstruction, followed by a Q&A session addressing model limitations, future work, applications, and ethical implications.
08:11:45 — Angela Dai: From Understanding to Interacting with the 3D World
- This talk explores 3D perception and interaction, introducing methods for learning 3D by retrieving and aligning synthetic priors, generating new 3D meshes with MeshGPT, modeling real-world complexity and dynamics, and enabling zero-shot 3D interactions through knowledge distillation from large vision-language models, followed by a Q&A session discussing various aspects of the presented models.

Key Takeaways

Fusing 2D features and semantic information into 3D models significantly enhances reconstruction, segmentation, and understanding of complex scenes.
Addressing motion and dynamic content in 3D reconstruction requires advanced tracking and deformable object modeling techniques, often leveraging large-scale internet data.
Latent space editing and language-controlled generation offer powerful and efficient ways to manipulate and create 3D content compositionally.
Unifying 2D and 3D perception through foundational models that learn from diverse data types is crucial for building robust and generalizable AI systems.
Understanding 3D object compositionality involves splitting objects into meaningful parts, learning their relationships, and combining them to create new structures.
Various representations (regions, geometric primitives, keypoints, functional representations) and learning paradigms (supervised, unsupervised, linguistic descriptions) can be used for part-level understanding.
Diffusion models, combined with implicit and explicit representations, offer powerful tools for 3D shape generation, manipulation, and editing, including applications in NeRF and SVG.
Leveraging linguistic descriptions and reference games can provide valuable supervision for learning semantic part segmentation and compositional structures in 3D.
3D vision is experiencing a significant shift towards neural radiance fields, which can encode 3D scene appearance and shape from only RGB images.
Generative AI models are increasingly being utilized to create and manipulate realistic 3D content, opening new avenues for scene understanding and editing.
Srinath Sridhar's presentation surveys state-of-the-art neural radiance field methods and their applications to compositional 3D tasks.
Interactive tools are being developed to allow users to scale and group objects within 3D scenes, demonstrating practical editing capabilities.
CNS-Edit offers a novel approach to 3D shape editing by optimizing neural shapes, achieving high-quality and flexible manipulations.
GenZI enables zero-shot generation of 3D human-scene interactions from text prompts, overcoming limitations of captured 3D data.
The Learnable Earth Parser provides an unsupervised framework for discovering 3D prototypes in aerial LiDAR scans, facilitating semantic and instance segmentation.
DiffAssemble demonstrates a unified graph-diffusion model for 2D and 3D reassembly, achieving state-of-the-art results in reconstructing fragmented objects.
Single Mesh Diffusion Models with Field Latents offer an efficient way to synthesize high-resolution textures directly on 3D mesh surfaces, supporting generative transfer across diverse geometries.
The 3DCoMPaT++ dataset provides extensive annotations for part-material compositions and stylized shape variants, crucial for advancing compositional 3D understanding.
The Visual Shape Inference Challenge aims to develop systems that can automatically infer programmatic representations of 3D shapes, offering a structured and interpretable way to model objects.
Both the 3DCoMPaT++ Challenge and the Visual Shape Inference Challenge use the 3DCoMPaT++ dataset to benchmark tasks including fine-grained part segmentation, compositional recognition, and text-based 3D editing/retrieval.
The Visual Shape Inference Challenge rewards accurate and parsimonious models of 3D shapes, with evaluation metrics that combine reconstruction accuracy and program conciseness.
Current learning paradigms for 3D understanding are often passive and, unlike human learning, lack interaction with the physical world, which motivates the creation of interactive virtual 3D environments.
Synthetic data plays a crucial role in overcoming the scarcity and quality issues of real-world 3D ground truth, enabling robust depth estimation and 3D reconstruction models.
Novel 3D representations and reconstruction techniques, such as neural implicit surfaces and 3D Gaussian Splatting, are being developed to achieve high-quality, scalable, and controllable 3D environments from videos.
Future research directions include addressing challenges in video-consistent monocular estimation, developing unified 3D representations for large-scale dynamic scenes, and leveraging foundation models for zero-shot 3D interaction generation and mesh synthesis.
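
The takeaway above about evaluation that combines reconstruction accuracy with program conciseness can be illustrated with a small Python sketch. Everything in it is an illustrative assumption rather than the challenge's official metric: the voxel-set IoU as the accuracy measure, the linear length penalty, the `length_weight` value, and the function names `iou` and `combined_score`.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def iou(pred: Set[Voxel], target: Set[Voxel]) -> float:
    """Intersection-over-union between two sets of occupied voxels."""
    union = pred | target
    if not union:
        return 1.0  # two empty shapes are treated as a perfect match
    return len(pred & target) / len(union)


def combined_score(
    pred: Set[Voxel],
    target: Set[Voxel],
    program_length: int,
    length_weight: float = 0.01,  # assumed trade-off weight, not from the source
) -> float:
    """Hypothetical score: reconstruction IoU minus a linear length penalty."""
    return iou(pred, target) - length_weight * program_length


# Toy target shape: a flat 4 x 4 slab of 16 voxels.
target = {(x, y, 0) for x in range(4) for y in range(4)}

# A short program that reconstructs only 12 of the 16 voxels (IoU = 0.75).
short_pred = {(x, y, 0) for x in range(4) for y in range(3)}
# A long program that reconstructs the slab exactly (IoU = 1.0).
long_pred = set(target)

short_score = combined_score(short_pred, target, program_length=5)
long_score = combined_score(long_pred, target, program_length=40)

print(f"short program: {short_score:.2f}")
print(f"long program:  {long_score:.2f}")
```

With this assumed weight, the shorter but less accurate program scores higher than the exact but verbose one, which is the kind of trade-off a parsimony-aware metric is meant to expose. A different weight or penalty shape would change the ranking.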

Methods / Models / Datasets Mentioned

3D DiffuserActor
3D Gaussian Splatting
3DCoMPaT++
3DCoMPaT200
ABC (CAD Model Dataset)
ABO
AKB-48
ARAP Loss
AlexNet
Align Your Gaussians
As-Plausible-As-Possible (APAP)
AtlasNet
BPNet
BSPNet
BundleFusion
CG-HOI
CLIP
CLIP-Based Compositional Structure Learning
CNS-Edit
COCO
CSG Lite
Capri-Net
ChatGPT
CoTracker
ComplementMe
Contrastive Lift
CurveNet
DALL-E
DETR3D
DGD
DINO
DINO-v2
DSMNet
Deep Functional Dictionaries
Deep Marching Tetrahedra (DMTet)
DeepMetaHandles
DeformSyncNet
DiffAssemble
DiffCAD
Diffusion-SDF
DiffusionSDF
DragAPart
DreamFusion
DreamScene4D
FL-VAEs
FLDM
FMGS
Farm3D
Feature 3DGS
Fusion 360 Gallery Dataset
GAPartNet
GARField
GET3D
GNN
GPT-4V
GPT-style Transformer
Gemini
GenZI
GeoNet
GlobFit
GoogLeNet
IM-3D
ImageNet
Instant3D
InverseCSG
KNN
Kestrel
Kinect
LAION
LERF
LION (Diffusion Model)
LVIS
LangSplat
Learnable Earth Parser
LiDAR
MagicPony
Marching-Primitives
Mask2CAD
MeshGPT
Meta Llama 3
Midjourney
Mip-NeRF
ModelNet
MonoSDF
NFD (Neural Feature Diffusion)
Nested Neural Feature Fields (N2F2)
Neural Feature Fusion Fields (N3F)
Neural Radiance Fields (NeRF)
OCC-SDF
ODIN (Omni-Dimensional INstance Segmentation)
Objaverse
Objaverse-XL
ObjectNet3D
OmniObject3D
OptCtrlPoints
PCT
PLAD
Panoptic Lifting
Park and Sung, Split, Merge, and Refine
PartGlot
PartNet
Particle videos revisited
Patch2CAD
PointNeXt
PointNet
PointNet++
PointNet++RGB
PointNet+SegFormer
PointStack
PolyGen
Posterior Distillation Sampling (PDS)
RePaint
ResNet
ResNet Decoder
SALAD (Part-Level Latent Diffusion)
SAM (Segment Anything Model)
SAPIEN
SC-GS
SDEdit
SDS Loss (Score Distillation Sampling)
SHAP-EDITOR
SIRI
SMAL
SMPL
ScanNet
ScanNet++
SceneScript
Shap-E
ShapeGlot
ShapeNet
ShapeNet-Part
ShapeTalk
ShapeWalk
Sin3DM
Single Mesh Diffusion Models with Field Latents
Stable Diffusion
TAP-Vid
Total-Decom
VGG
VectorFusion
ViT
Visual Genome
YouTube 8M

Topics

2D/3D Unification · 2D/3D reassembly · 3D Computer Vision · 3D Editing · 3D Interaction · 3D Object Compositionality · 3D Reconstruction · 3D Shape Generation · 3D Vision · 3D datasets · 3D prototypes · 3D shape editing · 3D shape understanding · Compositional 3D Understanding · Compositionality · Deformable Objects · Depth Estimation · Diffusion Models · Dynamic Scene Reconstruction · Dynamic Scenes · Gaussian Splatting · Generative AI · Interactive 3D · Interactive 3D Environments · Language-Controlled 3D Editing · Linguistic Inspiration · Machine Learning · Mesh Generation · Neural Implicit Surfaces · Neural Radiance Fields · Object Decomposition · Part Segmentation · Relational Learning · Scene Representation · Synthetic Data · Text-Guided Editing · Unsupervised Learning · aerial scans · compositional recognition · graph-diffusion models · human-scene interaction · latent diffusion models · neural shape optimization · part-material segmentation · text-to-shape retrieval · texture generation · visual program inference · zero-shot generation
