2nd Workshop on Compositional 3D Vision (C3DV) and 3DCoMPaT challenge

Event: CVPR 2024 · Duration: 515 min

Abstract

This workshop segment features two main talks. The first, by Andrea Vedaldi, delves into the integration of machine learning and 3D vision, covering topics from 2D feature fusion for 3D reconstruction and handling dynamic scenes to language-controlled 3D editing and generation. The second talk, by Katerina Fragkiadaki, explores compositionality through three lenses: memory indexing for parsing novel 3D scenes, generative feedback for lifting monocular videos to 4D representations, and the unification of 2D and 3D perception using a single foundational model. Both talks highlight advancements in creating structured, compositional, and dynamic 3D AI systems.

This segment features a talk by Minhyuk Sung from KAIST on “Learning the Compositionality of 3D Objects: Splitting, Relating, and Combining Parts” at the 2nd Workshop on Compositional 3D Vision at CVPR 2024. The presentation delves into how 3D objects can be understood and manipulated by decomposing them into fundamental components, establishing relationships between these parts, and then combining them to form new structures. Drawing parallels with linguistic compositionality, the speaker discusses various methods for “splitting” objects into regions, geometric primitives, bounding primitives, keypoints, and functional representations, as well as techniques for “relating” components through unsupervised learning of covariability and intuitive meta-handles. Finally, the talk covers approaches for “combining” parts through retrieval, assembly, and diffusion models for 3D shape generation and manipulation, including applications in NeRF and SVG editing.

This video segment features a presentation by Srinath Sridhar from Brown University, delivered at the C3DV Workshop CVPR 2024. The talk introduces the concept of compositional 3D understanding and editing, emphasizing a major representational shift in 3D vision. It delves into how radiance fields, powered by MLPs, grids, and primitives, are used to encode appearance and shape from RGB images. The presentation also showcases the influence of generative AI, such as Stable Diffusion and DALL-E, in creating and manipulating 3D content, demonstrating interactive editing capabilities within complex 3D scenes.

This segment features five distinct talks on various topics in computer vision and graphics. The first talk introduces a 3D shape editing method using coupled neural shape optimization. The second presents a zero-shot approach for generating 3D human-scene interactions from text. The third discusses an unsupervised framework for discovering 3D prototypes in aerial scans. The fourth talk details a unified graph-diffusion model for 2D and 3D reassembly tasks. Finally, the fifth talk explores single mesh diffusion models with field latents for texture generation.

This segment introduces two challenges related to 3D shape understanding. The first, 3DCoMPaT++ Challenge, presents a large-scale dataset designed to foster compositional understanding of 3D shapes through detailed part-material annotations and stylized variants, benchmarking tasks like part segmentation and compositional recognition. The second, Visual Shape Inference Challenge, focuses on inferring programmatic representations of 3D objects from visual inputs, evaluating solutions based on reconstruction accuracy and program conciseness using the 3DCoMPaT++ dataset.

This segment features two invited talks on 3D vision and interaction. The first talk by Dr. Xiaojuan Qi introduces a comprehensive framework for simulating interactive 3D environments from videos, detailing advancements in depth estimation, implicit surface reconstruction, object decomposition, and dynamic scene modeling. The second talk by Angela Dai focuses on 3D perception, reconstruction, and interaction, presenting methods for leveraging synthetic priors for robust 3D understanding, generating novel 3D meshes, and enabling zero-shot 3D interactions through knowledge distillation. Both talks highlight current challenges, future directions, and the practical applications of their research in areas like virtual reality and content creation, followed by a Q&A session.

Speakers

  • Andrea Vedaldi — University of Oxford
  • Katerina Fragkiadaki — Carnegie Mellon University
  • Minhyuk Sung — KAIST
  • Srinath Sridhar — Brown University
  • Jingyu Hu — The Chinese University of Hong Kong
  • Ka-Hei Hui — The Chinese University of Hong Kong
  • Zhengzhe Liu — The Chinese University of Hong Kong
  • Hao (Richard) Zhang — Simon Fraser University
  • Chi-Wing Fu — The Chinese University of Hong Kong
  • Lei Li — Technical University of Munich
  • Angela Dai — Technical University of Munich
  • Romain Loiseau — LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, France
  • Elliot Vincent — Univ Gustave Eiffel, IGN, ENSG, LASTIG, France
  • Mathieu Aubry — INRIA Paris, France
  • Loic Landrieu — Univ Gustave Eiffel, IGN, ENSG, LASTIG, France
  • Gianluca Scarpellini — IIT PAVIS, Istituto Italiano di Tecnologia (IIT)
  • Stefano Fiorini — IIT PAVIS, Istituto Italiano di Tecnologia (IIT)
  • Francesco Giuliari — IIT PAVIS, Istituto Italiano di Tecnologia (IIT)
  • Pietro Morerio — Università degli Studi di Genova
  • Alessio Del Bue — IIT PAVIS, Istituto Italiano di Tecnologia (IIT)
  • Thomas W. Mitchel — Google Research
  • Carlos Esteves — Google Research
  • Ameesh Makadia — Google Research
  • Xiang Li — Vision-CAIR, KAUST
  • Habib Slim — Vision-CAIR, KAUST
  • Mohamed Elhoseiny — Vision-CAIR, KAUST
  • Aditya Ganeshan
  • R. Kenny Jones
  • Daniel Ritchie
  • Dr. Xiaojuan Qi — University of Hong Kong
  • Alexey Bokhovkin — Technical University of Munich
  • David Rozenberszki — Technical University of Munich
  • Chandan Yeshwanth — Technical University of Munich
  • Yueh-Cheng Liu — Technical University of Munich
  • Quan Meng — Technical University of Munich
  • Daoyi Gao — Technical University of Munich
  • Christian Diller — Technical University of Munich

Talks (13)

  • 00:02:09 Andrea Vedaldi: Structure (and composition) in 3D AI
    • This talk explores the synergy between machine learning and 3D vision, covering topics from 2D feature fusion for 3D reconstruction and handling dynamic scenes to language-controlled 3D editing and generation. The speaker highlights the benefits of latent space editing for speed and generalizability.
  • 01:24:24 Katerina Fragkiadaki: Compositionality through memory indexing, generative feedback, and 2D/3D unification
    • This talk presents three approaches to achieving compositionality in 3D AI: memory indexing for parsing novel scenes, generative feedback for lifting monocular videos to 4D, and the unification of 2D and 3D perception using a single foundational model that handles both 2D images and 3D data. It emphasizes learning from limited data and leveraging existing knowledge for robust perception and generation.
  • 01:26:00 Minhyuk Sung: Learning the Compositionality of 3D Objects: Splitting, Relating, and Combining Parts
    • This talk explores learning the compositionality of 3D objects by breaking them into parts (splitting), understanding their interrelationships (relating), and reassembling them (combining), drawing inspiration from linguistic compositionality.
  • 02:55:55 Srinath Sridhar: Compositional 3D Understanding and Editing
    • This talk explores the evolving landscape of 3D vision, highlighting the shift towards neural radiance fields for compositional understanding and editing of 3D scenes, and the integration of generative AI.
  • 04:17:38 Jingyu Hu: CNS-Edit: 3D Shape Editing via Coupled Neural Shape Optimization
    • This talk introduces CNS-Edit, a novel 3D shape editing method that leverages coupled neural shape optimization for high-quality and flexible shape manipulation.
  • 04:25:20 Lei Li: GenZI: Zero-Shot 3D Human-Scene Interaction Generation
    • This talk presents GenZI, a zero-shot approach for generating realistic 3D human-scene interactions from text descriptions, without relying on captured 3D interaction data.
  • 04:33:30 Romain Loiseau: Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans
    • This talk introduces the Learnable Earth Parser, an unsupervised framework for discovering 3D prototypes in aerial LiDAR scans, enabling semantic and instance segmentation without extensive annotations.
  • 04:43:29 Stefano Fiorini: DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly
    • This talk presents DiffAssemble, a unified graph-diffusion model for 2D and 3D reassembly tasks, demonstrating state-of-the-art performance in reconstructing objects from fragmented pieces.
  • 04:56:53 Thomas W. Mitchel: Single Mesh Diffusion Models with Field Latents for Texture Generation
    • This talk introduces a novel approach for high-resolution texture synthesis on 3D meshes using latent diffusion models that operate directly on surface features, enabling generative transfer across different geometries.
  • 05:43:38 Xiang Li: 3DCoMPaT++ Challenge
    • This talk introduces the 3DCoMPaT++ dataset and challenge, which aims to foster compositional understanding of 3D shapes by providing a large-scale dataset with detailed part-material annotations, stylized shapes, and 2D/3D data for various recognition and generation tasks.
  • 05:47:35 Aditya Ganeshan: Visual Shape Inference Challenge
    • This talk presents the Visual Shape Inference Challenge 2024, focusing on inferring programmatic representations of 3D objects from visual input, emphasizing accuracy and parsimony in the inferred programs.
  • 07:20:16 Dr. Xiaojuan Qi: Learning to Simulate the 3D Visual World from Videos
    • This talk presents a multi-step framework for converting videos into interactive 3D environments, covering depth estimation from monocular/stereo images, neural implicit surface reconstruction, interactive object decomposition, and dynamic scene reconstruction, followed by a Q&A session addressing model limitations, future work, applications, and ethical implications.
  • 08:11:45 Angela Dai: From Understanding to Interacting with the 3D World
    • This talk explores 3D perception and interaction, introducing methods for learning 3D by retrieving and aligning synthetic priors, generating new 3D meshes with MeshGPT, modeling real-world complexity and dynamics, and enabling zero-shot 3D interactions through knowledge distillation from large vision-language models, followed by a Q&A session discussing various aspects of the presented models.

Key Takeaways

  • Fusing 2D features and semantic information into 3D models significantly enhances reconstruction, segmentation, and understanding of complex scenes.
  • Addressing motion and dynamic content in 3D reconstruction requires advanced tracking and deformable object modeling techniques, often leveraging large-scale internet data.
  • Latent space editing and language-controlled generation offer powerful and efficient ways to manipulate and create 3D content compositionally.
  • Unifying 2D and 3D perception through foundational models that learn from diverse data types is crucial for building robust and generalizable AI systems.
  • Understanding 3D object compositionality involves splitting objects into meaningful parts, learning their relationships, and combining them to create new structures.
  • Various representations (regions, geometric primitives, keypoints, functional representations) and learning paradigms (supervised, unsupervised, linguistic descriptions) can be used for part-level understanding.
  • Diffusion models, combined with implicit and explicit representations, offer powerful tools for 3D shape generation, manipulation, and editing, including applications in NeRF and SVG.
  • Leveraging linguistic descriptions and reference games can provide valuable supervision for learning semantic part segmentation and compositional structures in 3D.
  • 3D vision is experiencing a significant shift towards neural radiance fields, which can encode 3D scene appearance and shape from only RGB images.
  • Generative AI models are increasingly being utilized to create and manipulate realistic 3D content, opening new avenues for scene understanding and editing.
  • Srinath Sridhar's talk surveys state-of-the-art radiance field methods and their applications in compositional 3D tasks.
  • Interactive tools are being developed to allow users to scale and group objects within 3D scenes, demonstrating practical editing capabilities.
  • CNS-Edit offers a novel approach to 3D shape editing by optimizing neural shapes, achieving high-quality and flexible manipulations.
  • GenZI enables zero-shot generation of 3D human-scene interactions from text prompts, overcoming limitations of captured 3D data.
  • The Learnable Earth Parser provides an unsupervised framework for discovering 3D prototypes in aerial LiDAR scans, facilitating semantic and instance segmentation.
  • DiffAssemble demonstrates a unified graph-diffusion model for 2D and 3D reassembly, achieving state-of-the-art results in reconstructing fragmented objects.
  • Single Mesh Diffusion Models with Field Latents offer an efficient way to synthesize high-resolution textures directly on 3D mesh surfaces, supporting generative transfer across diverse geometries.
  • The 3DCoMPaT++ dataset provides extensive annotations for part-material compositions and stylized shape variants, crucial for advancing compositional 3D understanding.
  • The Visual Shape Inference Challenge aims to develop systems that can automatically infer programmatic representations of 3D shapes, offering a structured and interpretable way to model objects.
  • Both the 3DCoMPaT++ Challenge and the Visual Shape Inference Challenge use the 3DCoMPaT++ dataset to benchmark tasks including fine-grained part segmentation, compositional recognition, and text-based 3D editing/retrieval.
  • The Visual Shape Inference Challenge rewards accurate and parsimonious programs: its evaluation combines reconstruction accuracy with program conciseness.
  • Current learning paradigms for 3D understanding are often passive and lack interaction with the physical world, unlike human learning, necessitating the creation of interactive virtual 3D environments.
  • Synthetic data plays a crucial role in overcoming the scarcity and quality issues of real-world 3D ground truth, enabling robust depth estimation and 3D reconstruction models.
  • Novel 3D representations and reconstruction techniques, such as neural implicit surfaces and 3D Gaussian Splatting, are being developed to achieve high-quality, scalable, and controllable 3D environments from videos.
  • Future research directions include addressing challenges in video-consistent monocular estimation, developing unified 3D representations for large-scale dynamic scenes, and leveraging foundation models for zero-shot 3D interaction generation and mesh synthesis.
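
The trade-off between reconstruction accuracy and program conciseness mentioned for the Visual Shape Inference Challenge can be illustrated with a small sketch. This is not the challenge's actual metric, which this summary does not specify. It assumes shapes are compared as sets of occupied voxels using intersection-over-union, and that conciseness is a linear penalty on program length. The names `box`, `voxel_iou`, `combined_score`, and the `length_weight` parameter are all hypothetical.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def box(lo: Voxel, hi: Voxel) -> Set[Voxel]:
    """Occupied voxels of an axis-aligned box from lo (inclusive) to hi (exclusive)."""
    return {
        (x, y, z)
        for x in range(lo[0], hi[0])
        for y in range(lo[1], hi[1])
        for z in range(lo[2], hi[2])
    }


def voxel_iou(pred: Set[Voxel], target: Set[Voxel]) -> float:
    """Intersection-over-union of two voxel sets (defined as 1.0 if both are empty)."""
    union = pred | target
    if not union:
        return 1.0
    return len(pred & target) / len(union)


def combined_score(
    pred: Set[Voxel],
    target: Set[Voxel],
    program_tokens: int,
    length_weight: float = 0.01,  # assumed penalty per program token
) -> float:
    """Hypothetical score: reconstruction IoU minus a linear program-length penalty."""
    if program_tokens < 0:
        raise ValueError("program_tokens must be non-negative")
    return voxel_iou(pred, target) - length_weight * program_tokens


# Target shape: a 4x4x4 cube (64 voxels).
target = box((0, 0, 0), (4, 4, 4))

# A short program that reconstructs only a 4x4x3 slab (48 voxels, IoU 0.75).
coarse = box((0, 0, 0), (4, 4, 3))
coarse_score = combined_score(coarse, target, program_tokens=7)

# A long program that reconstructs the target exactly (IoU 1.0).
exact_score = combined_score(target, target, program_tokens=40)
```

With the assumed weight of 0.01 per token, the short, approximate program scores about 0.68 and the long, exact one about 0.60, so the shorter program wins. Lowering `length_weight` shifts the balance back toward reconstruction accuracy.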

Methods / Models / Datasets Mentioned

  • 3D DiffuserActor
  • 3D Gaussian Splatting
  • 3DCoMPaT++
  • 3DCoMPaT200
  • ABC (CAD Model Dataset)
  • ABO
  • AKB-48
  • ARAP Loss
  • AlexNet
  • Align Your Gaussians
  • As-Plausible-As-Possible (APAP)
  • AtlasNet
  • BPNet
  • BSPNet
  • BundleFusion
  • CG-HOI
  • CLIP
  • CLIP-Based Compositional Structure Learning
  • CNS-Edit
  • COCO
  • CSG Lite
  • CAPRI-Net
  • ChatGPT
  • CoTracker
  • ComplementMe
  • Contrastive Lift
  • CurveNet
  • DALL-E
  • DETR3D
  • DGD
  • DINO
  • DINO-v2
  • DSMNet
  • Deep Functional Dictionaries
  • Deep Marching Tetrahedra (DMTet)
  • DeepMetaHandles
  • DeformSyncNet
  • DiffAssemble
  • DiffCAD
  • Diffusion-SDF
  • DragAPart
  • DreamFusion
  • DreamScene4D
  • FL-VAEs
  • FLDM
  • FMGS
  • Farm3D
  • Feature 3DGS
  • Fusion 360 Gallery Dataset
  • GAPartNet
  • GARField
  • GET3D
  • GNN
  • GPT-4V
  • GPT-style Transformer
  • Gemini
  • GenZI
  • GeoNet
  • GlobFit
  • GoogLeNet
  • IM-3D
  • ImageNet
  • Instant3D
  • InverseCSG
  • KNN
  • Kestrel
  • Kinect
  • LAION
  • LERF
  • LIION (Diffusion Model)
  • LVIS
  • LangSplat
  • Learnable Earth Parser
  • LiDAR
  • MagicPony
  • Marching-Primitives
  • Mask2CAD
  • MeshGPT
  • Meta Llama 3
  • Midjourney
  • Mip-NeRF
  • ModelNet
  • MonoSDF
  • NFD (Neural Feature Diffusion)
  • Nested Neural Feature Fields (N2F2)
  • Neural Feature Fusion Fields (N3F)
  • Neural Radiance Fields (NeRF)
  • OCC-SDF
  • ODIN (Omni-Dimensional INstance Segmentation)
  • Objaverse
  • Objaverse-XL
  • ObjectNet3D
  • OmniObject3D
  • OptCtrlPoints
  • PCT
  • PLAD
  • Panoptic Lifting
  • Park and Sung, Split, Merge, and Refine
  • PartGlot
  • PartNet
  • Particle videos revisited
  • Patch2CAD
  • PointNeXt
  • PointNet
  • PointNet++
  • PointNet++RGB
  • PointNet+SegFormer
  • PointStack
  • PolyGen
  • Posterior Distillation Sampling (PDS)
  • RePaint
  • ResNet
  • ResNet Decoder
  • SALAD (Part-Level Latent Diffusion)
  • SAM (Segment Anything Model)
  • SAPIEN
  • SC-GS
  • SDEdit
  • SDS Loss (Score Distillation Sampling)
  • SHAP-EDITOR
  • SIRI
  • SMAL
  • SMPL
  • ScanNet
  • ScanNet++
  • SceneScript
  • Shap-E
  • ShapeGlot
  • ShapeNet
  • ShapeNet-Part
  • ShapeTalk
  • ShapeWalk
  • Sin3DM
  • Single Mesh Diffusion Models with Field Latents
  • Stable Diffusion
  • TAP-Vid
  • Total-Decom
  • VGG
  • VectorFusion
  • ViT
  • Visual Genome
  • YouTube 8M

Topics

2D/3D Unification · 2D/3D Reassembly · 3D Computer Vision · 3D Editing · 3D Interaction · 3D Object Compositionality · 3D Reconstruction · 3D Shape Generation · 3D Vision · 3D Datasets · 3D Prototypes · 3D Shape Editing · 3D Shape Understanding · Compositional 3D Understanding · Compositionality · Deformable Objects · Depth Estimation · Diffusion Models · Dynamic Scene Reconstruction · Dynamic Scenes · Gaussian Splatting · Generative AI · Interactive 3D · Interactive 3D Environments · Language-Controlled 3D Editing · Linguistic Inspiration · Machine Learning · Mesh Generation · Neural Implicit Surfaces · Neural Radiance Fields · Object Decomposition · Part Segmentation · Relational Learning · Scene Representation · Synthetic Data · Text-Guided Editing · Unsupervised Learning · Aerial Scans · Compositional Recognition · Graph-Diffusion Models · Human-Scene Interaction · Latent Diffusion Models · Neural Shape Optimization · Part-Material Segmentation · Text-to-Shape Retrieval · Texture Generation · Visual Program Inference · Zero-Shot Generation

