3D Generative AI: Efficient, high-def & controllable

Event: CVPR WORKSHOP ON 3D GENERATIVE AI, JUNE 2024 · Duration: 371 min

Abstract

This workshop explores the latest advancements in 3D generative AI, focusing on achieving efficient, high-definition, and controllable generation of 3D assets, scenes, and even entire worlds. Speakers delve into innovative approaches for text-to-3D mesh generation, object insertion in neural 3D scenes, and the creation of common-sense indoor environments using scene graphs and diffusion models. Key themes include leveraging compositional structures, addressing challenges in 3D consistency and localized editing, and developing robust methods for human motion estimation and scene reconstruction from various inputs, including monocular videos and egocentric views. The discussions highlight the potential of diffusion models and novel architectural designs like Diffusion Transformers to push the boundaries of 3D content creation and perception.

Speakers

Adam Kortylewski — University of Freiburg & Max-Planck-Institute for Informatics
Lingjie Liu — UPenn
Michael Niemeyer — Google
Michael Oechsle — Google
Christian Theobalt — MPI-INF
Alan Yuille — JHU
Fangneng Zhan — MPI-INF
Gianluca Corrado — Wayve
Siyu Tang — ETH Zürich
Saining Xie — New York University
Jiajun Wu — Stanford University
Katerina Fragkiadaki — Carnegie Mellon University
Andrea Vedaldi — University of Oxford
Federico Tombari — Google, TUM

Talks (9)

00:00:00 — Adam Kortylewski: Welcome to the 2nd Workshop on Generative Models for Computer Vision
- Adam Kortylewski welcomes attendees, recaps the success of the previous year’s event, and notes the significant progress in generative models since then, particularly in 3D synthesis and image generation.
00:00:00 — Gianluca Corrado: Embodied AI in Autonomous Driving
- This talk discusses Wayve’s approach to autonomous driving using end-to-end embodied AI, highlighting the benefits of computational homogeneity, hardware agnosticism, agile development, and superior performance, particularly in handling long-tail scenarios and generalizing across different vehicle types.
00:54:15 — Siyu Tang: Generative Models for Human Motion Estimation
- This talk introduces a marker-based representation for human motion, which is then used to train an autoencoder to learn motion priors. These priors are subsequently used to reconstruct human motion from noisy or incomplete observations, demonstrating improved robustness and naturalness compared to previous methods.
01:53:15 — Saining Xie: Diffusion Transformers and Beyond 🚀 and why you should stop worrying and love DiT
- This talk introduces Diffusion Transformers (DiT) as a new class of diffusion models, emphasizing their simple, scalable architecture and superior performance compared to traditional U-Nets, particularly in image generation tasks and when scaled up for text-to-image synthesis.
02:35:55 — Alan Yuille: Approximate Analysis by Synthesis
- This talk advocates for “analysis by synthesis” as a framework for computer vision, where understanding an image involves generating it from a 3D model. It emphasizes the importance of 3D compositional generative networks (3D-CGNs) that learn object intrinsics and can generalize to novel data, even under occlusions, outperforming traditional deep networks in out-of-distribution tasks.
03:13:57 — Jiajun Wu: Generating Objects and Scenes and Worlds and what it means for computer vision
- This talk explores leveraging compositional structures for 3D generation, moving from single objects to complex scenes and entire worlds. It highlights the use of scene graphs to represent object relationships and attributes, enabling controlled generation and manipulation of 3D environments, with a focus on improving realism and consistency.
03:53:51 — Katerina Fragkiadaki: Image and Video Perception with Generative Feedback
- This talk introduces a generative feedback approach to perception, where discriminative models are adapted at test time using generative models. It demonstrates how this “Diffusion-TTA” method significantly boosts performance in image classification and segmentation tasks, particularly in out-of-distribution and online settings, by leveraging diffusion models to refine predictions and improve consistency.
04:28:59 — Andrea Vedaldi: 3D Generative AI: Efficient, high-def & controllable
- This talk presents a comprehensive approach to 3D generative AI, focusing on efficiency, high definition, and controllability. It introduces “Splatter Image” for fast single-view 3D reconstruction, “Free3D” for consistent multi-view generation, and “IM-3D” for high-quality texture generation, all leveraging diffusion models and emphasizing the importance of 3D-aware representations and efficient training.
05:31:40 — Federico Tombari: Generating 3D assets with Diffusion
- This talk introduces a novel approach to generating 3D assets using diffusion models, focusing on creating realistic and detailed textures for 3D meshes. It highlights the use of a scene graph representation to condition the diffusion process, enabling the generation of diverse and semantically consistent 3D scenes with controllable object attributes and relationships.
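
The talks by Jiajun Wu and Federico Tombari above both describe scene graphs as the structure that conditions 3D scene generation. As a rough illustration only, a scene graph can be stored as object nodes plus labeled relation edges. The Python sketch below is not taken from either talk: the class names (SceneObject, SceneGraph), the relation labels (left_of, standing_on), and the flattened triple format are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """A node: one object with a category and free-form attributes."""
    name: str
    category: str
    attributes: dict = field(default_factory=dict)


@dataclass
class SceneGraph:
    """Nodes are objects; directed, labeled edges are pairwise relations."""
    objects: dict = field(default_factory=dict)    # name -> SceneObject
    relations: list = field(default_factory=list)  # (subject, predicate, object)

    def add_object(self, name, category, **attributes):
        self.objects[name] = SceneObject(name, category, attributes)

    def relate(self, subject, predicate, obj):
        # Refuse edges that point at objects the graph does not contain.
        for node in (subject, obj):
            if node not in self.objects:
                raise KeyError(f"unknown object: {node}")
        self.relations.append((subject, predicate, obj))

    def to_triples(self):
        """Flatten edges to (category, predicate, category) triples.

        Assumption: a conditional generator would consume something like
        this; the exact conditioning format used in the talks is not known.
        """
        return [
            (self.objects[s].category, p, self.objects[o].category)
            for s, p, o in self.relations
        ]


def build_example():
    """Build a tiny hypothetical bedroom scene graph."""
    g = SceneGraph()
    g.add_object("bed_0", "bed", style="modern")
    g.add_object("nightstand_0", "nightstand")
    g.add_object("lamp_0", "lamp", color="white")
    g.relate("nightstand_0", "left_of", "bed_0")
    g.relate("lamp_0", "standing_on", "nightstand_0")
    return g


if __name__ == "__main__":
    for triple in build_example().to_triples():
        print(triple)
```

Changing a single attribute or edge and regenerating is what makes this kind of representation attractive for controllable generation. The sketch covers only the data structure, not the diffusion model that would consume it.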

Methods / Models / Datasets Mentioned

Lingo-1
Lingo-2
GAIA (2023)
PRISM-1
WayveScenes101
LEMO
AMASS
PROX
EgoBody
EgoHMR
RoHM
DiT
U-Net
GPT-3
Chinchilla
Imagen
LDM
PixArt-α
Sora
LRM
GigaGAN
SD v1.5
DALL-E 2
T5
CLIP
Instruct-NeRF2NeRF
PixArt-δ
PixArt-Σ
SiT
DDPM
NeRF
NeuS
VolSDF
UNISURF
Pascal3D+
OOD-CV
ResNet-50
ConvNeXt
ViT-B/16
NOVUM
DreamFusion
TextMesh
CommonScenes
ATISS
MeshLRM
Diffusion-TTA
DreamScene4D
Zero-1-to-3
MVDream
Consistent4D
BPI
ZeroNVS
Splatter Image
Free3D
IM-3D
VQGAN
PCA

Topics

3D Generative AI · Diffusion Models · Text-to-3D Generation · Scene Graphs · Embodied AI · Human Motion Estimation · Neural Distance Fields · Novel View Synthesis · Scalability in 3D Generation · Test-Time Adaptation

Notes

Open for commentary — connections to other work, critiques, follow-up reading.