The First Workshop on AI for 3D Generation

Event: CVPR 2024 · Duration: 516 min

Abstract

This segment covers the opening remarks of the “First Workshop on AI for 3D Generation” at CVPR 2024, followed by three keynote/spotlight talks. The presentations cover efficient, high-definition, and controllable 3D generative AI, exploring text-to-3D and image-to-3D generation and addressing challenges such as data scarcity and multi-view consistency. The talks also highlight the application of 3D generation to physical intelligence in robotics, including robot physical reconstruction, design, and interaction, and introduce multi-modal generation, demonstrating how image generation can be controlled by inputs such as text, spatial signals, and audio.

The next segment features three talks on 3D generative models. Duygu Ceylan from Adobe discusses extending 2D image generation to video editing using cross-frame attention and explores generative rendering for 3D. Dongsu Zhang from Seoul National University and NVIDIA presents Generative Cellular Automata (GCA) and its hierarchical version (hGCA) for scalable 3D scene completion from sparse LiDAR data, introducing concepts such as sparse voxel embedding and a global consistency planner. Varun Jampani from Stability AI then details methods for adapting large image and video diffusion models for 3D generation, covering direct 3D generation, multi-view synthesis, and articulated shape reconstruction, including models such as TripoSR, Stable Zero123, MVD-Fusion, SV3D, DreamBooth3D, and ARTIC3D.

The following segment explores advancements in 3D generative AI, starting with novel multi-view synthesis techniques such as SV3D for consistent video generation and its application in 3D object reconstruction. It then turns to methods for predicting more than basic 3D shapes and textures, including estimating HDR lighting from single images (DiffusionLight) and parametrically controlling material properties (Alchemist) or transferring materials (ZeST) using diffusion models. The latter part of the segment focuses on controllable 3D generation, addressing challenges in editing conditional radiance fields, improving alignment with 2D inputs using 3D labels, and ensuring cross-view consistency during editing. Finally, it introduces FlashTex, a solution for fast relightable mesh texturing that uses LightControlNet and a distilled encoder to generate PBR materials from text prompts, disentangling lighting from textures and significantly speeding up the 3D generation process.

Jun Gao's segment introduces the challenges and opportunities in 3D content creation using AI, contrasting it with 2D image and video generation. He highlights the scarcity of 3D data and the inherent complexity of 3D representations, which include geometry, materials, and lighting. The presentation then shows how computer graphics techniques, particularly 3D mesh modeling, can be integrated with machine learning to overcome these challenges. He proposes a differentiable iso-surfacing approach to bridge neural fields from AI with 3D meshes from graphics, enabling efficient and high-fidelity 3D content generation.

Alex Yu's segment introduces Dream Machine, a video generation model from Luma AI, highlighting its ability to generate high-quality, realistic videos from text and images. He discusses how Dream Machine learns 3D structure, light transport, and dynamics from watching videos, showcasing generated videos with 3D camera movements, reflections, and physical interactions. The presentation also touches on the scalability challenges of 3D data compared to 2D data and proposes 3D as a fine-tuning application of visual foundation models, demonstrating how Dream Machine can generate interactive 3D scenes from single images.

Sergey Tulyakov's talk explores the evolution and future of volumetric generation, from historical 3D creations to cutting-edge AI models. It highlights the significant advancements in generative AI, particularly with NeRFs and diffusion models, that enable the creation of 3D and 4D content from images or text. He addresses the challenge of 3D data scarcity and proposes leveraging 2D priors for 3D/4D generation, showcasing methods such as DreamFusion, Instant3D, GTR, and SceneWiz. The presentation concludes by introducing 4Real, a novel approach for generating realistic 4D Gaussian splats from generated videos, emphasizing the potential of foundational video models for complex volumetric content creation.

Speakers

Davis Rempe — NVIDIA Research
Andrea Vedaldi — University of Oxford, Meta
Ruoshi Liu — Columbia University
Duygu Ceylan — Adobe Research
Georgios Pavlakos — UT Austin
Dongsu Zhang — Seoul National University, NVIDIA
Varun Jampani — Stability AI
Jun-Yan Zhu — CMU School of Computer Science
Jun Gao — University of Toronto, NVIDIA, Vector Institute
Alex Yu — Luma AI
Sergey Tulyakov — Snap Research

Talks (16)

00:00:17 — Andrea Vedaldi: 3D Generative AI: Efficient, high-def & controllable
- Discusses progress in 3D generative AI, focusing on text-to-3D and image-to-3D generation, addressing challenges like data scarcity and multi-view consistency using diffusion models and novel reconstruction techniques.
00:40:50 — Ruoshi Liu: 3D Generation for Physical Intelligence
- Explores how 3D generation can enhance physical intelligence in robotics, focusing on physical reconstruction, design, and interaction, leveraging large-scale datasets and differentiable rendering for robot control and tool design.
01:16:34 — Duygu Ceylan: Towards Multi-modal Generation
- Discusses advancements in multi-modal generative AI, particularly focusing on controlling image generation with text, spatial signals (depth, edges, pose), and audio, using diffusion models with cross-attention mechanisms.
01:26:01 — Duygu Ceylan: How about generating other modalities?
- Discusses extending 2D image generation models to video editing and 3D generative rendering, highlighting cross-frame attention and 3D-aware noise initialization.
01:36:01 — Dongsu Zhang: Scalable Scene Completion with Generative Cellular Automata
- Presents Generative Cellular Automata (GCA) and its hierarchical extension (hGCA) for scalable 3D scene completion from sparse and noisy LiDAR data, including continuous geometry and global consistency.
01:54:21 — Varun Jampani: Adapting Image and Video Generative Models for 3D Generation
- Explores techniques for generating 3D content from 2D inputs by adapting large image and video diffusion models, focusing on multi-view generation and articulated shape reconstruction.
02:52:05 — Varun Jampani: Novel Multi-view Synthesis – Static Orbits
- Introduces a method for novel multi-view synthesis using SV3D to generate consistent videos from a single image, supporting both static and dynamic camera orbits.
04:18:05 — Jun Gao: 3D Representations for 3D Content Creation
- Covers 3D representations for 3D content creation, highlighting the scarcity and complexity of 3D data compared to 2D data, and proposes solutions that draw on computer graphics knowledge.
05:44:06 — Alex Yu: Dream Machine: Advancing 3D in an era of large models
- Introduces Dream Machine, a video generation model by Luma AI, covering how it learns 3D structure, light transport, and dynamics from videos, and how it is applied to generate interactive 3D scenes from single images.
06:00:51 — Jun-Yan Zhu: Controllable 3D Generation
- Introduces the concept of controllable 3D generation, highlighting the challenges and opportunities in creating and editing 3D assets.
06:06:31 — Jun-Yan Zhu: Additional Editing Examples
- Provides more examples of color and shape editing on various 3D objects, showcasing the flexibility and effectiveness of the proposed editing method.
06:11:56 — Jun-Yan Zhu: Baseline: EG3D-c
- Introduces EG3D-c, a conditional baseline that encodes a 2D label map into a latent code to modulate the NeRF block, allowing for conditional 3D generation.
06:18:50 — Jun-Yan Zhu: Baked-in Lighting
- Discusses the problem of ‘baked-in lighting’ in text-to-3D generation, where lighting information is embedded into texture maps, making relighting difficult.
06:24:01 — Jun-Yan Zhu: Inference with LightControlNet
- Demonstrates LightControlNet’s inference capabilities, showing how it can generate different materials (leather, metal, wooden) while consistently following input lighting conditions.
06:29:11 — Jun-Yan Zhu: More results
- Showcases additional results of FlashTex generating various materials (wooden, stone, marble, metal) on different goblet shapes, with consistent relighting under various environment maps.
07:10:13 — Sergey Tulyakov: Volumetric Generation of Objects, Scenes, and Videos
- An overview of the evolution and future of volumetric generation, from historical 3D creations to modern AI models capable of generating 3D and 4D content.
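
The cross-frame attention mentioned in the 01:26:01 talk can be illustrated with a minimal sketch. It assumes, for illustration only, that each frame is a short list of token feature vectors and that identity projections stand in for the learned query/key/value matrices. The function names are hypothetical and this is not the speaker's implementation. The idea shown is that queries come from the frame being edited while keys and values come from an anchor frame, so every output token is a weighted mix of anchor-frame features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def cross_frame_attention(current_tokens, anchor_tokens):
    """Queries from the current frame; keys and values from the anchor frame.

    Assumption: identity projections replace the learned Q/K/V matrices.
    """
    return attention(current_tokens, anchor_tokens, anchor_tokens)


if __name__ == "__main__":
    anchor = [[1.0, 0.0], [0.0, 1.0]]   # two anchor-frame tokens
    current = [[10.0, 0.0]]             # one current-frame token
    print(cross_frame_attention(current, anchor))
```

Because each output row is a convex combination of anchor-frame tokens, the edited frame is pulled toward the anchor's features, which is one plausible reading of why this mechanism helps temporal consistency.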

Key Takeaways

3D generative AI is rapidly advancing, with diffusion models proving highly effective for text-to-3D and image-to-3D tasks, despite challenges like data scarcity.
Leveraging pre-trained 2D image/video generators and fine-tuning them for multi-view synthesis is a common and effective strategy for 3D content generation.
Differentiable rendering is a crucial technique for bridging the gap between visual data and robot control, enabling robots to learn from diverse visual inputs and perform complex physical interactions.
Multi-modal control signals, including text, spatial information (depth, pose), and audio, can significantly enhance the controllability and expressiveness of generative AI models for image and 3D content creation.
2D image generation techniques, particularly attention mechanisms, can be effectively adapted for video editing to maintain temporal consistency.
3D generative models are evolving towards universal models that can integrate various modalities (text, spatial control, audio) and output diverse representations (RGB, intrinsics, video, 3D).
Leveraging sparsity and connectivity through Generative Cellular Automata (GCA) allows for scalable and high-fidelity 3D scene completion, even from noisy and incomplete real-world data.
Adapting powerful 2D image and video diffusion models for 3D generation, through techniques like multi-view synthesis and novel view prompting, is a promising direction for creating diverse and controllable 3D content.
SV3D enables consistent multi-view video generation from single images, forming a foundation for high-quality 3D object reconstruction by optimizing representations such as NeRF and DMTet.
Diffusion models can be extended beyond basic 3D shape generation to predict complex attributes like environment lighting (DiffusionLight) and precise material properties (Alchemist, ZeST), offering new avenues for 3D content creation.
Controllable 3D generation is enhanced by using 2D label maps as conditioning inputs and by explicitly predicting 3D semantic labels, leading to better alignment and cross-view consistency during editing.
FlashTex, powered by LightControlNet and a distilled encoder, provides a fast and efficient solution for generating relightable PBR materials from text prompts, effectively disentangling lighting from textures in 3D assets.
3D content creation with AI faces significant challenges due to 3D data scarcity (10M objects vs. 5B 2D images) and the inherent complexity of 3D representations (geometry, materials, lighting).
Neural fields from AI offer continuous 3D representations suitable for machine learning and complex geometry but suffer from redundancy, slow rendering, and implicit surface representation.
3D meshes from computer graphics provide compact modeling, fast rendering, and explicit surface focus but are not inherently suitable for machine learning and struggle with discrete topology generation.
A differentiable iso-surfacing approach can bridge neural fields and 3D meshes, allowing for gradient-based optimization of 3D content using visual, geometric, and physical objectives, leading to improved efficiency, fidelity, and new capabilities.
Leveraging foundation models and inverse graphics can alleviate data scarcity by extracting 3D structures from 2D data, enabling diverse and realistic 3D content generation.
Dream Machine is a video generation model capable of creating high-quality, realistic videos from text instructions and images, demonstrating advanced capabilities in learning 3D structure, light transport, and dynamics from video content.
Interactive 3D scenes can be generated from single images by using Dream Machine to produce multiple consistent views and then reconstructing a 3D model from them, significantly reducing time and complexity compared to traditional methods.
3D data scales far worse than 2D image and video data: there are only millions of 3D models and captures, which are hard to create and come in many formats, versus billions of easily produced 2D images and videos in standard formats.
Future directions include improving precise camera control, addressing limitations in physics simulation (e.g., clipping, merging objects), and exploring 4D generation and embodied AI for applications like visual imagination in robotics.
3D/4D generation has seen rapid advancements, moving from manual sculpting to AI-driven methods, with NeRFs and diffusion models being foundational technologies.
The scarcity of 3D data compared to 2D data is a major bottleneck, suggesting that leveraging 2D priors is a promising path for 3D/4D generation.
Novel techniques like Instant3D, GTR, SceneWiz, and 4Real demonstrate increasing capability to generate complex objects, scenes, and dynamic 4D representations with improved quality and efficiency.
The future of 3D/4D generation likely involves reusing rich priors from 2D image and video models, enabling more realistic and controllable volumetric content.
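
The differentiable iso-surfacing takeaway rests on a simple fact that can be sketched in one dimension. When a surface vertex is placed at the zero crossing of a linearly interpolated signed distance function (SDF) along a grid edge, its position is a smooth function of the SDF values at the two endpoints, so gradients can flow from the mesh back to the field. The sketch below illustrates only that principle under this assumption. It is not the method presented in the talk, and the function names and the toy gradient-descent loop are hypothetical.

```python
def edge_crossing(a, b, sa, sb):
    """Zero crossing of a linearly interpolated SDF along the edge a-b.

    a, b: 1D coordinates of the edge endpoints.
    sa, sb: SDF values at a and b; they must have opposite signs.
    """
    if sa * sb >= 0:
        raise ValueError("edge does not cross the surface")
    t = sa / (sa - sb)
    return a + t * (b - a)


def d_crossing_d_sa(a, b, sa, sb):
    """Analytic derivative of the crossing position with respect to sa."""
    return -sb * (b - a) / (sa - sb) ** 2


def fit_crossing(target, a=0.0, b=1.0, sa=-0.25, sb=0.75, lr=0.5, steps=200):
    """Toy gradient descent: adjust sa so the crossing lands on `target`."""
    for _ in range(steps):
        x = edge_crossing(a, b, sa, sb)
        # Gradient of the squared error (x - target)^2 with respect to sa.
        grad = 2.0 * (x - target) * d_crossing_d_sa(a, b, sa, sb)
        sa -= lr * grad
    return sa, edge_crossing(a, b, sa, sb)


if __name__ == "__main__":
    print(edge_crossing(0.0, 1.0, -0.25, 0.75))  # crossing at 0.25
    print(fit_crossing(0.5))                     # sa moves so the crossing nears 0.5
```

The same interpolation is applied per edge in mesh-extraction schemes of the marching-cubes family, which is what lets an objective defined on the extracted surface update the underlying field.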

Methods / Models / Datasets Mentioned

2D U-Net
3DIM
4D-fy
4Real
ARTIC3D
Adobe Firefly
Alchemist
CAT3D
CLIP
ChatGPT
Co-Tracker
Continuous Generative Cellular Automata (cGCA)
ControlNet
ConvOcc
DALL-E 3
DASS
DCGAN
DIBR
DMTet
DatasetGAN
Deep Local Implicit Fields (DeepLIF)
DeepSDF
DefGrid
DefTet
Denoising diffusion models
Diffusion Policy
Diffusion models
DiffusionLight
Dr. Robot
Dream Machine
DreamBooth
DreamBooth3D
DreamCraft3D
DreamFusion
DreamWaltz
Dreamitate
EG3D
EscherNet
Fantasia3D
Flash3D
FlashTex
FlexiCubes
Free3D
GANs
GANverse3D
GET3D
GPT-4
GRAF
GTR
Gaussian Shell Maps
Gaussian Splats
Gaussian Splatting
GaussianDreamer
Generative Cellular Automata (GCA)
Genie
GeoDream
Google Veo
Grid Search
HI-LASSIE
HIFA
Hierarchical Generative Cellular Automata (hGCA)
IFNet
IM-3D
ImageDream
Instant3D
JS3CNet
LLFF
LRM (Large Reconstruction Model)
Latte3D
Lightplane kernels
Luma AI
LumiGAN
MVD-Fusion
MVDream
Magic123
Magic3D
MipNeRF360
NeRF
Nvdiffr
Objaverse-XL
Occupancy Networks
OpenAI Sora
Particle Swarm Optimization
Pix2NeRF
Pix2Video
Point Tracker
PointNet
ProGAN
ProlificDreamer
RealFusion
SDF
SDS Loss
SG-NN
SNES
SPADE denormalization
SceneTex
SceneWiz
Snap Video
SnapML Kit
SofGAN
SonicDiffusion
Sparse Voxel Embedding (SVE)
Splatter Image
Stable Diffusion XL
Stable Video 3D (SV3D)
Stable Video Diffusion
Stable Zero123
StyleGAN
SyncDreamer
Teddy system
Text-to-Video model
Text2Tex
Tri-Plane Representation
TripoSR
UniDepth
WGAN
ZeST
Zero-1-to-3
Zero123
Zero123-XL
ZeroShape
ZipNeRF

Topics

3D Generative AI · 3D Generative Models · 3D content creation · 3D creation history · 3D data scalability · 3D data scarcity · 3D generation · 3D learning · 3D meshes · 3D reconstruction · 3D representations · 4D generation · AI · Articulated Shape Reconstruction · Audio-conditioned generation · Cellular Automata · Computer graphics · Controllable generation · Cross-view consistency · DMTet · Differentiable iso-surfacing · Differentiable rendering · Diffusion Models · Dream Machine · FlashTex · Gaussian Splats · Generative AI · Generative Rendering · Image to 3D/4D · Image-to-3D generation · Layout generation · LightControlNet · Lighting estimation · Material properties · Multi-modal generation · Multi-view Synthesis · Multi-view consistency · NeRF · Neural fields · Physical intelligence · Robotics · SDS Loss · SV3D · Scene Completion · Scene composition · Text to 3D/4D · Text-to-3D generation · Texture generation · Tool design · VAE encoder distillation · Video Editing · Video generation · Volumetric generation · dynamics · interactive 3D scenes · light transport · visual foundation models

Notes

Open for commentary — connections to other work, critiques, follow-up reading.
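
A note on the Generative Cellular Automata talk: the general idea of a generative cellular automaton can be pictured as a stochastic update that only touches cells near the currently occupied ones, so a sparse observation grows step by step into a completed shape. The sketch below is a toy under loudly stated assumptions: a 2D grid, a square neighborhood, and above all `toy_occupancy_prob`, a hand-written stand-in for the learned model that GCA would use. All names are hypothetical and this is not the authors' code.

```python
import random

# HYPOTHETICAL "true" shape that the toy model favors: a 6x4 block of cells.
TARGET = {(x, y) for x in range(6) for y in range(4)}


def neighborhood(cells, radius=1):
    """All grid cells within a square radius of any occupied cell."""
    out = set()
    for (x, y) in cells:
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                out.add((x + dx, y + dy))
    return out


def toy_occupancy_prob(cell, cells):
    """HYPOTHETICAL stand-in for a learned transition model."""
    return 0.9 if cell in TARGET else 0.0


def step(cells, rng):
    """One stochastic update over the neighborhood of the current state.

    Each candidate cell is sampled independently.
    """
    return {c for c in neighborhood(cells)
            if rng.random() < toy_occupancy_prob(c, cells)}


def complete(seed_cells, steps=10, seed=0):
    """Grow a sparse set of occupied cells for a fixed number of steps."""
    rng = random.Random(seed)
    cells = set(seed_cells)
    for _ in range(steps):
        cells = step(cells, rng)
    return cells


if __name__ == "__main__":
    result = complete({(2, 1)})  # a single observed cell as the sparse input
    print(len(result), "occupied cells, all inside the target:", result <= TARGET)
```

In this sketch the work per step scales with the number of occupied cells and their neighbors rather than with the volume of the grid, which is the sparsity property the takeaways credit for scalability.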