3D Foundation Models for Physical Intelligence

Event: CVPR 2024 · Duration: 376 min

Abstract

The first segment introduces 3D foundation models as a bridge between the virtual and physical worlds. It highlights large-scale 3D datasets such as Objaverse-XL, which provides 10M+ high-quality 3D assets, and the Zero123 model for novel view synthesis from 2D images. The talk then turns to physical interaction, showcasing Dr. Robot for differentiable robot rendering and control and Dreamitate for visuomotor policy learning via video generation. It also covers physical design, demonstrating an automated system for designing paper tools such as airplanes and grippers, and physical reconstruction, using thermal cameras and differentiable rendering to reconstruct occluded humans.

The second segment surveys the evolution of 3D generative models, grouping them into four generations by underlying technique and weighing the strengths and weaknesses of each, from differentiable rendering to multiview diffusion and native 3D generation. The speakers then introduce VecSet, an efficient transformer-based 3D representation developed by their team, and show its use in diffusion models along with extensions by other research groups. The presentation closes with the open challenges and future directions in high-quality, efficient, and robust 3D content generation.

The third segment features two talks on 3D representations and understanding. Jun Gao discusses the challenges of 3D content creation with AI, introducing methods such as differentiable iso-surfacing and MeshGPT for efficient 3D generation and text-to-3D. Angela Dai then emphasizes the importance of high-fidelity 3D data for understanding, presenting ScanNet++ as a benchmark dataset and ‘Adaptive Shells’ as a hybrid representation for efficient rendering. Both speakers highlight the benefits of combining graphics-based modeling with modern AI techniques to overcome 3D data scarcity and complexity.

The fourth segment features two talks on 3D vision. “How to Prepare Data for 3D Generative Foundation Models?” by Hao Tan examines data preparation for 3D generative models, highlighting the need for high-quality, aligned 3D data and the potential of large-scale 2D data for text-to-3D generation. “From CroCo to MASt3R: A paradigm change in 3D vision?” by Jerome Revaud introduces the self-supervised methods CroCo and MASt3R, which aim to unify various 3D vision tasks through cross-view completion and masked modeling, drawing inspiration from NLP foundation models.

The final segment introduces CroCo, a self-supervised learning method for 3D vision that inherently handles multi-view data and can reconstruct scenes from masked inputs. It then presents DUSt3R, a model built on CroCo that performs dense, unconstrained stereo 3D reconstruction by outputting pointmaps. These pointmaps support downstream tasks such as camera calibration, depth estimation, and pose estimation without requiring explicit poses or camera intrinsics. Finally, MASt3R extends DUSt3R with metric-scale training and local features for improved matching, achieving state-of-the-art results in map-free relocalization and multi-view stereo. The speaker emphasizes that these models represent a paradigm shift towards unifying 3D vision tasks within a single robust and fast framework.
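To make the pointmap idea concrete, here is a minimal sketch of how depth and a focal length could be read off a single pointmap. It assumes a pinhole camera with square pixels and a principal point at the image centre, and it uses a plain least-squares fit in place of whatever robust estimator the actual models use. The function names and the synthetic data generator are hypothetical and are not taken from the talk or from any released code.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]
Pointmap = List[List[Point3]]  # pointmap[v][u] = (x, y, z) in the camera frame


def depth_from_pointmap(pointmap: Pointmap) -> List[List[float]]:
    """Depth is simply the z coordinate of every 3D point."""
    return [[p[2] for p in row] for row in pointmap]


def focal_from_pointmap(pointmap: Pointmap) -> float:
    """Least-squares focal length in pixels.

    Assumption: pinhole model u - cx = f * x / z and v - cy = f * y / z,
    with (cx, cy) at the image centre. Minimising the squared pixel error
    over f gives the closed form num / den computed below.
    """
    height, width = len(pointmap), len(pointmap[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    num = den = 0.0
    for v, row in enumerate(pointmap):
        for u, (x, y, z) in enumerate(row):
            if z <= 0.0:  # skip points behind the camera
                continue
            a, b = x / z, y / z
            num += (u - cx) * a + (v - cy) * b
            den += a * a + b * b
    if den == 0.0:
        raise ValueError("pointmap is degenerate; focal length is undetermined")
    return num / den


def make_synthetic_pointmap(height: int, width: int, focal: float) -> Pointmap:
    """Hypothetical test data: a slanted plane seen by a known pinhole camera."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    rows = []
    for v in range(height):
        row = []
        for u in range(width):
            z = 2.0 + 0.1 * u + 0.05 * v
            row.append(((u - cx) * z / focal, (v - cy) * z / focal, z))
        rows.append(row)
    return rows


if __name__ == "__main__":
    pm = make_synthetic_pointmap(8, 10, focal=50.0)
    print(focal_from_pointmap(pm))
    print(depth_from_pointmap(pm)[0][:3])
```

The point of the sketch is that no calibration is supplied as input: the intrinsics are recovered from the predicted geometry itself, which is what lets a pointmap-based model skip explicit camera parameters.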

Speakers

Ruoshi Liu — Columbia University
Oscar Michel
Matt Wallingford
Matt Deitke
Shivam Duggal
Yushi Hu
Peter Wonka — King Abdullah University of Science and Technology
Biao Zhang — King Abdullah University of Science and Technology
Jun Gao — University of Toronto, NVIDIA, Vector Institute
Angela Dai — Technical University of Munich
Hao Tan — Adobe Research
Jerome Revaud — NAVER Labs Europe
Vincent Leroy — NAVER Labs Europe

Talks (9)

00:00:00 — Ruoshi Liu: 3D Foundation Models for Physical Intelligence
- This talk explores how 3D foundation models, particularly Objaverse-XL and Zero123, can bridge the gap between 2D and 3D data, enabling advancements in physical interaction, design, and reconstruction for robotics and other applications, while also introducing new evaluation metrics and differentiable rendering techniques.
01:15:17 — Peter Wonka: Towards Training Large 3D Generative Model
- This part of the talk provides an overview of the evolution of 3D generative models across four generations, discussing their respective strengths and weaknesses, and emphasizing the importance of efficient 3D representations for high-quality geometry.
01:27:17 — Biao Zhang: Towards Finding Efficient Representation for 3D Data
- This part introduces VecSet, a transformer-based efficient 3D representation, detailing its architecture and demonstrating its application in diffusion models, alongside various extensions developed by other researchers.
02:30:45 — Jun Gao: 3D Representations for 3D Content Creation
- Jun Gao discusses challenges in 3D content creation using AI, introducing differentiable iso-surfacing and MeshGPT for efficient 3D generation, text-to-3D, and adaptive shells for radiance field rendering.
02:36:35 — Angela Dai: From Quantity to Quality for 3D Understanding
- Angela Dai presents ScanNet++ as a high-fidelity 3D indoor scene dataset and introduces ‘Adaptive Shells’ as a hybrid representation for efficient and high-quality 3D rendering, emphasizing the shift from quantity to quality in 3D data for understanding.
03:45:52 — Hao Tan: How to Prepare Data for 3D Generative Foundation Models?
- This talk discusses the challenges and strategies for preparing data for 3D generative foundation models, emphasizing the importance of high-quality, aligned 3D data and leveraging large-scale 2D data for text-to-3D generation.
03:51:42 — Jerome Revaud: From CroCo to MASt3R: A paradigm change in 3D vision?
- This talk introduces CroCo and MASt3R, self-supervised learning methods for 3D vision that aim to unify various 3D tasks by leveraging cross-view completion and masked modeling, inspired by MAE and NLP foundation models.
05:01:10 — Vincent Leroy: CroCo: Self-supervised learning with Cross-View Completion
- Introduces CroCo, a self-supervised learning method for 3D vision that inherently handles multi-view data and can reconstruct scenes from masked inputs, demonstrating its ability to understand scene geometry and relative camera poses.
05:07:30 — Vincent Leroy: MASt3R: Matching And Stereo 3D Reconstruction
- Introduces MASt3R, an extension of DUSt3R that incorporates metric scale training and local features for improved matching, achieving state-of-the-art results in map-free relocalization and multi-view stereo.
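The last talk mentions local features for improved matching. One standard way to turn per-pixel descriptors from two views into correspondences is mutual (reciprocal) nearest-neighbour matching, sketched below. This summary does not specify the matching procedure used in the talk, so treat the choice of technique, the brute-force search, and all names here as assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def _squared_distance(a: Descriptor, b: Descriptor) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _nearest_index(query: Descriptor, candidates: Sequence[Descriptor]) -> int:
    """Index of the candidate closest to the query (brute force)."""
    return min(range(len(candidates)),
               key=lambda j: _squared_distance(query, candidates[j]))


def mutual_nearest_neighbours(
    desc_a: Sequence[Descriptor], desc_b: Sequence[Descriptor]
) -> List[Tuple[int, int]]:
    """Keep (i, j) only if j is i's nearest neighbour in B and i is j's in A."""
    if not desc_a or not desc_b:
        return []
    a_to_b = [_nearest_index(d, desc_b) for d in desc_a]
    b_to_a = [_nearest_index(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


if __name__ == "__main__":
    # Toy 2D descriptors; real descriptors would be higher-dimensional.
    view_a = [(0.0, 0.0), (1.0, 0.0), (0.9, 1.0)]
    view_b = [(1.1, 0.0), (0.1, 0.0), (9.0, 9.0), (0.9, 0.2)]
    print(mutual_nearest_neighbours(view_a, view_b))
```

In the toy data the third descriptor of view A prefers the fourth descriptor of view B, but that descriptor prefers a different partner, so the pair is dropped. The reciprocity check is what filters out such one-sided matches; a practical system would replace the quadratic brute-force search with something faster.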

Key Takeaways

Large-scale, high-quality 3D datasets like Objaverse-XL are crucial for advancing 3D generative models and robotics, bridging the gap between virtual and physical intelligence.
Differentiable robot rendering (Dr. Robot) enables seamless integration of visual foundation models with robot control, allowing for tasks like pose estimation, motion retargeting, and text-guided robot actions.
Automated physical design frameworks, exemplified by designing paper airplanes and Kirigami grippers, leverage surrogate models and iterative optimization to discover novel designs that outperform human-designed counterparts.
Thermal cameras and differentiable rendering of reflections offer a novel approach to 3D reconstruction of occluded humans and objects, by effectively turning reflective surfaces into ‘mirrors’ in the infrared spectrum.
The field of 3D generative models has rapidly advanced through four distinct generations, each offering different trade-offs in terms of data requirements, generation speed, and output quality.
Efficient 3D representations, such as VecSet, are critical for achieving high-quality geometry and enabling scalable native 3D generation, especially for applications like games and 3D content creation.
Transformer-based architectures, exemplified by DiT, are proving highly effective in training diffusion models with these efficient 3D representations, allowing for both unconditional and conditional generation.
The VecSet representation is versatile, supporting various extensions including 4D dynamic data, integration with geometric and physical priors, and texturizing, demonstrating its broad applicability across different 3D generation tasks.
Functional Diffusion offers a promising one-stage, end-to-end approach for generating infinite-dimensional functions directly, which can be applied to create complex 3D shapes like Signed Distance Functions (SDFs) by solving partial differential equations.
3D content creation faces challenges due to data scarcity and high complexity compared to 2D, but generative AI offers promising solutions.
Differentiable iso-surfacing and transformer-based mesh generation (MeshGPT) enable efficient and high-quality 3D content creation, including text-to-3D applications.
High-fidelity 3D datasets like ScanNet++ are crucial for advancing 3D understanding, providing detailed geometry, appearance, and semantic annotations.
Adaptive Shells offer a flexible and efficient 3D representation that combines volume and surface rendering, adapting to local geometric complexity for improved rendering quality.
High-quality, aligned 3D data is crucial for training effective 3D generative foundation models, but its scarcity necessitates innovative approaches like leveraging large-scale 2D data.
Text-to-3D generation benefits significantly from large-scale 2D image-text datasets, as 2D diffusion models can provide strong priors for 3D generation.
Self-supervised learning methods like CroCo and MASt3R offer a promising path towards unified 3D vision models, capable of addressing multiple tasks simultaneously by learning robust 3D representations.
The development of large-scale, diverse 3D datasets, both synthetic (e.g., Objaverse) and real (e.g., MVImgNet, MVHumanNet), is fundamental for advancing 3D generative and reconstruction tasks.
CroCo is a self-supervised, multi-view 3D vision model capable of reconstructing scenes and estimating relative poses from masked inputs.
DUSt3R and MASt3R unify various 3D vision tasks (depth, pose, reconstruction) into a single framework by outputting dense pointmaps, eliminating the need for explicit camera parameters.
These models achieve state-of-the-art performance across multiple downstream tasks, including map-free relocalization and multi-view stereo, demonstrating robustness to unconstrained input conditions.
The pointmap representation allows for seamless derivation of various 3D vision outputs, and the architecture is simple, efficient, and scalable.
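The takeaway on differentiable iso-surfacing can be illustrated with the one step that marching-style extractors typically share: placing a surface vertex on a grid edge by linearly interpolating the signed distance values at its endpoints. The sketch below assumes that formulation; it is not the speakers' implementation, and the function names are hypothetical. Because the vertex position is a smooth function of the two SDF values, gradients from a rendering loss can flow back to them.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def sphere_sdf(p: Vec3, radius: float = 1.0) -> float:
    """Signed distance to a sphere at the origin (negative inside)."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius


def edge_zero_crossing(pa: Vec3, pb: Vec3, sa: float, sb: float) -> Optional[Vec3]:
    """Surface point on edge (pa, pb) by linear interpolation of SDF values.

    Returns None when the edge has no sign change (or both values are equal).
    """
    if sa * sb > 0.0 or sa == sb:
        return None
    t = sa / (sa - sb)  # fraction of the way from pa to pb where the SDF hits 0
    return (pa[0] + t * (pb[0] - pa[0]),
            pa[1] + t * (pb[1] - pa[1]),
            pa[2] + t * (pb[2] - pa[2]))


def crossing_param_grad(sa: float, sb: float) -> Tuple[float, float]:
    """Analytic derivatives (dt/dsa, dt/dsb) of t = sa / (sa - sb)."""
    denom = (sa - sb) ** 2
    return (-sb / denom, sa / denom)


if __name__ == "__main__":
    a, b = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
    sa, sb = sphere_sdf(a), sphere_sdf(b)
    print(edge_zero_crossing(a, b, sa, sb))
    print(crossing_param_grad(sa, sb))
```

A full extractor repeats this per edge and then connects the vertices into triangles from a lookup table; only the vertex placement is shown here because that is where the differentiability comes from.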

Methods / Models / Datasets Mentioned

360-1M dataset
3D Gaussians
3D-R2N2
3D-VAE-GAN
3DILG
3DShape2Vecset
Appearance Deformation
ARKitScenes
AtlasNet
BlendedMVS
BundleFusion
CAT3D
CHORES
CLAY
CLIP
CO3Dv2
COLMAP
CRM
Co-Tracker
Conditional Diffusion Model
Consistent123
CraftsMan
Cricut Maker 3
CroCo
DDPM
DIB-R
DINOv2
DiT
DMTet
DTU
DUSt3R
DALL-E 2
DatasetGAN
DeepSDF
DefGrid
DefTet
Depth Anything
Diffusion Policy
Dr. Robot
DreamCraft3D
DreamFusion
DreamGaussian
Dreamitate
ECG3D
EG3D
Eval3D
Fantasia3D
FlexiCubes
Flows
Forward Kinematics
Functional Diffusion
GAN
GANVerse3D
GEM3D
GET3D
GPLD3D
GPT-4o
GPT-4
GRAF
GRM
Gaussian Splatting
GaussianDreamer
GenZI
Habitat simulator
HiFA
HoloDeck
IM-Net
ImageDream
ImageNet
Imagen
Implicit LBS
InLoc
InfoNCE loss
Instant-NGP
Instant3D
InstantMesh
InstantSplat
LAION
LAS-Diffusion
LASA
LATTE3D
LGM
LION
LLaVA
LRM
Large Language Models
Latent-NeRF
Learning Implicit Fields
Llama
LucidDreamer
MAE
MASt3R
MCC
MNIST
MVD2
MVDiff
MVDiffusion
MVDream
MVHumanNet
MVImgNet
Magic123
Magic3D
Make-Your-3D
MegaDepth
MeshGPT
Meta 3D Gen
Metric3D
Michelangelo
Midjourney V5
Motion2VecSets
Muse (autoregressive)
NeRF
Nerfacto
nvdiffrec
OBJECT 3DIT
ODIN
Objaverse
Objaverse-XL
Occupancy Networks (OccNet)
One-2-3-45
PanoHead
Plenoxels
Point-E
PointCloud GAN
PointSetGen
ProlificDreamer
RealEstate10K
RfD-Net
RichDreamer
Rodin
SAM
SCoDA
SPAD
SPRING Dataset & Benchmark
SV3D
ScanNet
ScanNet++
ScanNet200
Score Jacobian Chain
Shap-E
ShapeNet
SphereHead
Stable Diffusion
Stable Video Diffusion
StableDreamer
Static Thing 3D
StyleGAN
SyncDreamer
Text2Tex
Text2Video Model
TextMesh
Total3D
UniDepth
VAE
VGGSfM
VQVAE
Variational Score Distillation
Visual MPC
Waymo
Wonder3D
XCube
Zero123++
Zero-1-to-3
Zero123
Zero123-XL

Topics

3D Foundation Models · 3D Generative Models · 3D Reconstruction · 3D complexity · 3D content creation · 3D representations · 3D vision · Automated Design · Dense reconstruction · Depth estimation · Differentiable Rendering · Diffusion Models · Efficient 3D Representation · Geometry Quality · Multi-view stereo · Multiview Diffusion · Native 3D Generation · Physical Intelligence · Pointmaps · Robotics · Self-supervised learning · Synthetic Data · Transformer Architectures · Transformers · Visual localization · Visuomotor Policy · adaptive shells · cross-view completion · data preparation · data scarcity · differentiable iso-surfacing · foundation models · generative AI · high-fidelity 3D data · indoor scene reconstruction · inverse graphics · masked modeling · mesh generation · neural fields · novel view synthesis · semantic understanding · surface rendering · text-to-3D · volume rendering
