3D Foundation Models for Physical Intelligence

Event: CVPR 2024 · Duration: 376 min

Abstract

The first segment introduces 3D foundation models as a bridge between the virtual and physical worlds. It covers large-scale 3D datasets such as Objaverse-XL, which provides 10M+ high-quality 3D assets, and the Zero123 model for novel view synthesis from 2D images. The talk then turns to applications in physical interaction: Dr. Robot for differentiable robot rendering and control, and Dreamitate for visuomotor policy learning via video generation. It also covers physical design, demonstrating an automated system for designing paper tools such as airplanes and grippers, and physical reconstruction, using thermal cameras and differentiable rendering to reconstruct occluded humans.

The second segment surveys the evolution of 3D generative models, grouping them into four generations by underlying technique and weighing the strengths and weaknesses of each, from differentiable rendering to multiview diffusion and native 3D generation. The speakers then introduce VecSet, an efficient transformer-based 3D representation developed by their team, and show its use in diffusion models along with extensions by other research groups. The presentation closes with the open challenges and future directions for high-quality, efficient, and robust 3D content generation.

The third segment features two talks on 3D representations and understanding. Jun Gao discusses the challenges of 3D content creation with AI, introducing methods such as differentiable iso-surfacing and MeshGPT for efficient 3D generation and text-to-3D. Angela Dai then argues for the importance of high-fidelity 3D data for understanding, presenting ScanNet++ as a benchmark dataset and ‘Adaptive Shells’ as a hybrid representation for efficient rendering. Both speakers highlight the benefits of combining graphics-based modeling with modern AI techniques to overcome 3D data scarcity and complexity.

The fourth segment features two talks on 3D vision. The first, “How to Prepare Data for 3D Generative Foundation Models?” by Hao Tan, examines data preparation for 3D generative models, stressing the need for high-quality, aligned 3D data and the potential of large-scale 2D data for text-to-3D generation. The second, “From CroCo to MASt3R: A paradigm change in 3D vision?” by Jerome Revaud, introduces the self-supervised learning methods CroCo and MASt3R, which aim to unify various 3D vision tasks through cross-view completion and masked modeling, drawing inspiration from NLP foundation models.

The final segment introduces CroCo, a self-supervised learning method for 3D vision that inherently handles multi-view data and can reconstruct scenes from masked inputs. It then presents DUSt3R, a model built on CroCo that performs dense unconstrained stereo 3D reconstruction by outputting pointmaps. These pointmaps can be used for downstream tasks such as camera calibration, depth estimation, and pose estimation without requiring explicit pose or intrinsic camera parameters. Finally, it introduces MASt3R, an extension of DUSt3R that adds metric-scale training and local features for improved matching, achieving state-of-the-art results in map-free relocalization and multi-view stereo. The speaker emphasizes that these models represent a paradigm shift towards unifying 3D vision tasks within a single, robust, and fast framework.
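To make the cross-view completion idea mentioned above more concrete, the following is a minimal sketch of how the inputs for such a pretext task might be prepared: one view is split into patches and mostly masked, while a second view of the same scene is kept intact as a reference. The function names, the patch size, and the 0.9 mask ratio are illustrative assumptions, not values taken from the talk, and the network that predicts the masked patches is omitted.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into rows of flattened patch pixels."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def cross_view_inputs(view1, view2, patch=16, mask_ratio=0.9, seed=0):
    """Hypothetical helper: mask most patches of view1, keep view2 whole.

    A model would be trained to predict `targets` from the few visible
    patches of view1 plus all patches of view2.
    """
    rng = np.random.default_rng(seed)
    p1, p2 = patchify(view1, patch), patchify(view2, patch)
    n = p1.shape[0]
    n_masked = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    visible_idx = np.sort(perm[n_masked:])
    return {
        "visible_patches": p1[visible_idx],  # what the encoder sees of view1
        "visible_idx": visible_idx,
        "masked_idx": masked_idx,
        "targets": p1[masked_idx],           # pixels to be reconstructed
        "reference_patches": p2,             # unmasked second view
    }

# Usage with random stand-in images (real inputs would be two photos
# of the same scene taken from different viewpoints).
view1 = np.random.default_rng(1).random((64, 64, 3))
view2 = np.random.default_rng(2).random((64, 64, 3))
batch = cross_view_inputs(view1, view2)
print(batch["visible_patches"].shape, batch["targets"].shape)
```

Because the masked content can only be recovered by relating it to the second view, a task of this shape pushes a model towards reasoning about scene geometry and relative viewpoint rather than single-image texture statistics.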

Speakers

  • Ruoshi Liu — Columbia University
  • Oscar Michel
  • Matt Wallingford
  • Matt Deitke
  • Shivam Duggal
  • Yushi Hu
  • Peter Wonka — King Abdullah University of Science and Technology
  • Biao Zhang — King Abdullah University of Science and Technology
  • Jun Gao — University of Toronto, NVIDIA, Vector Institute
  • Angela Dai — Technical University of Munich
  • Hao Tan — Adobe Research
  • Jerome Revaud — NAVER Labs Europe
  • Vincent Leroy — NAVER Labs Europe

Talks (9)

  • 00:00:00 — Ruoshi Liu: 3D Foundation Models for Physical Intelligence
    • This talk explores how 3D foundation models, particularly Objaverse-XL and Zero123, can bridge the gap between 2D and 3D data, enabling advancements in physical interaction, design, and reconstruction for robotics and other applications, while also introducing new evaluation metrics and differentiable rendering techniques.
  • 01:15:17 — Peter Wonka: Towards Training Large 3D Generative Model
    • This part of the talk provides an overview of the evolution of 3D generative models across four generations, discussing their respective strengths and weaknesses, and emphasizing the importance of efficient 3D representations for high-quality geometry.
  • 01:27:17 — Biao Zhang: Towards Finding Efficient Representation for 3D Data
    • This part introduces VecSet, a transformer-based efficient 3D representation, detailing its architecture and demonstrating its application in diffusion models, alongside various extensions developed by other researchers.
  • 02:30:45 — Jun Gao: 3D Representations for 3D Content Creation
    • Jun Gao discusses challenges in 3D content creation using AI, introducing differentiable iso-surfacing and MeshGPT for efficient 3D generation, text-to-3D, and adaptive shells for radiance field rendering.
  • 02:36:35 — Angela Dai: From Quantity to Quality for 3D Understanding
    • Angela Dai presents ScanNet++ as a high-fidelity 3D indoor scene dataset and introduces ‘Adaptive Shells’ as a hybrid representation for efficient and high-quality 3D rendering, emphasizing the shift from quantity to quality in 3D data for understanding.
  • 03:45:52 — Hao Tan: How to Prepare Data for 3D Generative Foundation Models?
    • This talk discusses the challenges and strategies for preparing data for 3D generative foundation models, emphasizing the importance of high-quality, aligned 3D data and leveraging large-scale 2D data for text-to-3D generation.
  • 03:51:42 — Jerome Revaud: From CroCo to MASt3R: A paradigm change in 3D vision?
    • This talk introduces CroCo and MASt3R, self-supervised learning methods for 3D vision that aim to unify various 3D tasks by leveraging cross-view completion and masked modeling, inspired by MAE and NLP foundation models.
  • 05:01:10 — Vincent Leroy: CroCo: Self-supervised learning with Cross-View Completion
    • Introduces CroCo, a self-supervised learning method for 3D vision that inherently handles multi-view data and can reconstruct scenes from masked inputs, demonstrating its ability to understand scene geometry and relative camera poses.
  • 05:07:30 — Vincent Leroy: MASt3R: Matching And Stereo 3D Reconstruction
    • Introduces MASt3R, an extension of DUSt3R that incorporates metric-scale training and local features for improved matching, achieving state-of-the-art results in map-free relocalization and multi-view stereo.
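As a rough illustration of what a transformer-based set representation such as VecSet could look like, the sketch below encodes a point cloud into a small set of latent vectors with a single cross-attention step. It assumes the latent set is produced by a fixed number of query vectors attending over per-point features; the single-head layout, the dimensions, and the randomly initialized weights are placeholders for illustration and are not the speakers' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_to_latent_set(points, queries, w_embed, w_q, w_k, w_v):
    """Hypothetical encoder: M query vectors attend over N point features.

    points:  (N, 3) xyz coordinates
    queries: (M, C) query vectors (learned in a real model)
    returns: (M, C) latent set summarizing the shape
    """
    feats = points @ w_embed                 # (N, C) per-point features
    q = queries @ w_q                        # (M, C)
    k = feats @ w_k                          # (N, C)
    v = feats @ w_v                          # (N, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (M, N)
    return queries + attn @ v                # residual cross-attention

rng = np.random.default_rng(0)
N, M, C = 2048, 64, 32                       # illustrative sizes
points = rng.standard_normal((N, 3))
queries = rng.standard_normal((M, C)) * 0.02
w_embed = rng.standard_normal((3, C)) * 0.1
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

latents = encode_to_latent_set(points, queries, w_embed, w_q, w_k, w_v)
print(latents.shape)  # M latent vectors of width C
```

Two properties make a design like this attractive for generation: the output size depends only on the number of queries, not on the number of input points, and the result does not change if the input points are reordered. A diffusion model can then operate on the fixed-size latent set instead of a dense voxel grid.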

Key Takeaways

  • Large-scale, high-quality 3D datasets like Objaverse-XL are crucial for advancing 3D generative models and robotics, bridging the gap between virtual and physical intelligence.
  • Differentiable robot rendering (Dr. Robot) enables seamless integration of visual foundation models with robot control, allowing for tasks like pose estimation, motion retargeting, and text-guided robot actions.
  • Automated physical design frameworks, exemplified by designing paper airplanes and Kirigami grippers, leverage surrogate models and iterative optimization to discover novel designs that outperform human-designed counterparts.
  • Thermal cameras and differentiable rendering of reflections offer a novel approach to 3D reconstruction of occluded humans and objects, by effectively turning reflective surfaces into ‘mirrors’ in the infrared spectrum.
  • The field of 3D generative models has rapidly advanced through four distinct generations, each offering different trade-offs in terms of data requirements, generation speed, and output quality.
  • Efficient 3D representations, such as VecSet, are critical for achieving high-quality geometry and enabling scalable native 3D generation, especially for applications like games and 3D content creation.
  • Transformer-based architectures, exemplified by DiT, are proving highly effective in training diffusion models with these efficient 3D representations, allowing for both unconditional and conditional generation.
  • The VecSet representation is versatile, supporting various extensions including 4D dynamic data, integration with geometric and physical priors, and texturizing, demonstrating its broad applicability across different 3D generation tasks.
  • Functional Diffusion offers a promising one-stage, end-to-end approach for generating infinite-dimensional functions directly, which can be applied to create complex 3D shapes like Signed Distance Functions (SDFs) by solving partial differential equations.
  • 3D content creation faces challenges due to data scarcity and high complexity compared to 2D, but generative AI offers promising solutions.
  • Differentiable iso-surfacing and transformer-based mesh generation (MeshGPT) enable efficient and high-quality 3D content creation, including text-to-3D applications.
  • High-fidelity 3D datasets like ScanNet++ are crucial for advancing 3D understanding, providing detailed geometry, appearance, and semantic annotations.
  • Adaptive Shells offer a flexible and efficient 3D representation that combines volume and surface rendering, adapting to local geometric complexity for improved rendering quality.
  • High-quality, aligned 3D data is crucial for training effective 3D generative foundation models, but its scarcity necessitates innovative approaches like leveraging large-scale 2D data.
  • Text-to-3D generation benefits significantly from large-scale 2D image-text datasets, as 2D diffusion models can provide strong priors for 3D interaction generation.
  • Self-supervised learning methods like CroCo and MASt3R offer a promising path towards unified 3D vision models, capable of addressing multiple tasks simultaneously by learning robust 3D representations.
  • The development of large-scale, diverse 3D datasets, both synthetic (e.g., Objaverse) and real (e.g., MVImgNet, MVHumanNet), is fundamental for advancing 3D generative and reconstruction tasks.
  • CroCo is a self-supervised, multi-view 3D vision model capable of reconstructing scenes and estimating relative poses from masked inputs.
  • DUSt3R and MASt3R unify various 3D vision tasks (depth, pose, reconstruction) into a single framework by outputting dense pointmaps, eliminating the need for explicit camera parameters.
  • These models achieve state-of-the-art performance across multiple downstream tasks, including map-free relocalization and multi-view stereo, demonstrating robustness to unconstrained input conditions.
  • The pointmap representation allows for seamless derivation of various 3D vision outputs, and the architecture is simple, efficient, and scalable.
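The claim that pointmaps yield other 3D outputs directly can be illustrated with a small worked example. Assuming a pointmap stores, for every pixel, the 3D point it observes in that camera's own frame, depth is simply the z channel, and a focal length can be estimated by fitting the pinhole projection to the points. The sketch below uses a plain least-squares fit and assumes square pixels with the principal point at the image center; both are simplifications chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def depth_from_pointmap(pointmap):
    """Depth is the z-coordinate of each 3D point in the camera frame."""
    return pointmap[..., 2]

def focal_from_pointmap(pointmap):
    """Least-squares focal length from an (H, W, 3) camera-frame pointmap.

    Pinhole model with centered principal point (an assumption):
        u_c = f * x / z,   v_c = f * y / z
    Minimizing the squared reprojection error over f has the closed form
        f = sum(u_c * x/z + v_c * y/z) / sum((x/z)^2 + (y/z)^2).
    """
    h, w, _ = pointmap.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    u = u - (w - 1) / 2.0                    # pixel coords relative to center
    v = v - (h - 1) / 2.0
    z = pointmap[..., 2]
    valid = z > 1e-6                         # ignore points at or behind camera
    a = pointmap[..., 0][valid] / z[valid]
    b = pointmap[..., 1][valid] / z[valid]
    num = (u[valid] * a + v[valid] * b).sum()
    den = (a * a + b * b).sum()
    return num / den

# Synthetic check: build a pointmap from a known focal length and depth map,
# then recover both from the pointmap alone.
h, w, f_true = 48, 64, 500.0
rng = np.random.default_rng(0)
depth_true = rng.uniform(1.0, 3.0, size=(h, w))
v, u = np.mgrid[0:h, 0:w].astype(np.float64)
x = (u - (w - 1) / 2.0) * depth_true / f_true
y = (v - (h - 1) / 2.0) * depth_true / f_true
pointmap = np.stack([x, y, depth_true], axis=-1)

f_est = focal_from_pointmap(pointmap)
depth_est = depth_from_pointmap(pointmap)
print(f_est)  # matches f_true up to floating-point error on this clean data
```

On predicted rather than synthetic pointmaps, a robust estimator would be a safer choice than plain least squares, since outlier points can bias the fit.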

Methods / Models / Datasets Mentioned

  • 360-1M dataset
  • 3D Gaussians
  • 3D-R2N2
  • 3D-VAE-GAN
  • 3DILG
  • 3DShape2Vecset
  • Appearance Deformation
  • ARKitScenes
  • AtlasNet
  • BlendedMVS
  • BundleFusion
  • CAT3D
  • CHORES
  • CLAY
  • CLIP
  • CO3Dv2
  • COLMAP
  • CRM
  • Co-Tracker
  • Conditional Diffusion Model
  • Consistent123
  • Craftsman
  • Cricut Maker 3
  • CroCo
  • DDPM
  • DIBR
  • DINOv2
  • DiT
  • DMTet
  • DTU
  • DUSt3R
  • DALL-E 2
  • DatasetGAN
  • DeepSDF
  • DefGrid
  • DefTet
  • Depth Anything
  • Diffusion Policy
  • Dr. Robot
  • DreamCraft3D
  • DreamFusion
  • DreamGaussian
  • Dreamitate
  • ECG3D
  • EG3D
  • Eval3D
  • Fantasia3D
  • FlexiCubes
  • Flows
  • Forward Kinematics
  • Functional Diffusion
  • GAN
  • GANVerse3D
  • GEM3D
  • GET3D
  • GPLD3D
  • GPT-4o
  • GPT-4
  • GRAF
  • GRM
  • Gaussian Splatting
  • GaussianDreamer
  • GenZI
  • Habitat simulator
  • HiFA
  • HoloDeck
  • IM-Net
  • ImageDream
  • ImageNet
  • Imagen
  • Implicit LBS
  • InLoc
  • InfoNCE loss
  • Instant-NGP
  • Instant3D
  • InstantMesh
  • InstantSplat
  • LAION
  • LAS-Diffusion
  • LASA
  • LATTE3D
  • LGM
  • LION
  • LLaVA
  • LRM
  • Large Language Models
  • Latent-NeRF
  • Learning Implicit Fields
  • Llama
  • LucidDreamer
  • MAE
  • MASt3R
  • MCC
  • MNIST
  • MVD2
  • MVDiff
  • MVDiffusion
  • MVDream
  • MVHumanNet
  • MVImgNet
  • Magic123
  • Magic3D
  • Make-Your-3D
  • MegaDepth
  • MeshGPT
  • Meta 3D Gen
  • Metric3D
  • Michelangelo
  • Midjourney V5
  • Motion2VecSets
  • Muse (autoregressive)
  • NeRF
  • Nerfacto
  • nvdiffrec
  • OBJect 3DIT
  • ODIN
  • Objaverse
  • Objaverse-XL
  • Occupancy Networks (OccNet)
  • One-2-3-45
  • PanoHead
  • Plenoxels
  • Point-E
  • PointCloud GAN
  • PointSetGen
  • ProlificDreamer
  • RealEstate10K
  • RfD-Net
  • RichDreamer
  • Rodin
  • SAM
  • SCoDA
  • SPAD
  • SPRING Dataset & Benchmark
  • SV3D
  • ScanNet
  • ScanNet++
  • ScanNet200
  • Score Jacobian Chain
  • Shap-E
  • ShapeNet
  • SphereHead
  • Stable Diffusion
  • Stable Video Diffusion
  • StableDreamer
  • Static Thing 3D
  • StyleGAN
  • SyncDreamer
  • Text2Tex
  • Text2Video Model
  • TextMesh
  • Total3D
  • UniDepth
  • VAE
  • VGGSfM
  • VQVAE
  • Variational Score Distillation
  • Visual MPC
  • Waymo
  • Wonder3D
  • XCube
  • Zero123++
  • Zero-1-to-3
  • Zero123
  • Zero123-XL

Topics

3D Foundation Models · 3D Generative Models · 3D Reconstruction · 3D complexity · 3D content creation · 3D representations · 3D vision · Automated Design · Dense reconstruction · Depth estimation · Differentiable Rendering · Diffusion Models · Efficient 3D Representation · Geometry Quality · Multi-view stereo · Multiview Diffusion · Native 3D Generation · Physical Intelligence · Pointmaps · Robotics · Self-supervised learning · Synthetic Data · Transformer Architectures · Transformers · Visual localization · Visuomotor Policy · adaptive shells · cross-view completion · data preparation · data scarcity · differentiable iso-surfacing · foundation models · generative AI · high-fidelity 3D data · indoor scene reconstruction · inverse graphics · masked modeling · mesh generation · neural fields · novel view synthesis · semantic understanding · surface rendering · text-to-3D · volume rendering

