2nd Workshop for Learning 3D with Multi-View Supervision (3DMV) at CVPR 2024

Event: CVPR 2024 3DMV Workshop · Duration: 317 min

Abstract

The 2nd Workshop for Learning 3D with Multi-View Supervision (3DMV) at CVPR 2024 explores the latest advancements and challenges in 3D and 4D vision. The workshop highlights the increasing importance of spatial indoor computing and the transformative impact of foundation models and diffusion models on 3D content generation and understanding. Discussions cover topics from efficient 3D reconstruction and novel view synthesis to the development of 4D generative models and robust evaluation metrics. A key focus is on leveraging prior knowledge and multi-view supervision to overcome limitations in data, scale, and dynamic scene understanding.

Speakers

Abdullah Hamdi — University of Oxford
Matthias Niessner — TUM
Ziwei Liu — Nanyang Technological University
Deva Ramanan — CMU
David Novotny — Meta AI Research
Andrea Tagliasacchi — Google, Simon Fraser University

Talks (6)

00:00:00 — Abdullah Hamdi: 2nd Workshop for Learning 3D with Multi-View Supervision (3DMV) at CVPR 2024
- Introduction to the 2nd 3DMV workshop, its goals, and the increasing importance of spatial indoor computing and 3D/4D vision, highlighting the role of multi-view supervision and recent advancements in foundation models.
01:50:00 — Matthias Niessner: AI Generation of Immersive 3D Worlds
- Discusses challenges in 3D content creation for realistic virtual worlds, tracing the shift from traditional 3D reconstruction methods (such as KinectFusion and Voxel Hashing) to generative models (such as diffusion models for 3D shapes and scenes), and notes the limitations of implicit surface representations for game-ready assets.
01:08:46 — Ziwei Liu: 3DTopia: Foundation Ecosystem for 3D Generative Models
- Introduces 3DTopia, an ecosystem for 3D generative models, highlighting the progress in learning 3D from multi-view supervision, including efficiency, regularization, diffusion priors, and foundation models. Discusses methods for dataset preparation, hybrid diffusion priors for text-to-3D generation, and autoregressive reconstruction for 4D content.
02:19:00 — Deva Ramanan: Scaling Multiview Reconstruction over Space and Time
- Explores challenges and solutions for scaling multi-view reconstruction to large-scale urban environments and dynamic scenes, emphasizing the importance of graphics representations, foundational losses, robust camera initializations, and multimodal learning for 4D applications.
03:05:15 — David Novotny: From 2D Portraits to 3D Realities: Advancing GAN Inversion for Enhanced Image Synthesis
- Discusses the challenges of 2D-to-3D GAN inversion, focusing on creating lighter, faster, and better-performing methods for generating 3D models from 2D images. Introduces a novel framework using a latent space for 3D image generation, multi-view consistency loss, and practical implications for rapid 2D-to-3D conversion.
03:31:30 — Andrea Tagliasacchi: Make-it-Real: Reliable material inference and modelling
- Explores the challenges of inverse rendering, particularly for complex scenes with varying materials and lighting. Introduces a method that leverages multimodal large language models to provide priors for this ill-posed problem, enabling reliable material inference and modeling.

Key Takeaways

The field of 3D and 4D vision is experiencing rapid growth, driven by advancements in foundation models and diffusion models.
Combining multi-view supervision with implicit representations and neural rendering techniques is crucial for high-quality 3D reconstruction and generation.
Addressing challenges like data scarcity, scale, and dynamic scene understanding requires innovative approaches, including hierarchical representations and leveraging prior knowledge.
The development of robust evaluation metrics and benchmarks, especially for generative models, is essential for advancing the field.
Future directions involve integrating multimodal data, exploring novel camera parameterizations, and developing efficient methods for real-time 4D content creation and manipulation.

Methods / Models / Datasets Mentioned

KinectFusion
Voxel Hashing
AlexNet
ViT
MVCNN
Segment Anything
DreamFusion
Segment Anything in 3D with NeRFs
Magic123
MVDream
DUSt3R
Ego-Exo4D
CNNComplete
3DShape2VecSet
PolyDiff
Block-NeRF
PyNeRF
Mip-NeRF
Zip-NeRF
Mega-NeRF
HybridNeRF
VR-NeRF
Total-Recon
L4GM
DreamGaussian4D (DG4D)
GPT4Eval
RelPose
RelPose++
Cameras as Rays
PoseDiffusion
COLMAP
Implicit-PDF
EG3D
StyleGAN
StyleGAN2
LLaVA
Vicuna
GPT
Point-E
Shap-E
Instant3D
LGM
LATTE3D
HyperDreamer
SAM
PBR

Topics

3D Reconstruction · Multi-view Supervision · 4D Vision · Generative Models · Diffusion Models · Foundation Models · Spatial Indoor Computing · Implicit Representations · Neural Rendering · Camera Pose Estimation

Notes

Open for commentary — connections to other work, critiques, follow-up reading.