2nd Workshop for Learning 3D with Multi-View Supervision (3DMV) at CVPR 2024

Event: CVPR 2024 3DMV Workshop · Duration: 317 min

Abstract

The 2nd Workshop for Learning 3D with Multi-View Supervision (3DMV) at CVPR 2024 explores the latest advancements and challenges in 3D and 4D vision. The workshop highlights the increasing importance of spatial indoor computing and the transformative impact of foundation models and diffusion models on 3D content generation and understanding. Discussions cover topics from efficient 3D reconstruction and novel view synthesis to the development of 4D generative models and robust evaluation metrics. A key focus is on leveraging prior knowledge and multi-view supervision to overcome limitations in data, scale, and dynamic scene understanding.

Speakers

  • Abdullah Hamdi — University of Oxford
  • Matthias Niessner — TUM
  • Ziwei Liu — Nanyang Technological University
  • Deva Ramanan — CMU
  • David Novotny — Meta AI Research
  • Andrea Tagliasacchi — Google, Simon Fraser University

Talks (6)

  • 00:00:00 — Abdullah Hamdi: 2nd Workshop for Learning 3D with Multi-View Supervision (3DMV) at CVPR 2024
    • Introduction to the 2nd 3DMV workshop, its goals, and the increasing importance of spatial indoor computing and 3D/4D vision, highlighting the role of multi-view supervision and recent advancements in foundation models.
  • 01:50:00 — Matthias Niessner: AI Generation of Immersive 3D Worlds
    • Discusses challenges in 3D content creation for realistic virtual worlds, moving from traditional 3D reconstruction methods (like Kinect Fusion, Voxel Hashing) to generative models (like diffusion models for 3D shapes and scenes), and the limitations of implicit surface representations for game-ready assets.
  • 01:08:46 — Ziwei Liu: 3DTopia: Foundation Ecosystem for 3D Generative Models
    • Introduces 3DTopia, an ecosystem for 3D generative models, highlighting the progress in learning 3D from multi-view supervision, including efficiency, regularization, diffusion priors, and foundation models. Discusses methods for dataset preparation, hybrid diffusion priors for text-to-3D generation, and autoregressive reconstruction for 4D content.
  • 02:19:00 — Deva Ramanan: Scaling Multiview Reconstruction over Space and Time
    • Explores challenges and solutions for scaling multi-view reconstruction to large-scale urban environments and dynamic scenes, emphasizing the importance of graphics representations, foundational losses, robust camera initializations, and multimodal learning for 4D applications.
  • 03:05:15 — David Novotny: From 2D Portraits to 3D Realities: Advancing GAN Inversion for Enhanced Image Synthesis
    • Discusses the challenges of 2D-to-3D GAN inversion, focusing on creating lighter, faster, and better-performing methods for generating 3D models from 2D images. Introduces a novel framework using a latent space for 3D image generation, multi-view consistency loss, and practical implications for rapid 2D-to-3D conversion.
  • 03:31:30 — Andrea Tagliasacchi: Make-it-Real: Reliable material inference and modelling
    • Explores the challenges of inverse rendering, particularly for complex scenes with varying materials and lighting. Introduces a method that leverages multimodal large language models to provide priors for this ill-posed problem, enabling reliable material inference and modeling.

Key Takeaways

  • The field of 3D and 4D vision is experiencing rapid growth, driven by advancements in foundation models and diffusion models.
  • Combining multi-view supervision with implicit representations and neural rendering techniques is crucial for high-quality 3D reconstruction and generation.
  • Addressing challenges like data scarcity, scale, and dynamic scene understanding requires innovative approaches, including hierarchical representations and leveraging prior knowledge.
  • The development of robust evaluation metrics and benchmarks, especially for generative models, is essential for advancing the field.
  • Future directions involve integrating multimodal data, exploring novel camera parameterizations, and developing efficient methods for real-time 4D content creation and manipulation.

Methods / Models / Datasets Mentioned

  • Kinect Fusion
  • Voxel Hashing
  • AlexNet
  • ViT
  • MVCNN
  • Segment Anything
  • DreamFusion
  • Segment Anything in 3D with NeRFs
  • Magic123
  • MVDream
  • DUSt3R
  • Ego-Exo4D
  • CNNComplete
  • 3DShape2VecSet
  • PolyDiff
  • Block-NeRF
  • PyNeRF
  • Mip-NeRF
  • Zip-NeRF
  • Mega-NeRF
  • HybridNeRF
  • VR-NeRF
  • Total-Recon
  • L4GM
  • DG4D
  • DreamGaussian4D
  • GPT4Eval
  • RelPose
  • RelPose++
  • Cameras as Rays
  • PoseDiffusion
  • COLMAP
  • Implicit-PDF
  • EG3D
  • StyleGAN
  • StyleGAN2
  • LLaVA
  • Vicuna
  • GPT
  • Point-E
  • Shap-E
  • Instant3D
  • LGM
  • LATTE3D
  • HyperDreamer
  • SAM
  • PBR

Topics

3D Reconstruction · Multi-view Supervision · 4D Vision · Generative Models · Diffusion Models · Foundation Models · Spatial Indoor Computing · Implicit Representations · Neural Rendering · Camera Pose Estimation

