Cross-Modal 3D Scene Understanding

Event: CVPR 2025 · Duration: 32 min

Abstract

CrossOver is a multimodal model for 3D scene understanding that addresses limitations of existing object-level models. Its flexible alignment strategy allows training with partial modality data: the model learns pairwise alignments, with images as the central modality, so not every modality has to be present for every instance. It first aligns modalities at the instance level for objects within a scene, then aggregates these features for scene-level understanding, which enables cross-modal retrieval without explicit semantic segmentation. The approach performs strongly on cross-modal instance matching and scene retrieval, including emergent cross-modal pairings that were not explicitly trained on, and opens avenues for applications in VR/AR and design.
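As a rough illustration of pairwise alignment with images as the central modality, the sketch below sums a contrastive loss over image-to-modality pairs and skips instances where a modality is missing. The symmetric InfoNCE loss, the temperature value, the boolean presence masks, and all function names are assumptions made for this example; they are not taken from the talk.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for row-aligned embeddings a and b, each (N, D).

    Row i of a and row i of b are treated as a positive pair; all other
    rows in the batch act as negatives. (Assumed loss, for illustration.)
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def pairwise_alignment_loss(image_emb, other_embs, masks):
    """Sum image-to-modality losses over the instances where each modality exists.

    image_emb:  (N, D) image embeddings, the anchor modality.
    other_embs: dict of modality name -> (N, D) embeddings.
    masks:      dict of modality name -> (N,) bool array, True where present.
    """
    total = 0.0
    for name, emb in other_embs.items():
        present = masks[name]
        if present.sum() < 2:  # a contrastive loss needs at least one negative
            continue
        total += info_nce(image_emb[present], emb[present])
    return total

# Usage with random stand-in embeddings (hypothetical data).
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 16))
others = {"point_cloud": rng.normal(size=(8, 16)),
          "text": rng.normal(size=(8, 16))}
masks = {"point_cloud": np.ones(8, dtype=bool),
         "text": np.arange(8) < 5}  # text exists for only 5 of 8 instances
loss = pairwise_alignment_loss(image, others, masks)
```

Because each term involves only the image and one other modality, a training sample contributes to whichever pairs it actually has, which is the sense in which complete multimodal data is not required.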

Speakers

  • Sayan Deb Sarkar — ETH Zurich
  • Ondrej Miksik — ETH Zurich
  • Marc Pollefeys — ETH Zurich
  • Dániel Béla Baráth — ETH Zurich

Talks (1)

  • 00:00:00 — Dániel Béla Baráth: Cross-Modal 3D Scene Understanding
    • This talk introduces CrossOver, a novel multimodal model for 3D scene understanding that uses a flexible alignment strategy to connect various modalities at both instance and scene levels, enabling cross-modal retrieval without explicit semantic annotations.

Key Takeaways

  • CrossOver enables flexible cross-modal alignment for 3D scenes by only requiring pairwise modality data during training, rather than complete multimodal data for every instance.
  • The model achieves strong performance in cross-modal instance matching and scene retrieval, outperforming prior methods, including on emergent cross-modal pairings that were not explicitly trained on.
  • It removes the need for explicit semantic segmentation in scene-level understanding by working from raw scene data and aligning it to a unified embedding space.
  • The flexible alignment approach demonstrates robustness to missing modalities and can effectively transfer knowledge across different data types, opening avenues for applications in VR/AR and design.
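Once every modality lives in a unified embedding space, scene retrieval reduces to a nearest-neighbour search. The sketch below pools instance embeddings into one scene vector and ranks database scenes by cosine similarity. Mean pooling, the function names, and the random stand-in data are assumptions for illustration only; the talk summary does not specify how CrossOver aggregates features.

```python
import numpy as np

def scene_embedding(instance_embs):
    """Pool (num_instances, D) instance embeddings into one unit-length scene vector.

    Mean pooling is an assumed, simple aggregation chosen for this sketch.
    """
    pooled = instance_embs.mean(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-8)

def retrieve(query, database, top_k=1):
    """Return indices and cosine scores of the top_k database scenes for a query."""
    q = query / (np.linalg.norm(query) + 1e-8)
    db = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-8)
    scores = db @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Usage: three hypothetical scenes embedded from one modality (say, point clouds),
# queried with a slightly perturbed embedding standing in for another modality.
rng = np.random.default_rng(0)
scenes = [rng.normal(size=(n, 32)) for n in (5, 7, 4)]
database = np.stack([scene_embedding(s) for s in scenes])
query = scene_embedding(scenes[1] + 0.01 * rng.normal(size=scenes[1].shape))
indices, scores = retrieve(query, database, top_k=2)
```

The query and the database never need to come from the same modality, since the comparison happens entirely in the shared space.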

Methods / Models / Datasets Mentioned

  • CLIP
  • DALL-E
  • ImageBind
  • ULIP-2
  • Point-Bind
  • BLIP
  • DINOv2
  • Point-Cloud Masked Autoencoder
  • ScanNet
  • 3RScan

Topics

Multimodal models · 3D scene understanding · Cross-modal alignment · Flexible data alignment · Instance-level understanding · Scene-level understanding · Contrastive learning · Scene retrieval · VR/AR applications

