Cross-Modal 3D Scene Understanding
Event: CVPR 2025 · Duration: 32 min
Abstract
CrossOver is a multimodal model for 3D scene understanding that addresses limitations of existing object-level models. Its flexible alignment strategy allows training on partial modality data: rather than requiring every modality for every instance, it learns pairwise alignments with images as the central modality. The model first aligns instances (objects within a scene) across modalities, then aggregates these features for scene-level understanding, enabling cross-modal retrieval without explicit semantic segmentation. It shows strong performance in cross-modal instance matching and scene retrieval, including in emergent cross-modal scenarios it was not explicitly trained on, and opens avenues for applications in VR/AR and design.
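The pairwise, image-centred alignment described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE-style contrastive loss (the page only says "contrastive learning"), and the names `info_nce`, `pairwise_alignment_loss`, and the `available` masks are hypothetical. The point it shows is that each modality is aligned to images only on the instances where that modality exists, so incomplete data still contributes to training.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchor, other, temperature=0.07):
    """Symmetric InfoNCE loss for two (n, d) batches where row i matches row i.

    Assumption: a standard contrastive loss, used here as a stand-in.
    """
    logits = l2_normalize(anchor) @ l2_normalize(other).T / temperature  # (n, n)

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def pairwise_alignment_loss(image_emb, other_embs, available):
    """Sum image<->modality losses, skipping instances that lack a modality.

    image_emb:  (n, d) image embeddings, the central modality.
    other_embs: dict name -> (n, d) embeddings (rows for missing data are ignored).
    available:  dict name -> (n,) boolean mask of instances that have that modality.
    """
    total = 0.0
    for name, emb in other_embs.items():
        mask = available[name]
        if mask.sum() < 2:
            continue  # a contrastive loss needs at least one negative pair
        total += info_nce(image_emb[mask], emb[mask])
    return total
```

Because every modality is tied to the image embedding, two non-image modalities (for example point clouds and text) end up in the same space without ever being paired directly, which is one plausible reading of the emergent behaviour mentioned above.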
Speakers
- Sayan Deb Sarkar — ETH Zurich
- Ondrej Miksik — ETH Zurich
- Marc Pollefeys — ETH Zurich
- Dániel Béla Baráth — ETH Zurich
Talks (1)
- 00:00:00 — Dániel Béla Baráth: Cross-Modal 3D Scene Understanding
- This talk introduces CrossOver, a novel multimodal model for 3D scene understanding that uses a flexible alignment strategy to connect various modalities at both instance and scene levels, enabling cross-modal retrieval without explicit semantic annotations.
Key Takeaways
- CrossOver enables flexible cross-modal alignment for 3D scenes by only requiring pairwise modality data during training, rather than complete multimodal data for every instance.
- The model achieves strong performance in cross-modal instance matching and scene retrieval, outperforming prior methods, including in emergent cross-modal scenarios it was not explicitly trained on.
- It removes the need for explicit semantic segmentation in scene-level understanding by operating on raw scene data and aligning it to a unified embedding space.
- The flexible alignment approach demonstrates robustness to missing modalities and can effectively transfer knowledge across different data types, opening avenues for applications in VR/AR and design.
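To make the scene-retrieval takeaway concrete, here is a minimal sketch of retrieval in a shared embedding space. It makes two loud assumptions that are not from the talk: scene embeddings are formed by mean-pooling instance embeddings (a simple stand-in for whatever aggregation the model uses), and retrieval ranks database scenes by cosine similarity. The function names `scene_embedding` and `retrieve` are hypothetical.

```python
import numpy as np

def scene_embedding(instance_embs):
    """Pool (n_instances, d) instance embeddings into one unit-length scene vector.

    Assumption: mean pooling, used only for illustration.
    """
    pooled = instance_embs.mean(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-8)

def retrieve(query, database, top_k=1):
    """Return indices and cosine scores of the top_k database scenes for a query.

    query:    (d,) scene embedding from one modality.
    database: (n_scenes, d) scene embeddings from another modality.
    """
    db = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    scores = db @ q                       # cosine similarity per scene
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return order, scores[order]
```

In use, `database` might hold embeddings computed from point clouds and `query` an embedding computed from images of an unseen view; if both encoders map into the same space, the nearest database entry is the retrieved scene.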
Methods / Models / Datasets Mentioned
CLIP · DALL-E · ImageBind · ULIP-2 · Point-Bind · BLIP · DINOv2 · Point-Cloud Masked Autoencoder · ScanNet · 3RScan
Topics
Multimodal models · 3D scene understanding · Cross-modal alignment · Flexible data alignment · Instance-level understanding · Scene-level understanding · Contrastive learning · Scene retrieval · VR/AR applications
Notes
Open for commentary — connections to other work, critiques, follow-up reading.