Self-supervised Learning for Dynamic 3D Scene Understanding

Event: CVPR 2025 · Duration: 26 min

Abstract

This talk traces the evolution of 3D computer vision for autonomous systems, starting from early work on autonomous cars and drones. It covers direct visual SLAM methods such as LSD-SLAM and DSO for robust mapping of static scenes. The presentation then turns to applications of deep neural networks beyond traditional object recognition, including medical imaging and protein structure prediction. A significant portion is dedicated to self-supervised learning for dynamic scene understanding, showcasing AnyCam for reconstructing dynamic environments from casual videos and CUPS for unsupervised panoptic segmentation of traffic scenes. Finally, the talk highlights DeepScenario, a startup focused on 3D traffic monitoring from aerial perspectives, and emphasizes the importance of understanding human behavior in complex dynamic scenes.

Speakers

Daniel Cremers — Chair of Computer Vision and AI, TU Munich, Munich Center for Machine Learning

Talks (1)

00:00:00 — Daniel Cremers: Self-supervised Learning for Dynamic 3D Scene Understanding
- Daniel Cremers presents his lab’s work on 3D computer vision for autonomous systems, covering static world mapping with direct visual SLAM, deep learning for various applications beyond object recognition, and recent advancements in self-supervised dynamic scene understanding from casual videos and aerial data for traffic monitoring.

Key Takeaways

Early work in 3D computer vision for autonomous cars and drones laid the groundwork for understanding dynamic scenes and developing robust localization and mapping systems.
Direct visual SLAM methods offer advantages over keypoint-based approaches by utilizing full image information and photometric consistency, leading to more accurate and robust 3D reconstructions.
Deep learning has revolutionized computer vision, extending beyond object recognition to tasks like medical image reconstruction, optical flow estimation, and even protein structure prediction.
Self-supervised learning on casual videos and aerial data is a promising direction for dynamic 3D scene understanding, enabling the reconstruction of complex environments and the analysis of human behavior in traffic without extensive ground truth labeling.
Understanding human driving behavior and interactions in diverse geographical contexts is crucial for the future of safe and reliable autonomous vehicles, requiring advanced 3D traffic monitoring and generative models.
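
The photometric consistency idea behind direct methods can be illustrated with a small sketch. This is a generic toy example, not the LSD-SLAM or DSO implementation: the function name `photometric_residuals`, its parameters, and the nearest-neighbour sampling are all assumptions made for illustration, and real systems add sub-pixel interpolation, robust norms, and brightness correction. Given a reference image with per-pixel depth, pinhole intrinsics `K`, and a candidate relative pose `(R, t)`, each reference pixel is back-projected to 3D, moved into the target camera, re-projected, and compared by intensity.

```python
import numpy as np

def photometric_residuals(ref_img, tgt_img, depth, K, R, t):
    """Intensity differences after warping reference pixels into the target.

    Hypothetical helper for illustration only.
    ref_img, tgt_img: (H, W) grayscale images as floats.
    depth: (H, W) positive depths for the reference frame.
    K: (3, 3) pinhole intrinsics with last row [0, 0, 1].
    R, t: (3, 3) rotation and (3,) translation taking reference-camera
          points into the target camera.
    Returns (residuals, valid), both of shape (H, W).
    """
    H, W = ref_img.shape
    v, u = np.mgrid[0:H, 0:W]
    # Homogeneous pixel coordinates, one column per pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)

    # Back-project every pixel to a 3D point in the reference camera.
    pts_ref = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Transform into the target camera and project with the same intrinsics.
    proj = K @ (R @ pts_ref + t.reshape(3, 1))
    z = proj[2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)

    # Nearest-neighbour lookup (an assumption; real systems interpolate).
    u_t = np.rint(proj[0] / z_safe).astype(int)
    v_t = np.rint(proj[1] / z_safe).astype(int)
    valid = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    residuals = np.zeros(H * W)
    residuals[valid] = tgt_img[v_t[valid], u_t[valid]] - ref_img.reshape(-1)[valid]
    return residuals.reshape(H, W), valid.reshape(H, W)
```

A direct method would search for the pose (and depths) that minimise the sum of these residuals over all valid pixels, rather than first extracting and matching keypoints.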

Methods / Models / Datasets Mentioned

SIFT
SURF
BRIEF
LSD-SLAM
Direct Sparse Odometry (DSO)
DM-VIO (Delayed Marginalization Visual-Inertial Odometry)
AlexNet
ZF
VGG
GoogLeNet
ResNet
GoogLeNet-v4
Q-Space Deep Learning
FlowNet
Flow, Stereo & Scene Flow
Protein Contact Prediction from Amino Acid Co-Evolution
AlphaFold
DVSO
D3VO
MonoRec (Monocular Dense Reconstruction)
AnyCam
CUPS (Scene-Centric Unsupervised Panoptic Segmentation)
U2Seg
DeepScenario

Topics

Self-supervised learning · Dynamic 3D scene understanding · Autonomous systems · Visual SLAM · Deep learning · Monocular depth estimation · Panoptic segmentation · Traffic monitoring · Protein prediction · Rolling shutter
