The 3rd Monocular Depth Estimation Challenge

Event: CVPR 2024 Workshop - 3rd Monocular Depth Estimation Challenge (MDEC) · Duration: 245 min

Abstract

This workshop presents the 3rd Monocular Depth Estimation Challenge (MDEC) and brings together researchers to discuss recent advances and open problems in the field. Talks trace the evolution of monocular depth estimation from early self-supervised methods to modern foundation models and explore applications in augmented reality and automated driving. Key discussions include strategies for achieving scale-aware metric depth, the use of multi-frame information, and novel architectures such as Depth Field Networks. The challenge results highlight the impact of high-quality data, effective fine-tuning strategies, and the integration of geometric priors. The workshop also examines robust evaluation metrics and implicit learning for scene representation, with an emphasis on zero-shot generalization and real-time performance.
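The interest in scale-aware metric depth comes from a basic ambiguity: a network trained only on monocular video can typically recover depth up to an unknown global scale. A common workaround at evaluation time is median scaling, which multiplies the prediction by the ratio of the ground-truth median to the predicted median. The sketch below is a minimal illustration of that idea, not code from any of the talks; the function name `median_scale` and the flat-list input format are assumptions made for the example.

```python
from statistics import median


def median_scale(pred, gt):
    """Rescale relative depth so its median matches the ground-truth median.

    pred, gt: flat sequences of per-pixel depths (hypothetical format).
    Pixels with non-positive ground truth or prediction are treated as
    invalid and ignored when estimating the scale factor.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not valid:
        raise ValueError("no valid pixels to align on")
    scale = median(g for _, g in valid) / median(p for p, _ in valid)
    # The scale is applied to every prediction, including pixels
    # that had no ground truth.
    return [p * scale for p in pred], scale


if __name__ == "__main__":
    relative = [1.0, 2.0, 4.0, 8.0]      # unitless network output
    metric_gt = [2.5, 5.0, 10.0, 0.0]    # metres; 0.0 marks a missing value
    scaled, s = median_scale(relative, metric_gt)
    print(s, scaled)
```

Median scaling needs ground truth at test time, which is why methods that predict metric depth directly are attractive for deployment in AR or driving.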

Speakers

Ripudaman Singh Arora — Blue River Technology
Matteo Poggi — University of Bologna
Vítor Guizilini — Toyota Research Institute
Eric Brachmann — Niantic
Mykola Lavreniuk — Team EVP++
Guangyuan Zhou — PICO-MR
Aradhye Agarwal — Indian Institute of Technology Delhi
James Elder — York University

Talks (8)

00:00:00 — Ripudaman Singh Arora: The 3rd Monocular Depth Estimation Challenge (Introduction)
- Introduction to the workshop, its goals, the organizing committee, and the importance of monocular depth estimation in robotics and computer vision.
02:09:00 — Matteo Poggi: Monocular Depth Estimation: Are We Done?
- A comprehensive overview of the evolution of monocular depth estimation, from early self-supervised methods to modern foundation models, highlighting current limitations and future challenges.
05:52:00 — Vítor Guizilini: An ODE to MonODEpth
- Discusses advancements in monocular depth estimation, including self-supervised methods, scale-aware metric depth, multi-frame depth, and depth field networks, highlighting challenges and future directions.
09:20:00 — Eric Brachmann: Metric Depth for Instant AR
- Explores the use of metric depth estimation for instant augmented reality (AR), addressing challenges like scale ambiguity, dynamic objects, and the need for robust pose estimation, introducing a new dataset and a workshop challenge.
10:45:00 — Mykola Lavreniuk: 3rd Monocular Depth Estimation Challenge @ CVPR24 (EVP++ solution)
- Presents the EVP++ solution for the Monocular Depth Estimation Challenge, utilizing diffusion-based models, automatic image captioning with BLIP-2, and a novel inverse multi-attentive feature alignment module for improved accuracy.
10:53:00 — Guangyuan Zhou: High Quality Data makes great progress
- Introduces the PICO-MR method, which leverages high-quality data selection and a refined Depth-Anything model for improved monocular depth estimation, addressing challenges in diverse scenes and camera differences.
11:00:00 — Aradhye Agarwal: The 3rd Monocular Depth Estimation Challenge (visioniitd solution)
- Presents the visioniitd solution for the Monocular Depth Estimation Challenge, which utilizes a ViT-based architecture with MLPs for scene embedding, aligning to CLIP space without intermediate text, and achieving strong zero-shot generalization.
11:06:00 — James Elder: Ground Theory of Metric Monodepth
- Proposes a ground theory approach to monocular depth estimation, leveraging semantic segmentation and geometric priors to infer depth without explicit learning, demonstrating surprisingly good metric depth maps.
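To give a sense of how a geometric prior can yield metric depth without learning it, consider the textbook flat-ground case: for a pinhole camera at a known height above a planar ground, with its optical axis parallel to that ground, a ground pixel at image row v has depth Z = f_y * h / (v - c_y). The sketch below implements only this relation. It is an illustrative assumption, not the method presented in the last talk, and the function name, parameters, and numeric values are hypothetical.

```python
def ground_depth(v, fy, cy, cam_height):
    """Metric depth of a ground pixel under a flat-ground, zero-pitch assumption.

    v          : image row of the pixel (rows increase downwards)
    fy         : vertical focal length in pixels
    cy         : principal-point row in pixels (the horizon row at zero pitch)
    cam_height : camera height above the ground plane, in metres

    Returns None for rows at or above the horizon, where the viewing ray
    never intersects the ground plane.
    """
    dv = v - cy
    if dv <= 0:
        return None
    return fy * cam_height / dv


if __name__ == "__main__":
    # Hypothetical camera: 500 px focal length, horizon at row 240, 1.5 m high.
    for row in (240, 340, 490):
        print(row, ground_depth(row, fy=500.0, cy=240.0, cam_height=1.5))
```

In practice a semantic segmentation mask would indicate which pixels belong to the ground, and the known camera height is what fixes the metric scale.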

Key Takeaways

Foundation models such as Depth Anything, trained on vast datasets, significantly advance monocular depth estimation, but challenges remain with non-Lambertian surfaces and extreme viewpoints.
Integrating geometric priors and semantic information, even without direct depth supervision, can lead to surprisingly accurate metric depth maps and improved generalization across diverse scenes and camera models.
Novel architectures leveraging transformers, diffusion models, and implicit scene representations are crucial for achieving scale-aware metric depth, multi-frame consistency, and robust pose estimation in complex real-world scenarios like AR and autonomous driving.
The development of high-quality, diverse datasets and robust evaluation metrics, including 3D point cloud-based metrics and depth boundary metrics, is essential for fair benchmarking and driving progress in the field.
End-to-end differentiable pipelines that integrate feature extraction, matching, and pose optimization, combined with multi-stage curriculum learning and self-calibration techniques, show promise for overcoming limitations in traditional methods and achieving real-time performance.
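For readers unfamiliar with depth benchmarking, the sketch below computes three widely used image-based metrics: absolute relative error (AbsRel), root mean squared error (RMSE), and the δ < 1.25 accuracy. This summary does not give the challenge's exact metric definitions, so the code shows the standard formulations rather than the official evaluation; the point-cloud and depth-boundary metrics mentioned above are not covered. The function name and flat-list input format are assumptions.

```python
import math


def depth_metrics(pred, gt):
    """Standard image-based depth metrics over valid pixels.

    pred, gt: flat sequences of per-pixel depths in the same units.
    Pixels with non-positive ground truth or prediction are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)
    # Mean of |pred - gt| / gt.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Square root of the mean squared error.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25.
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


if __name__ == "__main__":
    prediction = [2.0, 4.0, 9.0, 3.0]
    ground_truth = [2.0, 5.0, 10.0, 0.0]  # 0.0 marks a missing measurement
    print(depth_metrics(prediction, ground_truth))
```

AbsRel and δ < 1.25 are ratio-based and so weight near and far errors similarly, while RMSE is dominated by errors at large depths; reporting several metrics together gives a more balanced picture.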

Methods / Models / Datasets Mentioned

SfMLearner
Depth Anything
Depth Anything v2
Marigold
ChronoDepth
Depth4ToM
PackNet
Metric Velocity Supervision
Dense Depth for Automated Driving (DDAD)
Pseudo-Lidar
DD3D
DD3D v2
Self-Supervised Scene Flow
Tactile Sensors
Depth Field Networks
Equivariant Perceiver IO
DeLiRa
Scale-Aware Metric Depth
SuperGlue
DPT (Dense Prediction Transformer)
MicKey
LoFTR
RANSAC
Kabsch
ACE (Accelerated Coordinate Encoding)
ACE Zero
ACE Relocalizer
ACE Zero Relocalizer
EVP++
ZoeDepth
VPD
BLIP-2
CLIP
IMAFA (Inverse Multi-Attentive Feature Alignment)
PICO-MR
BEiT384-L
Metric3D
MiDaS
LeReS
Argoverse
DSEC
NYUv2
KITTI
Cityscapes
DIODE
SGM+Lidar
Kinect
City_Tartan
City_KITTI
ViT
MLPs
Ground Theory
InternImage
ADE20K
MCMLSD
FR
PixelFormer
MIM
AiT
Elder Lab (Segmentation net + statistical models)
ReadingLS (SwiftDepth)
FRDC-SH
HIT-AIIA
3DCreators
RGA-Robots
SuperPoint

Topics

Monocular Depth Estimation · Self-Supervised Learning · Foundation Models · Scale-Aware Metric Depth · Multi-Frame Depth Estimation · Depth Field Networks · Augmented Reality (AR) · Automated Driving · Semantic Segmentation · Geometric Priors · Zero-Shot Generalization · Benchmarking & Evaluation Metrics · Diffusion Models · Camera Self-Calibration
