The 3rd Monocular Depth Estimation Challenge

Event: CVPR 2024 Workshop - 3rd Monocular Depth Estimation Challenge (MDEC) · Duration: 245 min

Abstract

This workshop presents the 3rd Monocular Depth Estimation Challenge, bringing together leading researchers to discuss the latest advancements and challenges in the field. Talks cover the historical evolution of monocular depth estimation, from early self-supervised methods to modern foundational models, and explore their applications in areas like augmented reality and automated driving. Key discussions include strategies for achieving scale-aware metric depth, leveraging multi-frame information, and the development of novel architectures like Depth Field Networks. The challenge results highlight the impact of high-quality data, effective fine-tuning strategies, and the integration of geometric priors. The workshop also delves into the importance of robust evaluation metrics and the potential of implicit learning for scene representation, pushing the boundaries of zero-shot generalization and real-time performance.

Speakers

  • Ripudaman Singh Arora — Blue River Technology
  • Matteo Poggi — University of Bologna
  • Vítor Guizilini — Toyota Research Institute
  • Eric Brachmann — Niantic
  • Mykola Lavreniuk — Team EVP++
  • Guangyuan Zhou — PICO-MR
  • Aradhye Agarwal — Indian Institute of Technology Delhi
  • James Elder — York University

Talks (8)

  • 00:00:00 — Ripudaman Singh Arora: The 3rd Monocular Depth Estimation Challenge (Introduction)
    • Introduction to the workshop, its goals, the organizing committee, and the importance of monocular depth estimation in robotics and computer vision.
  • 02:09:00 — Matteo Poggi: Monocular Depth Estimation: Are We Done?
    • A comprehensive overview of the evolution of monocular depth estimation, from early self-supervised methods to modern foundational models, highlighting current limitations and future challenges.
  • 05:52:00 — Vítor Guizilini: An ODE to MonODEpth
    • Discusses advancements in monocular depth estimation, including self-supervised methods, scale-aware metric depth, multi-frame depth, and depth field networks, highlighting challenges and future directions.
  • 09:20:00 — Eric Brachmann: Metric Depth for Instant AR
    • Explores metric depth estimation for instant augmented reality (AR), addressing scale ambiguity, dynamic objects, and the need for robust pose estimation. Also introduces a new dataset and a workshop challenge.
  • 10:45:00 — Mykola Lavreniuk: 3rd Monocular Depth Estimation Challenge @ CVPR24 (EVP++ solution)
    • Presents the EVP++ solution for the Monocular Depth Estimation Challenge, utilizing diffusion-based models, automatic image captioning with BLIP-2, and a novel inverse multi-attentive feature alignment module for improved accuracy.
  • 10:53:00 — Guangyuan Zhou: High Quality Data makes great progress
    • Introduces the PICO-MR method, which leverages high-quality data selection and a refined Depth-Anything model for improved monocular depth estimation, addressing challenges in diverse scenes and camera differences.
  • 11:00:00 — Aradhye Agarwal: The 3rd Monocular Depth Estimation Challenge (visioniitd solution)
    • Presents the visioniitd solution for the Monocular Depth Estimation Challenge, which utilizes a ViT-based architecture with MLPs for scene embedding, aligning to CLIP space without intermediate text, and achieving strong zero-shot generalization.
  • 11:06:00 — James Elder: Ground Theory of Metric Monodepth
    • Proposes a ground theory approach to monocular depth estimation, leveraging semantic segmentation and geometric priors to infer depth without explicit learning, demonstrating surprisingly good metric depth maps.

Key Takeaways

  • Foundational models like Depth Anything, trained on vast datasets, significantly advance monocular depth estimation, but challenges remain with non-Lambertian surfaces and extreme viewpoints.
  • Integrating geometric priors and semantic information, even without direct depth supervision, can lead to surprisingly accurate metric depth maps and improved generalization across diverse scenes and camera models.
  • Novel architectures leveraging transformers, diffusion models, and implicit scene representations are crucial for achieving scale-aware metric depth, multi-frame consistency, and robust pose estimation in complex real-world scenarios like AR and autonomous driving.
  • The development of high-quality, diverse datasets and robust evaluation metrics, including 3D point cloud-based metrics and depth boundary metrics, is essential for fair benchmarking and driving progress in the field.
  • End-to-end differentiable pipelines that integrate feature extraction, matching, and pose optimization, combined with multi-stage curriculum learning and self-calibration techniques, show promise for overcoming limitations in traditional methods and achieving real-time performance.
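
The takeaways above distinguish scale-ambiguous from metric depth and stress evaluation metrics. The sketch below is illustrative only: it shows median scaling (a common way to align a scale-ambiguous prediction to ground truth before scoring) together with two image-based metrics that are widely used in the depth literature, absolute relative error and threshold accuracy. The function names and the toy depth values are hypothetical, and the sketch is not claimed to reproduce the challenge's own evaluation code or its point-cloud and boundary metrics.

```python
from statistics import median


def median_scale(pred, gt):
    """Rescale a prediction so its median matches the ground-truth median.

    Only pixels with positive prediction and ground truth are used to
    estimate the scale; the scale is then applied to every prediction.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    scale = median(g for _, g in valid) / median(p for p, _ in valid)
    return [p * scale for p in pred]


def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)


# Toy example (hypothetical values, in metres). A ground-truth value of 0.0
# marks a pixel without a measurement and is ignored by every function.
gt = [2.0, 4.0, 8.0, 0.0]
pred = [1.0, 2.0, 4.0, 3.0]  # correct structure, but half the true scale

aligned = median_scale(pred, gt)

print("raw     AbsRel:", abs_rel(pred, gt), "delta<1.25:", delta_accuracy(pred, gt))
print("aligned AbsRel:", abs_rel(aligned, gt), "delta<1.25:", delta_accuracy(aligned, gt))
```

In this toy case the raw prediction scores poorly only because its scale is wrong. After median scaling the error vanishes, which is why a metric-depth evaluation (no alignment allowed) is a stricter test than a scale-invariant one.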

Methods / Models / Datasets Mentioned

  • SfMLearner
  • Depth Anything
  • Depth Anything v2
  • Marigold
  • ChronoDepth
  • Depth4ToM
  • PackNet
  • Metric Velocity Supervision
  • Dense Depth for Automated Driving (DDAD)
  • Pseudo-Lidar
  • DD3D
  • DD3D v2
  • Self-Supervised Scene Flow
  • Tactile Sensors
  • Depth Field Networks
  • Equivariant Perceiver IO
  • DeLiRa
  • Scale-Aware Metric Depth
  • SuperGlue
  • DPT (Dense Prediction Transformer)
  • MicKey
  • LoFTR
  • RANSAC
  • Kabsch
  • ACE (Accelerated Coordinate Encoding)
  • ACE Zero
  • ACE Relocalizer
  • ACE Zero Relocalizer
  • EVP++
  • ZoeDepth
  • VPD
  • BLIP-2
  • CLIP
  • IMAFA (Inverse Multi-Attentive Feature Alignment)
  • PICO-MR
  • BEiT384-L
  • Metric3D
  • MiDaS
  • LeReS
  • Argoverse
  • DSEC
  • NYUv2
  • KITTI
  • Cityscapes
  • DIODE
  • SGM+Lidar
  • Kinect
  • City_Tartan
  • City_KITTI
  • ViT
  • MLPs
  • Ground Theory
  • InternImage
  • ADE20K
  • MCMLSD
  • FR
  • PixelFormer
  • MIM
  • AiT
  • Elder Lab (Segmentation net + statistical models)
  • ReadingLS (SwiftDepth)
  • FRDC-SH
  • HIT-AIIA
  • 3DCreators
  • RGA-Robots
  • SuperPoint

Topics

Monocular Depth Estimation · Self-Supervised Learning · Foundational Models · Scale-Aware Metric Depth · Multi-Frame Depth Estimation · Depth Field Networks · Augmented Reality (AR) · Automated Driving · Semantic Segmentation · Geometric Priors · Zero-Shot Generalization · Benchmarking & Evaluation Metrics · Diffusion Models · Camera Self-Calibration
