Image Matching: Local Features and Beyond
Event: CVPR 2024 Workshop · Duration: 257 min
Abstract
The CVPR 2024 Image Matching Workshop, “Local Features and Beyond,” brought together researchers to discuss the latest advancements and challenges in image matching and 3D reconstruction. The workshop featured invited talks, paper presentations, and a Kaggle challenge (Hexathlon) focused on robustly estimating camera poses and 3D scene structure under various challenging conditions, including symmetries, transparent objects, and natural environments. Key themes included the integration of deep learning with traditional geometric methods, the importance of robust feature matching, and the ongoing quest for accurate and scalable 3D reconstruction from diverse image collections. The event highlighted the community’s efforts to push beyond conventional benchmarks and address real-world complexities in computer vision.
Speakers
- Dmytro Mishkin — CTU Prague/HOVER Inc.
- Fabio Bellavia — Univ. Palermo
- Jiri Matas — CTU Prague
- Luca Morelli — U. Trento/FBK
- Fabio Remondino — Bruno Kessler Foundation
- Weiwei Sun — U. British Columbia
- Amy Tabb — USDA-ARS-AFRS
- Eduard Trulls — Google
- Kwang Moo Yi — U. British Columbia
- Noah Snavely — Cornell Tech & Google DeepMind
- Juan Tardós — Universidad de Zaragoza
- Vincent Leroy — NAVER labs
- Johan Edstedt — CVL, Linköping University
- Georg Bökman — Chalmers University of Technology
- Zhenjun Zhao — Chinese University of Hong Kong / Texas A&M University
- Hongkai Chen — Apple Inc. / HKUST
- Zixin Luo — Apple Inc.
- Yurun Tian — Apple Inc.
- Xuyang Bai — Apple Inc.
- Ziyu Wang — Apple Inc.
- Lei Zhou — Apple Inc.
- Mingmin Zhen — Apple Inc.
- Tian Fang — Apple Inc.
- David McKinnon — Apple Inc.
- Yanghai Tsin — Apple Inc.
- Long Quan — HKUST
- Gabriele Berton — Politecnico di Torino
- Gabriele Goletto — Politecnico di Torino
- Gabriele Trivigno — Politecnico di Torino
- Alex Stoken — NASA Johnson Space Center
- Barbara Caputo — Politecnico di Torino
- Carlo Masone — Politecnico di Torino
- Önder Tuzcuoğlu — METU Center for Image Analysis
- Aybora Köksal — METU Center for Image Analysis
- Buğra Sofu — METU Center for Image Analysis
- Sinan Kalkan — METU Center for Image Analysis
- A. Aydın Alatan — METU Center for Image Analysis
- Amulya Pendota — Lab For Video and Image Analysis (LFOVIA), IIT Hyderabad
- Sumohana S. Channappayya — Lab For Video and Image Analysis (LFOVIA), IIT Hyderabad
- Vladislav Ostankovich — ITMO University
- Yuki Kashiwaba — Iterra Solutions Inc.
- Ammar Ali — ITMO University
- Igor Lashkov — University of Hawaii
- Jaafar Mahmud — ITMO University
- Hao Yu (ZJU3DV) — Zhejiang University
- Jianyuan Wang — Visual Geometry Group, University of Oxford
- Minghao Chen — Meta AI
- Christian Rupprecht — Meta AI
- David Novotny — Meta AI
- Motonobu Hommi — Lumada Data Science Lab., Hitachi, Ltd.
Talks (15)
- 00:00:00 — Dmytro Mishkin: Image Matching: Local Features and Beyond (CVPR 2024 Workshop)
- Introduction to the workshop, its organizers, sponsors, agenda, history, and the motivation behind focusing on image matching challenges.
- 00:04:00 — Noah Snavely: MegaScenes: Reconstructing All of the World’s Landmarks
- Presentation on the MegaScenes dataset and methods for large-scale 3D reconstruction of world landmarks from internet photos, highlighting challenges with symmetries and the need for robust feature matching.
- 00:57:30 — Juan Tardós: Visual SLAM inside the human body
- Discussion on the challenges of applying Visual SLAM techniques inside the human body due to non-rigid deformations, poor texture, illumination changes, and monocular endoscopes, and presenting solutions using deformable tracking and neural reconstruction.
- 01:42:20 — Vincent Leroy: From DUSt3R to MASt3R Stereo 3D Reconstruction
- Introduction to DUSt3R, a data-driven stereo 3D reconstruction method that predicts point maps from image pairs, and its evolution to MASt3R, which incorporates explicit matching for improved accuracy and robustness.
- 02:14:10 — Johan Edstedt: DeDoDe v2: Analyzing and Improving the DeDoDe Keypoint Detector
- Presentation of DeDoDe v2, an improved keypoint detector that addresses limitations of v1 by incorporating shorter training times, better regularization, and top-k per image NMS, leading to significant quantitative improvements on various benchmarks.
- 02:29:10 — Hongkai Chen: Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching
- Introduction of AffineFormer, a semi-dense matching method that uses affine-based deformable attention and selective global-local message fusion to achieve sub-pixel accuracy and robustness against large viewpoint changes.
- 02:41:50 — Gabriele Berton: EarthMatch: Iterative Coregistration for Fine-grained Localization of Astronaut Photography
- Presentation of EarthMatch, a method for fine-grained localization of astronaut photography by iteratively coregistering query images with a database of satellite images, achieving high confidence and pixel-wise localization.
- 02:52:50 — Önder Tuzcuoğlu: XoFTR: Cross-modal Feature Matching Transformer
- Introduction of XoFTR, a cross-modal feature matching transformer that leverages two-stage training with masked image modeling and pseudo-thermal data augmentation to achieve state-of-the-art performance on visible-thermal image matching benchmarks.
- 03:02:30 — Amulya Pendota: Are Deep Learning Models Pre-trained on RGB Data Good Enough for RGB-Thermal Image Retrieval?
- Evaluation of various RGB pre-trained models for RGB-Thermal image retrieval, highlighting the challenges of modality inconsistency and the need for task-specific datasets, while demonstrating that some models can achieve good cross-domain generalization.
- 03:15:30 — Fabio Bellavia: Image Matching Challenge 2024 - Hexathlon
- Overview of the 2024 Image Matching Challenge (Hexathlon) on Kaggle, introducing new categories like transparent objects and natural environments, and a new evaluation metric based on camera centers.
- 03:26:40 — Vladislav Ostankovich: Image Matching Challenge 2024 - Hexathlon
- Presentation of the 1st place solution for the Image Matching Challenge 2024, focusing on a multi-stage matching pipeline for general and transparent scenes, leveraging rotation detection, feature extraction, and 3D reconstruction techniques.
- 03:37:00 — Hao Yu (ZJU3DV): 6th ZJU3DV presentation
- Presentation of the 6th place solution for the Image Matching Challenge 2024, focusing on a robust and accurate approach for general and transparent scenes, leveraging multi-stage matching, iterative refinement, and feature track refinement.
- 03:46:20 — Jianyuan Wang: 3rd Place Solution in Image Matching Challenge 2024: VGGSfM
- Presentation of the 3rd place solution for the Image Matching Challenge 2024, detailing the VGGSfM framework for differentiable SfM, its integration into a COLMAP pipeline for improved accuracy, and lessons learned regarding GPU memory constraints on Kaggle.
- 04:09:08 — Motonobu Hommi: 8th place solution in Image Matching Challenge 2024
- Presentation of the 8th place solution for the Image Matching Challenge 2024, outlining a pipeline that combines image retrieval, multi-stage matching, and reconstruction with COLMAP using both simple-radial and simple-pinhole camera models.
- 04:14:54 — Dmytro Mishkin: What next?
- Concluding remarks for the workshop, discussing future directions for image matching research, including improving error metrics, scaling to larger datasets, moving to open benchmarks, and fostering community collaboration.
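The Hexathlon overview above mentions an evaluation metric based on camera centers. The following is a minimal sketch of that idea, not the challenge's scoring code. It assumes world-to-camera poses (x_cam = R X + t), so the camera center is C = -Rᵀt, and it assumes estimated and ground-truth poses already share a coordinate frame; any alignment step is out of scope here. The function names `camera_center` and `center_accuracy` and the threshold value are hypothetical.

```python
import math


def camera_center(R, t):
    """Camera center C = -R^T t for a world-to-camera pose x_cam = R @ X + t."""
    return [-sum(R[r][c] * t[r] for r in range(3)) for c in range(3)]


def center_accuracy(est_poses, gt_poses, threshold):
    """Fraction of cameras whose estimated center is within `threshold` of ground truth.

    Assumption: both pose lists are in the same frame and in the same order.
    """
    hits = 0
    for (R_e, t_e), (R_g, t_g) in zip(est_poses, gt_poses):
        if math.dist(camera_center(R_e, t_e), camera_center(R_g, t_g)) <= threshold:
            hits += 1
    return hits / len(gt_poses)


# Toy usage: two cameras with identity rotation, threshold chosen arbitrarily.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
est = [(I, [0, 0, -1]), (I, [0, 0, -2])]
gt = [(I, [0, 0, -1.05]), (I, [0, 0, -3])]
accuracy = center_accuracy(est, gt, threshold=0.1)
```

Scoring on centers rather than on per-pair relative poses means a single badly placed camera only affects its own term, which is one plausible reason to prefer it for multi-view reconstructions.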
Key Takeaways
- Image matching and 3D reconstruction are still active research areas with significant challenges, especially in unconstrained real-world scenarios.
- Deep learning methods are increasingly integrated with classical geometric approaches, often outperforming traditional techniques, but also introducing new challenges like memory constraints and generalization.
- The community is moving towards more complex and realistic benchmarks, such as the Hexathlon challenge, which includes diverse categories like transparent objects, natural environments, and temporal changes.
- Open data and code repositories are crucial for fostering collaboration and accelerating progress in the field.
- Future directions involve developing more robust and accurate methods for handling symmetries, occlusions, illumination changes, and non-rigid deformations, as well as exploring holistic approaches that integrate multimodal data and leverage foundation models.
Methods / Models / Datasets Mentioned
COLMAP · NeRF · Gaussian Splatting · DUSt3R · VGGSfM · AceZero · Google Live View · SIFT · LoFTR · DINO-ViT · RoMa · OmniGlue · XFeat · KeyNetAffHardNet · DISK · PatchmatchNet · GeoMVSNet · DPT-KITTI · SuperPoint · SuperGlue · ALIKED · LightGlue · NetVLAD · PatchNetVLAD · MixVPR · R2Former · SGM · ResNet-18 · ResNet-34 · ResNet-50 · ResNet-101 · ResNet-152 · SqueezeNet · VGG16 · AlexNet · TokenCut · DBSCAN · Horn alignment · ORB-SLAM3 · NR-SLAM · CudaSIFT-SLAM · LightDepth · NeuS · LightNeuS · AffineFormer · TTA (Test-Time Augmentation) · Rot90 (Rotation by 90 degrees) · tf-efficientnet-b7 · TSP (Traveling Salesman Problem) · SIMM (Image Similarity Matrix) · DeepSfM · PoseDiffusion · PixSfM · Deep Point Tracker · 2D CNN (in VGGSfM) · Transformer (in VGGSfM) · Bundle Adjustment (in VGGSfM) · Multi-view Feature Transformer (in ZJU3DV) · Multi-view Matcher (in ZJU3DV) · Multi-view Correlation (in ZJU3DV) · Cost Volume (in ZJU3DV)
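Rot90 and test-time augmentation appear in the list above, and the 1st place talk mentions rotation detection. As a rough illustration of how such a step can work, the sketch below tries all four 90-degree rotations of a query image and keeps the one that scores best against a reference. It is a toy under stated assumptions: images are nested lists, and `toy_score` is a hypothetical stand-in for a real matcher's match count; none of this is taken from any team's pipeline.

```python
def rot90(img):
    """Rotate a 2D list of pixels by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def toy_score(a, b):
    """Hypothetical stand-in for a matcher: count identical pixels."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return 0
    return sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def best_rotation(query, reference, score_fn=toy_score):
    """Return (k, score): the number of 90-degree turns that maximizes score_fn."""
    best_k, best_score = 0, score_fn(query, reference)
    rotated = query
    for k in range(1, 4):
        rotated = rot90(rotated)
        score = score_fn(rotated, reference)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy usage: the query is the reference turned once, so three more turns undo it.
reference = [[1, 2, 3], [4, 5, 6]]
query = rot90(reference)
k, score = best_rotation(query, reference)
```

In a real pipeline `score_fn` would wrap a feature matcher and return, for example, the number of inlier matches; the search loop itself stays the same.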
Topics
Image Matching · 3D Reconstruction · Local Features · Structure from Motion (SfM) · Deep Learning for Vision · Multi-view Geometry · Pose Estimation · Dataset Challenges · Robustness in Vision · Cross-modal Matching