Image Matching: Local Features and Beyond

Event: CVPR 2024 Workshop · Duration: 257 min

Abstract

The CVPR 2024 Image Matching Workshop, “Local Features and Beyond,” brought together researchers to discuss the latest advancements and challenges in image matching and 3D reconstruction. The workshop featured invited talks, paper presentations, and a Kaggle challenge (Hexathlon) focused on robustly estimating camera poses and 3D scene structure under various challenging conditions, including symmetries, transparent objects, and natural environments. Key themes included the integration of deep learning with traditional geometric methods, the importance of robust feature matching, and the ongoing quest for accurate and scalable 3D reconstruction from diverse image collections. The event highlighted the community’s efforts to push beyond conventional benchmarks and address real-world complexities in computer vision.

Speakers

Dmytro Mishkin — CTU Prague/HOVER Inc.
Fabio Bellavia — Univ. Palermo
Jiri Matas — CTU Prague
Luca Morelli — U. Trento/FBK
Fabio Remondino — Bruno Kessler Foundation
Weiwei Sun — U. British Columbia
Amy Tabb — USDA-ARS-AFRS
Eduard Trulls — Google
Kwang Moo Yi — U. British Columbia
Noah Snavely — Cornell Tech & Google DeepMind
Juan Tardós — Universidad de Zaragoza
Vincent Leroy — NAVER labs
Johan Edstedt — CVL, Linköping University
Georg Bökman — Chalmers University of Technology
Zhenjun Zhao — Chinese University of Hong Kong / Texas A&M University
Hongkai Chen — Apple Inc. / HKUST
Zixin Luo — Apple Inc.
Yurun Tian — Apple Inc.
Xuyang Bai — Apple Inc.
Ziyu Wang — Apple Inc.
Lei Zhou — Apple Inc.
Mingmin Zhen — Apple Inc.
Tian Fang — Apple Inc.
David McKinnon — Apple Inc.
Yanghai Tsin — Apple Inc.
Long Quan — HKUST
Gabriele Berton — Politecnico di Torino
Gabriele Goletto — Politecnico di Torino
Gabriele Trivigno — Politecnico di Torino
Alex Stoken — NASA Johnson Space Center
Barbara Caputo — Politecnico di Torino
Carlo Masone — Politecnico di Torino
Önder Tuzcuoğlu — METU Center for Image Analysis
Aybora Köksal — METU Center for Image Analysis
Buğra Sofu — METU Center for Image Analysis
Sinan Kalkan — METU Center for Image Analysis
A. Aydın Alatan — METU Center for Image Analysis
Amulya Pendota — Lab For Video and Image Analysis (LFOVIA), IIT Hyderabad
Sumohana S. Channappayya — Lab For Video and Image Analysis (LFOVIA), IIT Hyderabad
Vladislav Ostankovich — ITMO University
Yuki Kashiwaba — Iterra Solutions Inc.
Ammar Ali — ITMO University
Igor Lashkov — University of Hawaii
Jaafar Mahmud — ITMO University
Hao Yu (ZJU3DV) — Zhejiang University
Jianyuan Wang — Visual Geometry Group, University of Oxford
Minghao Chen — Meta AI
Christian Rupprecht — Meta AI
David Novotny — Meta AI
Motonobu Hommi — Lumada Data Science Lab., Hitachi, Ltd.

Talks (15)

00:00:00 — Dmytro Mishkin: Image Matching: Local Features and Beyond (CVPR 2024 Workshop)
- Introduction to the workshop, its organizers, sponsors, agenda, history, and the motivation behind focusing on image matching challenges.
00:04:00 — Noah Snavely: MegaScenes: Reconstructing All of the World’s Landmarks
- Presentation on the MegaScenes dataset and methods for large-scale 3D reconstruction of world landmarks from internet photos, highlighting challenges with symmetries and the need for robust feature matching.
00:57:30 — Juan Tardós: Visual SLAM inside the human body
- Discussion of why Visual SLAM is hard to apply inside the human body (non-rigid deformations, poor texture, illumination changes, and monocular endoscopes), followed by solutions based on deformable tracking and neural reconstruction.
01:42:20 — Vincent Leroy: From DUSt3R to MASt3R Stereo 3D Reconstruction
- Introduction to DUSt3R, a data-driven stereo 3D reconstruction method that predicts point maps from image pairs, and its evolution into MASt3R, which adds explicit matching for improved accuracy and robustness.
02:14:10 — Johan Edstedt: DeDoDe v2: Analyzing and Improving the DeDoDe Keypoint Detector
- Presentation of DeDoDe v2, an improved keypoint detector that addresses limitations of v1 through shorter training, better regularization, and top-k NMS applied per image, leading to significant quantitative improvements on several benchmarks.
02:29:10 — Hongkai Chen: Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching
- Introduction of AffineFormer, a semi-dense matching method that uses affine-based deformable attention and selective global-local message fusion to achieve sub-pixel accuracy and robustness against large viewpoint changes.
02:41:50 — Gabriele Berton: EarthMatch: Iterative Coregistration for Fine-grained Localization of Astronaut Photography
- Presentation of EarthMatch, a method for fine-grained localization of astronaut photography by iteratively coregistering query images with a database of satellite images, achieving high confidence and pixel-wise localization.
02:52:50 — Önder Tuzcuoğlu: XoFTR: Cross-modal Feature Matching Transformer
- Introduction of XoFTR, a cross-modal feature matching transformer that leverages two-stage training with masked image modeling and pseudo-thermal data augmentation to achieve state-of-the-art performance on visible-thermal image matching benchmarks.
03:02:30 — Amulya Pendota: Are Deep Learning Models Pre-trained on RGB Data Good Enough for RGB-Thermal Image Retrieval?
- Evaluation of various RGB pre-trained models for RGB-Thermal image retrieval, highlighting the challenges of modality inconsistency and the need for task-specific datasets, while demonstrating that some models can achieve good cross-domain generalization.
03:15:30 — Fabio Bellavia: Image Matching Challenge 2024 - Hexathlon
- Overview of the 2024 Image Matching Challenge (Hexathlon) on Kaggle, introducing new categories like transparent objects and natural environments, and a new evaluation metric based on camera centers.
03:26:40 — Vladislav Ostankovich: Image Matching Challenge 2024 - Hexathlon
- Presentation of the 1st place solution for the Image Matching Challenge 2024, focusing on a multi-stage matching pipeline for general and transparent scenes, leveraging rotation detection, feature extraction, and 3D reconstruction techniques.
03:37:00 — Hao Yu (ZJU3DV): 6th place ZJU3DV presentation
- Presentation of the 6th place solution for the Image Matching Challenge 2024, focusing on a robust and accurate approach for general and transparent scenes, leveraging multi-stage matching, iterative refinement, and feature track refinement.
03:46:20 — Jianyuan Wang: 3rd Place Solution in Image Matching Challenge 2024: VGGSfM
- Presentation of the 3rd place solution for the Image Matching Challenge 2024, detailing the VGGSfM framework for differentiable SfM, its integration into a COLMAP pipeline for improved accuracy, and lessons learned regarding GPU memory constraints on Kaggle.
04:09:08 — Motonobu Hommi: 8th place solution in Image Matching Challenge 2024
- Presentation of the 8th place solution for the Image Matching Challenge 2024, outlining a pipeline that combines image retrieval, multi-stage matching, and reconstruction with COLMAP using both simple-radial and simple-pinhole camera models.
04:14:54 — Dmytro Mishkin: What next?
- Concluding remarks for the workshop, discussing future directions for image matching research, including improving error metrics, scaling to larger datasets, moving to open benchmarks, and fostering community collaboration.
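
The Hexathlon overview above mentions an evaluation metric based on camera centers, and Horn alignment appears among the methods listed further down. The summary does not give the exact procedure, so the following is only a minimal sketch of one plausible reading: register the estimated camera centers to the ground truth with a closed-form similarity transform, then report the fraction of centers that land within each distance threshold. The function names and the thresholds are hypothetical, and the plain least-squares fit used here is not robust to outlier cameras, which a real benchmark would likely need to handle.

```python
import numpy as np


def align_similarity(src, dst):
    """Least-squares similarity transform (scale, R, t) mapping src onto dst.

    Closed-form Horn/Umeyama-style solution; src and dst are (N, 3) arrays
    of corresponding points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)          # cross-covariance of the two sets
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                  # avoid returning a reflection
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)  # variance of the source points
    scale = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - scale * R @ mu_s
    return scale, R, t


def center_accuracy(est_centers, gt_centers, thresholds):
    """Fraction of aligned camera centers within each threshold (hypothetical)."""
    est = np.asarray(est_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    scale, R, t = align_similarity(est, gt)
    aligned = scale * est @ R.T + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return [float((err < th).mean()) for th in thresholds]
```

Averaging the returned fractions over the thresholds gives a single accuracy-style score per scene. Because a reconstruction is only defined up to a similarity transform, some alignment step of this kind is needed before distances between centers are meaningful.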

Key Takeaways

Image matching and 3D reconstruction are still active research areas with significant challenges, especially in unconstrained real-world scenarios.
Deep learning methods are increasingly integrated with classical geometric approaches, often outperforming traditional techniques, but also introducing new challenges like memory constraints and generalization.
The community is moving towards more complex and realistic benchmarks, such as the Hexathlon challenge, which includes diverse categories like transparent objects, natural environments, and temporal changes.
Open data and code repositories are crucial for fostering collaboration and accelerating progress in the field.
Future directions include more robust and accurate handling of symmetries, occlusions, illumination changes, and non-rigid deformations, as well as holistic approaches that integrate multimodal data and leverage foundation models.

Methods / Models / Datasets Mentioned

COLMAP
NeRF
Gaussian Splatting
DUSt3R
VGGSfM
ACE Zero (ACE0)
Google Live View
SIFT
LoFTR
DINO-ViT
RoMa
OmniGlue
XFeat
KeyNetAffNetHardNet
DISK
PatchmatchNet
GeoMVSNet
DPT-KITTI
SuperPoint
SuperGlue
ALIKED
LightGlue
NetVLAD
PatchNetVLAD
MixVPR
R2Former
SGM
ResNet-18
ResNet-34
ResNet-50
ResNet-101
ResNet-152
SqueezeNet
VGG16
AlexNet
TokenCut
DBSCAN
Horn alignment
ORB-SLAM3
NR-SLAM
CudaSIFT-SLAM
LightDepth
NeuS
LightNeuS
AffineFormer
TTA (Test-Time Augmentation)
Rot90 (Rotation by 90 degrees)
tf_efficientnet_b7
TSP (Traveling Salesman Problem)
SIMM (Image Similarity Matrix)
DeepSfM
PoseDiffusion
PixSfM
Deep Point Tracker
2D CNN (in VGGSfM)
Transformer (in VGGSfM)
Bundle Adjustment (in VGGSfM)
Multi-view Feature Transformer (in ZJU3DV)
Multi-view Matcher (in ZJU3DV)
Multi-view Correlation (in ZJU3DV)
Cost Volume (in ZJU3DV)
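
Rot90 and TTA appear in the list above, and the first-place talk mentions rotation detection. The summary does not say how any team implemented it, so this is only a sketch under one assumption: rotation handling amounts to trying the four 90-degree rotations of one image, keeping the one that yields the most matches, and mapping keypoints found in the rotated image back to the original pixel frame. `count_matches` is a hypothetical callable standing in for any matcher, and both function names are made up for illustration.

```python
import numpy as np


def unrotate_point(x, y, k, width, height):
    """Map pixel (x, y) in np.rot90(image, k) back to the unrotated image.

    width and height are the dimensions of the unrotated image.
    """
    k %= 4
    for step in range(k, 0, -1):
        # Width of the image after (step - 1) rotations; it alternates
        # between the original width and the original height.
        prev_width = width if (step - 1) % 2 == 0 else height
        # Undo one counterclockwise 90-degree rotation.
        x, y = prev_width - 1 - y, x
    return x, y


def best_rotation(image_a, image_b, count_matches):
    """Return k in {0, 1, 2, 3} whose rotation of image_b matches image_a best.

    count_matches is a hypothetical callable (image, image) -> int.
    """
    scores = [count_matches(image_a, np.rot90(image_b, k)) for k in range(4)]
    return int(np.argmax(scores))
```

A pipeline built this way would run the matcher on `np.rot90(image_b, k)` for the selected `k` and pass each keypoint through `unrotate_point` before geometric verification, so that downstream reconstruction sees coordinates in the original image frame.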

Topics

Image Matching · 3D Reconstruction · Local Features · Structure from Motion (SfM) · Deep Learning for Vision · Multi-view Geometry · Pose Estimation · Dataset Challenges · Robustness in Vision · Cross-modal Matching
