Foundational Few-Shot Object Detection Challenge

Event: CVPR 2024 Challenge: Visual Perception via Learning in an Open World Workshop · Duration: 206 min

Abstract

The CVPR 2024 Workshop on Visual Perception via Learning in an Open World (VPLOW) hosted the Foundational Few-Shot Object Detection (FSOD) Challenge, the Open Vocabulary Part Segmentation (OV-PARTS) Challenge, and the V3Det Challenge. This workshop session provided an overview of these challenges, presented winning solutions, and discussed community insights. Key topics included addressing limitations in current VLM-based FSOD benchmarking protocols, developing methods for generalized zero-shot part segmentation, and leveraging large-scale pre-training and multi-modal models for object detection in vast vocabularies.

Speakers

Neehar Peri — Carnegie Mellon University
Xuejian Gou — Xidian University
Yunhan Yang — Zhejiang University
Lingchen Meng — Fudan University

Talks (4)

00:00:00 — Neehar Peri: Foundational Few-Shot Object Detection Challenge
- Introduction to the Foundational Few-Shot Object Detection (FSOD) Challenge, highlighting the limitations of current VLM-based FSOD benchmarking protocols and proposing new approaches.
00:32:00 — Xuejian Gou: Open Vocabulary Part Segmentation Challenge @ CVPR 2024 (OV-PARTS)
- Presentation of the OV-PARTS challenge, focusing on generalized zero-shot part segmentation and cross-dataset part segmentation, and introducing the PartCLIPSeg method.
00:33:50 — Yunhan Yang: OV-PARTS Challenge 2024
- Presentation of the 3D-SAP method for multi-granularity 3D part segmentation, including large-scale pre-training, sample-specific fine-tuning, and semantic querying with MLLMs.
00:34:50 — Lingchen Meng: RichSem-DINO-FocalNet for V3Det Challenge 2024
- Presentation of the RichSem-DINO-FocalNet method, achieving first place in the OVD track and second place in the Supervised track of the V3Det Challenge 2024.

Key Takeaways

The Foundational Few-Shot Object Detection (FSOD) Challenge questions current VLM-based benchmarking protocols, citing concept alignment issues between pre-training data and target domains.
Open Vocabulary Part Segmentation (OV-PARTS) requires models to generalize across different part granularities and handle novel objects without extensive retraining, and such models often struggle with data scarcity.
The V3Det Challenge highlights the need for robust object detection in vast vocabularies and high-resolution images, where traditional methods show limited improvements.
Multi-modal chat assistants and prompt tuning can significantly improve performance in object detection by leveraging external knowledge and refining concept alignment.
Integrating external knowledge and leveraging diverse generative models are crucial for advancing comprehensive perception, understanding, reasoning, acting, and planning in open-world scenarios.

Methods / Models / Datasets Mentioned

GroundingDINO
MQ-GLIP
Co-DETR
MixPL
RichSem-DINO-FocalNet
FocalNet-Huge
GPT-4V(ision)
Segment Anything Model (SAM)
PartCLIPSeg
CLIP
CLIPSeg
CAT-Seg
3D-SAP
DINOv2
Point Transformer V3 (PTv3)
MLLM
Cascade R-CNN
EVA-CLIP

Topics

Foundational Few-Shot Object Detection · Open Vocabulary Part Segmentation · V3Det Challenge · Vision-Language Models · Zero-Shot Learning · Multi-Modal Prompting · 3D Part Segmentation · Object Detection
