Foundational Few-Shot Object Detection Challenge
Event: CVPR 2024 Challenge: Visual Perception via Learning in an Open World Workshop · Duration: 206 min
Abstract
The CVPR 2024 Workshop on Visual Perception via Learning in an Open World (VPLOW) hosted the Foundational Few-Shot Object Detection (FSOD) Challenge, the Open Vocabulary Part Segmentation (OV-PARTS) Challenge, and the V3Det Challenge. This workshop session provided an overview of these challenges, presented winning solutions, and discussed community insights. Key topics included addressing limitations in current VLM-based FSOD benchmarking protocols, developing methods for generalized zero-shot part segmentation, and leveraging large-scale pre-training and multi-modal models for object detection in vast vocabularies.
Speakers
- Neehar Peri — Carnegie Mellon University
- Xuejian Gou — Xidian University
- Yunhan Yang — Zhejiang University
- Lingchen Meng — Fudan University
Talks (4)
- 00:00:00 — Neehar Peri: Foundational Few-Shot Object Detection Challenge
- Introduction to the Foundational Few-Shot Object Detection (FSOD) Challenge, highlighting the limitations of current VLM-based FSOD benchmarking protocols and proposing new approaches.
- 00:32:00 — Xuejian Gou: Open Vocabulary Part Segmentation Challenge @ CVPR 2024 (OV-PARTS)
- Presentation of the OV-PARTS challenge, focusing on generalized zero-shot part segmentation and cross-dataset part segmentation, and introducing the PartCLIPSeg method.
- 00:33:50 — Yunhan Yang: OV-PARTS Challenge 2024
- Presentation of the 3D-SAP method for multi-granularity 3D part segmentation, including large-scale pre-training, sample-specific fine-tuning, and semantic querying with MLLMs.
- 00:34:50 — Lingchen Meng: RichSem-DINO-FocalNet for V3Det Challenge 2024
- Presentation of the RichSem-DINO-FocalNet method, achieving first place in the OVD track and second place in the Supervised track of the V3Det Challenge 2024.
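The generalized zero-shot setting mentioned in the OV-PARTS talk evaluates a model on seen and unseen part categories at the same time. A common way to summarise such results is the harmonic mean of the seen-class and unseen-class mIoU, which stays low unless both are reasonably high. The sketch below illustrates that summary. The function name `harmonic_miou` is hypothetical, and whether the challenge ranked entries by exactly this metric is an assumption, not something stated in the session summary.

```python
def harmonic_miou(seen_miou: float, unseen_miou: float) -> float:
    """Harmonic mean of seen-class and unseen-class mIoU.

    Illustrative only: assumes both inputs are mIoU values on the same
    scale (for example fractions in [0, 1]).
    """
    if seen_miou < 0 or unseen_miou < 0:
        raise ValueError("mIoU values must be non-negative")
    total = seen_miou + unseen_miou
    if total == 0:
        # Both scores are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2 * seen_miou * unseen_miou / total
```

With this summary, a model that does well on seen parts but fails on unseen ones scores near zero, which is why it is often preferred over a plain average in generalized zero-shot evaluation.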
Key Takeaways
- The Foundational Few-Shot Object Detection (FSOD) Challenge calls current VLM-based FSOD benchmarking protocols into question, pointing to concept alignment issues between pre-training data and target domains.
- Open Vocabulary Part Segmentation (OV-PARTS) requires models to generalize across different part granularities and to handle novel objects without extensive retraining; data scarcity often makes this difficult.
- The V3Det Challenge highlights the need for robust object detection in vast vocabularies and high-resolution images, where traditional methods show limited improvements.
- Multi-modal chat assistants and prompt tuning can significantly improve performance in object detection by leveraging external knowledge and refining concept alignment.
- Integrating external knowledge and leveraging diverse generative models are crucial for advancing comprehensive perception, understanding, reasoning, acting, and planning in open-world scenarios.
Methods / Models / Datasets Mentioned
GroundingDINO · MQ-GLIP · Co-DETR · MixPL · RichSem-DINO-FocalNet · FocalNet-Huge · GPT-4V(ision) · Segment Anything Model (SAM) · PartCLIPSeg · CLIP · CLIPSeg · CAT-Seg · 3D-SAP · DINOv2 · Point Transformer V3 (PTv3) · MLLM · Cascade R-CNN EVA-CLIP
Topics
Foundational Few-Shot Object Detection · Open Vocabulary Part Segmentation · V3Det Challenge · Vision-Language Models · Zero-Shot Learning · Multi-Modal Prompting · 3D Part Segmentation · Object Detection