ICCVW 2023 VLAR Session 3
Event: ICCVW 2023 VLAR · Duration: 28 min
Abstract
This workshop session features three presentations on vision-and-language research. The first talk introduces Uni-NLX, a unified model that generates human-friendly textual explanations across diverse vision and vision-language tasks. The second, Le-RNR-Map, builds explicit map representations for embodied AI that support natural language querying and navigation using Neural Radiance Fields. The third presents CLIP-Decoder, an approach to zero-shot multi-label classification that uses multimodal CLIP-aligned representations to improve performance on unseen categories.
Speakers
- Fawaz Sammani — Vrije Universiteit Brussel (VUB), imec
- Nikos Deligiannis — Vrije Universiteit Brussel (VUB), imec
- Francesco Taioli — University of Verona
- Federico Cunico — University of Verona
- Federico Girella — University of Verona
- Riccardo Bologna — University of Verona
- Alessandro Farinelli — University of Verona
- Marco Cristani — University of Verona
- Muhammad Ali — Mohamed Bin Zayed University of Artificial Intelligence
- Salman Khan — Mohamed Bin Zayed University of Artificial Intelligence
Talks (3)
- 00:00:46 — Fawaz Sammani: Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
  - This paper introduces Uni-NLX, a single compact model that unifies seven different vision and vision-language tasks by generating human-friendly textual explanations, outperforming task-specific models with fewer parameters.
- 00:01:20 — Francesco Taioli: Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language
  - This paper presents Le-RNR-Map, a system that creates explicit map representations from RGB-D data and allows for natural language querying and navigation within these maps, leveraging NeRF reconstruction and negative prompts for improved object localization.
- 00:04:53 — Muhammad Ali: CLIP-Decoder: ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representations
  - This paper addresses zero-shot multi-label classification by proposing CLIP-Decoder, a model that leverages multimodal CLIP-aligned representations and custom templates to effectively classify multiple unseen labels in images, outperforming existing methods.
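To make the zero-shot multi-label setup in the last talk concrete, here is a minimal sketch. It is not the CLIP-Decoder architecture. It assumes image and text embeddings have already been produced by some CLIP-like encoder; the toy 3-d vectors, the `score_labels` helper, and the 0.5 threshold are illustrative assumptions. Each label's text embedding is averaged over several prompt templates, and every label is scored independently, so one image can receive several labels.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_embedding(vectors):
    """Average unit-normalized template embeddings, then re-normalize."""
    units = [normalize(v) for v in vectors]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return normalize(mean)


def score_labels(image_emb, label_templates, threshold=0.5):
    """Score each label independently by cosine similarity.

    `threshold` is a hypothetical cut-off chosen for this toy data.
    Returns (scores per label, sorted list of predicted labels).
    """
    img = normalize(image_emb)
    scores = {}
    for label, template_embs in label_templates.items():
        text = mean_embedding(template_embs)
        scores[label] = sum(a * b for a, b in zip(img, text))
    predicted = sorted(label for label, s in scores.items() if s >= threshold)
    return scores, predicted


# Toy embeddings standing in for encoder outputs (assumed, not real CLIP features).
# Each label has two template embeddings, e.g. "a photo of a {label}" variants.
label_templates = {
    "dog": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "beach": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
    "car": [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],
}
image_emb = [0.7, 0.7, 0.0]  # an image that resembles both "dog" and "beach"

scores, predicted = score_labels(image_emb, label_templates)
print(predicted)
```

Scoring labels independently, rather than with a single softmax over all labels, is what lets more than one label pass the threshold for the same image.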
Key Takeaways
- Textual explanations provide more detailed and human-understandable insights into model predictions compared to traditional visual explanations like heatmaps.
- Unified models like Uni-NLX can achieve competitive performance across multiple vision and vision-language tasks with significantly fewer parameters, promoting efficiency and knowledge transfer.
- Large Language Models are powerful tools for generating synthetic, rich explanation datasets, which can be used to train models for complex tasks without extensive human annotation.
- Explicit map representations, enhanced with multimodal features and NeRF reconstruction, enable advanced capabilities for embodied AI agents, including natural language-driven object search and navigation.
- Negative prompts and custom templates can be integrated into vision-language models: negative prompts improve localization accuracy by guiding the model’s attention, and custom templates improve zero-shot classification by supplying fine-grained feature descriptions.
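One way negative prompts can sharpen localization is sketched below. This illustrates the general idea only and is not Le-RNR-Map's confirmed implementation. For each map cell, the similarity to the query is compared against each negative prompt with a pairwise softmax, and the lowest resulting probability is kept, so a cell scores high only if it matches the query better than every generic distractor. The embeddings, the `relevance` and `localize` helpers, and the temperature value are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevance(cell_emb, query_emb, negative_embs, temperature=0.1):
    """Worst-case pairwise softmax of the query against each negative prompt.

    `temperature` is an assumed value for this toy example.
    """
    pos = math.exp(cosine(cell_emb, query_emb) / temperature)
    probs = []
    for neg_emb in negative_embs:
        neg = math.exp(cosine(cell_emb, neg_emb) / temperature)
        probs.append(pos / (pos + neg))
    return min(probs)


def localize(cells, query_emb, negative_embs):
    """Return the key of the map cell with the highest relevance."""
    return max(cells, key=lambda k: relevance(cells[k], query_emb, negative_embs))


# Toy per-cell features keyed by grid position (assumed, not real map features).
cells = {
    (0, 0): [0.9, 0.1, 0.0],  # close to the query direction
    (0, 1): [0.1, 0.9, 0.1],  # close to a negative prompt
    (1, 0): [0.5, 0.5, 0.5],  # ambiguous: equally similar to everything
}
query_emb = [1.0, 0.0, 0.0]  # e.g. the embedding of "a chair"
negative_embs = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # e.g. "object", "stuff"

best_cell = localize(cells, query_emb, negative_embs)
print(best_cell)
```

The ambiguous cell gets a relevance of 0.5, so a downstream threshold above 0.5 would discard it, which is the practical benefit of scoring against negatives instead of using raw similarity alone.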
Methods / Models / Datasets Mentioned
Uni-NLX · GradCAM · NLX-GPT · CLIP · CuPL · LaFter · ImageNetX · VQA-ParaX · T-SNE · Le-RNR-Map · NeRF · Generative Scene Networks (GSNs) · CLIP-Decoder · TResNet · ML-Decoder · GAP-SDL · LESA · BIAM · SDM · NUS-WIDE Dataset
Topics
Textual Explanations · Vision-Language Tasks · Explainable AI (XAI) · Unified Models · Large Language Models (LLMs) · Zero-shot Learning · Embodied AI · Neural Radiance Fields (NeRF) · Object Localization · Multi-label Classification
Notes
Open for commentary — connections to other work, critiques, follow-up reading.