ICCVW 2023 VLAR Session 3

Event: ICCVW 2023 VLAR · Duration: 28 min

Abstract

This workshop session features three presentations on vision-and-language research. The first talk introduces Uni-NLX, a unified model that generates human-friendly textual explanations across diverse vision and vision-language tasks. The second, Le-RNR-Map, builds explicit map representations for embodied AI on top of Neural Radiance Fields, so that the maps can be queried and navigated with natural language. The third presents CLIP-Decoder, an approach to zero-shot multi-label classification that uses multimodal CLIP-aligned representations to improve performance on unseen categories.

Speakers

Fawaz Sammani — Vrije Universiteit Brussel (VUB), imec
Nikos Deligiannis — Vrije Universiteit Brussel (VUB), imec
Francesco Taioli — University of Verona
Federico Cunico — University of Verona
Federico Girella — University of Verona
Riccardo Bologna — University of Verona
Alessandro Farinelli — University of Verona
Marco Cristani — University of Verona
Muhammad Ali — Mohamed Bin Zayed University of Artificial Intelligence
Salman Khan — Mohamed Bin Zayed University of Artificial Intelligence

Talks (3)

00:00:46 — Fawaz Sammani: Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
- This paper introduces Uni-NLX, a single compact model that generates human-friendly textual explanations for seven different vision and vision-language tasks and, despite having fewer parameters, outperforms task-specific models.
00:01:20 — Francesco Taioli: Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language
- This paper presents Le-RNR-Map, a system that creates explicit map representations from RGB-D data and allows for natural language querying and navigation within these maps, leveraging NeRF reconstruction and negative prompts for improved object localization.
00:04:53 — Muhammad Ali: CLIP-Decoder: ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representations
- This paper addresses zero-shot multi-label classification by proposing CLIP-Decoder, a model that leverages multimodal CLIP-aligned representations and custom templates to effectively classify multiple unseen labels in images, outperforming existing methods.
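
To make the zero-shot multi-label setting in the third talk concrete, here is a minimal sketch of template-based label scoring. It is not the CLIP-Decoder architecture: it only illustrates the general idea of averaging several template embeddings per label and accepting every label whose similarity to the image clears a threshold. The embeddings are hand-made toy vectors standing in for CLIP features, and `score_labels`, the threshold value, and the label names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def score_labels(image_emb, label_templates, threshold=0.5):
    """Return {label: cosine score} for every label at or above the threshold.

    label_templates maps a label to a list of embeddings, one per text
    template (for example "a photo of a {label}"). Labels are scored
    independently, so any number of them can be accepted for one image.
    """
    image_unit = normalize(image_emb)
    accepted = {}
    for label, template_embs in label_templates.items():
        # Average the template embeddings, then compare with the image.
        prototype = normalize(mean_vector(template_embs))
        score = sum(a * b for a, b in zip(image_unit, prototype))
        if score >= threshold:
            accepted[label] = score
    return accepted


# Toy stand-ins for CLIP features (assumed, not real model outputs).
toy_image = [0.7, 0.7, 0.0]  # an image showing both a dog and a beach
toy_templates = {
    "dog": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "beach": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "car": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
predicted = score_labels(toy_image, toy_templates)
print(sorted(predicted))
```

With real features, the toy vectors would be replaced by image and text embeddings from a CLIP encoder, and the threshold would need tuning on held-out data.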

Key Takeaways

Textual explanations provide more detailed and human-understandable insights into model predictions than traditional visual explanations such as heatmaps.
Unified models like Uni-NLX can achieve competitive performance across multiple vision and vision-language tasks with significantly fewer parameters, promoting efficiency and knowledge transfer.
Large Language Models are powerful tools for generating synthetic, rich explanation datasets, which can be used to train models for complex tasks without extensive human annotation.
Explicit map representations, enhanced with multimodal features and NeRF reconstruction, enable advanced capabilities for embodied AI agents, including natural language-driven object search and navigation.
Negative prompts and custom templates can be integrated into vision-language models to improve localization accuracy and zero-shot classification performance, respectively, by guiding the model’s attention and leveraging fine-grained feature descriptions.
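
As an illustration of how negative prompts can sharpen localization, the sketch below scores a map cell by comparing its similarity to the query against its similarity to each negative prompt, and keeps the worst case. This pairwise-softmax formulation is an assumption chosen for illustration; the session summary does not state the exact scoring used in Le-RNR-Map. The vectors are toy stand-ins for CLIP-aligned features, and `relevancy` and its `temperature` parameter are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevancy(cell_emb, query_emb, negative_embs, temperature=0.1):
    """Score in (0, 1): how strongly a cell prefers the query over negatives.

    For each negative prompt, take a two-way softmax between the query and
    that negative, then return the minimum query probability. A cell scores
    high only if the query beats every negative prompt.
    """
    query_weight = math.exp(cosine(cell_emb, query_emb) / temperature)
    probabilities = []
    for negative in negative_embs:
        negative_weight = math.exp(cosine(cell_emb, negative) / temperature)
        probabilities.append(query_weight / (query_weight + negative_weight))
    return min(probabilities)


# Toy embeddings (assumed values, not real model outputs).
query = [1.0, 0.0, 0.0]            # e.g. the text "chair"
negatives = [[0.0, 1.0, 0.0],      # e.g. a generic prompt such as "object"
             [0.6, 0.6, 0.0]]      # e.g. a generic prompt such as "things"
chair_cell = [0.9, 0.1, 0.0]       # map cell that really contains the chair
generic_cell = [0.5, 0.5, 0.1]     # cell that matches everything a little

chair_score = relevancy(chair_cell, query, negatives)
generic_score = relevancy(generic_cell, query, negatives)
print(chair_score > generic_score)
```

Without the negatives, the generic cell would still have a fairly high raw cosine similarity to the query; comparing against generic prompts is what pushes its score down.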

Methods / Models / Datasets Mentioned

Uni-NLX
Grad-CAM
NLX-GPT
CLIP
CuPL
LaFTer
ImageNetX
VQA-ParaX
t-SNE
Le-RNR-Map
NeRF
Generative Scene Networks (GSNs)
CLIP-Decoder
TResNet
ML-Decoder
GAP-SDL
LESA
BiAM
SDM
NUS-WIDE Dataset

Topics

Textual Explanations · Vision-Language Tasks · Explainable AI (XAI) · Unified Models · Large Language Models (LLMs) · Zero-shot Learning · Embodied AI · Neural Radiance Fields (NeRF) · Object Localization · Multi-label Classification
