ICCVW 2023 VLAR Session 3

Event: ICCVW 2023 VLAR · Duration: 28 min

Abstract

This workshop session features three presentations on vision-and-language research. The first talk introduces Uni-NLX, a unified model that generates human-friendly textual explanations across diverse vision and vision-language tasks. The second, Le-RNR-Map, builds explicit map representations for embodied AI that can be queried and navigated with natural language, using Neural Radiance Fields. The third, CLIP-Decoder, is an approach to zero-shot multi-label classification that uses multimodal CLIP-aligned representations to improve performance on unseen categories.

Speakers

  • Fawaz Sammani — Vrije Universiteit Brussel (VUB), imec
  • Nikos Deligiannis — Vrije Universiteit Brussel (VUB), imec
  • Francesco Taioli — University of Verona
  • Federico Cunico — University of Verona
  • Federico Girella — University of Verona
  • Riccardo Bologna — University of Verona
  • Alessandro Farinelli — University of Verona
  • Marco Cristani — University of Verona
  • Muhammad Ali — Mohamed Bin Zayed University of Artificial Intelligence
  • Salman Khan — Mohamed Bin Zayed University of Artificial Intelligence

Talks (3)

  • 00:00:46 Fawaz Sammani: Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
    • This paper introduces Uni-NLX, a single compact model that handles seven vision and vision-language tasks by generating human-friendly textual explanations, matching or outperforming task-specific models with fewer parameters.
  • 00:01:20 Francesco Taioli: Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language
    • This paper presents Le-RNR-Map, a system that creates explicit map representations from RGB-D data and allows for natural language querying and navigation within these maps, leveraging NeRF reconstruction and negative prompts for improved object localization.
  • 00:04:53 Muhammad Ali: CLIP-Decoder: ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representations
    • This paper addresses zero-shot multi-label classification by proposing CLIP-Decoder, a model that leverages multimodal CLIP-aligned representations and custom templates to effectively classify multiple unseen labels in images, outperforming existing methods.
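
The zero-shot multi-label setting from the last talk can be illustrated with a toy sketch: each label name is placed into a prompt template, and each label embedding is scored independently against the image embedding, so several labels can be returned for one image. This is only an illustration of the general idea, not the CLIP-Decoder architecture. The template string, the threshold, and the hand-written 3-dimensional vectors are all assumptions standing in for real CLIP text and image embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def build_prompts(labels, template="a photo of a {}."):
    """Fill a (hypothetical) prompt template with each label name."""
    return {label: template.format(label) for label in labels}


def multilabel_zero_shot(image_emb, label_embs, threshold=0.5):
    """Score every label independently and keep all that pass the threshold.

    Unlike a softmax over labels, independent scoring lets several labels
    be predicted for the same image.
    """
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    kept = [label for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda label: scores[label], reverse=True)
    return kept, scores


# Toy stand-ins for embeddings; a real pipeline would encode the prompts
# from build_prompts(...) with a text encoder and the image with an image encoder.
image_emb = [0.9, 0.8, 0.0]
label_embs = {
    "dog": [1.0, 0.0, 0.0],
    "beach": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

prompts = build_prompts(label_embs)
predicted, scores = multilabel_zero_shot(image_emb, label_embs)
print(prompts["dog"], predicted)
```

In this toy case both "dog" and "beach" pass the threshold while "car" does not, which is the behaviour a single-label softmax classifier could not express.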

Key Takeaways

  • Textual explanations provide more detailed and human-understandable insights into model predictions compared to traditional visual explanations like heatmaps.
  • Unified models like Uni-NLX can achieve competitive performance across multiple vision and vision-language tasks with significantly fewer parameters, promoting efficiency and knowledge transfer.
  • Large Language Models are powerful tools for generating synthetic, rich explanation datasets, which can be used to train models for complex tasks without extensive human annotation.
  • Explicit map representations, enhanced with multimodal features and NeRF reconstruction, enable advanced capabilities for embodied AI agents, including natural language-driven object search and navigation.
  • Negative prompts can improve localization accuracy, and custom templates can improve zero-shot classification performance, by guiding the model’s attention and supplying fine-grained feature descriptions.
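
One common way to use negative prompts for localization can be sketched as follows: the query embedding is compared against each negative prompt in a pairwise softmax, and the worst-case result is kept as the relevancy of a map cell. This is a minimal sketch under stated assumptions and not necessarily the scoring used in Le-RNR-Map. The function names, the temperature value, and the toy 3-dimensional vectors are hypothetical stand-ins for real language-aligned features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def relevancy(feature, query_emb, negative_embs, temperature=0.1):
    """Worst-case pairwise softmax of the query against each negative prompt.

    Requires at least one negative prompt. A cell scores high only if it
    matches the query better than every negative.
    """
    pos = math.exp(cosine(feature, query_emb) / temperature)
    pairwise = []
    for neg_emb in negative_embs:
        neg = math.exp(cosine(feature, neg_emb) / temperature)
        pairwise.append(pos / (pos + neg))
    return min(pairwise)


def localize(cell_features, query_emb, negative_embs):
    """Return the index of the most relevant map cell and all cell scores."""
    scores = [relevancy(f, query_emb, negative_embs) for f in cell_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy map with three cells; vectors are illustrative, not real embeddings.
cell_features = [
    [1.0, 0.0, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.2, 1.0],
]
query_emb = [0.0, 1.0, 0.0]  # e.g. the embedding of "chair"
negative_embs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # e.g. "object", "stuff"

best, scores = localize(cell_features, query_emb, negative_embs)
print(best, [round(s, 4) for s in scores])
```

The negatives act as a reference point: a cell that weakly resembles the query but strongly resembles a generic negative is suppressed, which is the intuition behind the improved localization described above.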

Methods / Models / Datasets Mentioned

  • Uni-NLX
  • Grad-CAM
  • NLX-GPT
  • CLIP
  • CuPL
  • LaFTer
  • ImageNetX
  • VQA-ParaX
  • t-SNE
  • Le-RNR-Map
  • NeRF
  • Generative Scene Networks (GSNs)
  • CLIP-Decoder
  • TResNet
  • ML-Decoder
  • GAP-SDL
  • LESA
  • BiAM
  • SDM
  • NUS-WIDE Dataset

Topics

Textual Explanations · Vision-Language Tasks · Explainable AI (XAI) · Unified Models · Large Language Models (LLMs) · Zero-shot Learning · Embodied AI · Neural Radiance Fields (NeRF) · Object Localization · Multi-label Classification

