ICCV 2023 Workshop on Vision and Language Algorithmic Reasoning (VLAR)

Event: ICCV 2023 Workshop on Vision and Language Algorithmic Reasoning · Duration: 45 min

Abstract

This workshop session presents five talks on Visual Question Answering (VQA) and related multi-modal reasoning tasks. The talks cover a self-supervised graph neural network for scene-based VQA, a new dataset for benchmarking counterfactual reasoning in multi-modal language models, an analysis of text-based video question answering, a temporal fusion method for commonsense video question answering, and a dataset that visually grounds differing answers to the same question. A recurring theme is the limits of current models in handling complex reasoning, temporal information, and diverse answer groundings. The session points to the need for more robust and interpretable VQA systems that use fine-grained multi-modal fusion and capture the nuances of human-like reasoning.

Speakers

Bruno Souza — UNICAMP / Universitetet i Oslo
Honglu Zhou — Tongji University / University of Warwick / LunarAI
Soumya Jahagirdar — International Institute of Information Technology, Hyderabad / Wadhwani AI / Computer Vision Center, UAB
Mobeen Ahmad — PYLER Co., LTD.
Chongyan Chen — University of Washington

Talks (5)

00:00:00 — Bruno Souza: SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering
- This talk introduces SelfGraphVQA, a self-supervised graph neural network that leverages generated scene graphs and similarity maximization strategies to enhance visual information for VQA, demonstrating improved performance and robustness on the GQA dataset.
02:22:00 — Honglu Zhou: What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
- This presentation introduces C-VQA, a new dataset designed to benchmark the counterfactual reasoning abilities of multi-modal large language models, revealing that current state-of-the-art models struggle significantly with counterfactual questions.
03:00:00 — Soumya Jahagirdar: Understanding Video Scenes through Text: Insights from Text-based Video Question Answering
- This talk evaluates text-based video question answering models on the NewsVideoQA and M4-ViteVQA datasets, finding that most questions can be answered from a single frame and rely heavily on textual information, with BERT-QA showing strong performance.
04:56:00 — Mobeen Ahmad: MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering
- This presentation introduces Multi-Modal Temporal Fusion (MMTF), a novel approach for commonsense video question answering that learns a unified representation by expanding text tokens along the video’s temporal axis, achieving improved performance on counterfactual and predictive VQA tasks.
05:53:00 — Chongyan Chen: VQA Therapy: Exploring Answer Differences by Grounding Answers
- This talk presents VQA Therapy, a new dataset that visually grounds unique answers to visual questions, and benchmarks existing VQA models on tasks of identifying and locating answer groundings, revealing challenges in recognizing multiple valid interpretations.
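
The MMTF summary above describes expanding text tokens along the video's temporal axis. A minimal shape-level sketch of that idea follows. It rests on assumptions: the summary does not say how the expanded tokens are combined with video features, so the per-frame concatenation here is only one plausible reading, and the function names `expand_text_along_time` and `fuse_per_frame` are hypothetical. Plain nested lists stand in for tensors to keep the example dependency-free.

```python
from typing import List

Vector = List[float]


def expand_text_along_time(text_tokens: List[Vector], num_frames: int) -> List[List[Vector]]:
    """Repeat the text token sequence once per frame: [L, D] -> [T, L, D]."""
    return [[list(tok) for tok in text_tokens] for _ in range(num_frames)]


def fuse_per_frame(video_tokens: List[List[Vector]], text_tokens: List[Vector]) -> List[List[Vector]]:
    """Concatenate each frame's visual tokens with its own copy of the text tokens.

    video_tokens has shape [T, P, D]; the result has shape [T, P + L, D].
    ASSUMPTION: concatenation is an illustrative choice, not the method from the talk.
    """
    expanded = expand_text_along_time(text_tokens, len(video_tokens))
    return [frame + text for frame, text in zip(video_tokens, expanded)]


# Toy input: T=3 frames, P=2 visual tokens per frame, L=4 text tokens, D=2.
video = [[[float(t), float(p)] for p in range(2)] for t in range(3)]
text = [[-1.0, float(i)] for i in range(4)]

fused = fuse_per_frame(video, text)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 6 2
```

Each frame now carries the full question text next to its own visual tokens, so any later layer that operates per frame can relate words to what is visible at that time step.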

Key Takeaways

Scene graphs, especially when combined with self-supervised learning, can significantly improve VQA performance and robustness, even surpassing human baselines on certain datasets.
Current state-of-the-art multi-modal language models still struggle with counterfactual reasoning, highlighting a critical area for future research and the need for specialized datasets like C-VQA.
Text-based video QA datasets often lack questions requiring multi-frame temporal reasoning, with many answers derivable from single frames or textual cues, suggesting a potential bias in current benchmarks.
Fine-grained multi-modal temporal fusion is crucial for commonsense video QA, enabling models to learn unified representations from text and video that support complex reasoning tasks like prediction and counterfactual analysis.
The existence of multiple valid answer groundings for a single visual question poses a significant challenge for VQA models. New datasets and benchmarking tasks are needed to develop models that recognize and use this diversity, making them more robust and trustworthy.
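
One way to make "multiple valid answer groundings" concrete is to compare the image regions that different answers point to. The sketch below is an assumption, not the procedure from the talk: it represents each grounding as a set of pixel coordinates, measures overlap with intersection over union (IoU), and treats answers as sharing one grounding when every pair exceeds a threshold. The threshold of 0.9, the helper `rect`, and the example answers are all hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Pixel = Tuple[int, int]


def rect(x0: int, y0: int, x1: int, y1: int) -> Set[Pixel]:
    """Hypothetical helper: pixels of an axis-aligned box, end-exclusive."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def shares_single_grounding(groundings: Dict[str, Set[Pixel]], threshold: float = 0.9) -> bool:
    """True if every pair of answers points to (nearly) the same region.

    ASSUMPTION: the pairwise-IoU criterion and the threshold are illustrative.
    """
    return all(
        iou(groundings[x], groundings[y]) >= threshold
        for x, y in combinations(groundings, 2)
    )


# Two answers that refer to different objects in the image.
different = {"laptop": rect(0, 0, 4, 4), "cup": rect(2, 2, 6, 6)}
# Two answers that are synonyms for the same object.
same = {"laptop": rect(0, 0, 4, 4), "computer": rect(0, 0, 4, 4)}

print(shares_single_grounding(different))  # False
print(shares_single_grounding(same))       # True
```

In the first case the answers differ because annotators looked at different regions. In the second they differ only in wording, which is the distinction the takeaway asks models to recognize.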

Methods / Models / Datasets Mentioned

SelfGraphVQA
GQA dataset
BERT
SelfSim
MCAN
ViLBERT
C-VQA dataset
ViperGPT
VisProg
InstructBLIP
LLaVA
BLIP-2
GPT-3.5 Turbo
NewsVideoQA dataset
M4-ViteVQA dataset
BERT-QA
T5-ViteVQA
OCR-aware SINGULARITY
MMTF (Multi-Modal Temporal Fusion)
Causal-VidQA
NExT-QA
AGQA-2.0
MSVD-QA
HME
Co-Mem
HCRN
HGA
B2A
VQA-T
HAIR
MASN
IGV
MHN
VGT
HQGA
VQA Therapy dataset
ViLT
mPLUG-Owl
VizWiz-VQA

Topics

Visual Question Answering (VQA) · Scene Graphs · Self-Supervised Learning · Graph Neural Networks (GNN) · Counterfactual Reasoning · Multi-modal Language Models · Video Question Answering · Temporal Reasoning · Text-based VQA · Answer Grounding
