ICCV 2023 Workshop on Vision and Language Algorithmic Reasoning (VLAR)
Event: ICCV 2023 Workshop on Vision and Language Algorithmic Reasoning · Duration: 45 min
Abstract
This workshop session covers Visual Question Answering (VQA) and related multi-modal reasoning tasks. The presentations introduce a self-supervised graph neural network for scene-based VQA, a new dataset for benchmarking counterfactual reasoning in multi-modal language models, an analysis of text-based video question answering, a temporal fusion method for commonsense video question answering, and a dataset of visually grounded answers. A recurring theme is the limits of current models in handling complex reasoning, temporal information, and diverse answer groundings. The session highlights the need for more robust and interpretable VQA solutions that use fine-grained multi-modal fusion and capture the nuances of human-like reasoning.
Speakers
- Bruno Souza — UNICAMP / Universitetet i Oslo
- Honglu Zhou — Tongji University / University of Warwick / LunarAI
- Soumya Jahagirdar — International Institute of Information Technology, Hyderabad / Wadhwani AI / Computer Vision Center, UAB
- Mobeen Ahmad — PYLER Co., LTD.
- Chongyan Chen — University of Washington
Talks (5)
- 00:00:00 — Bruno Souza: SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering
- This talk introduces SelfGraphVQA, a self-supervised graph neural network that leverages generated scene graphs and similarity maximization strategies to enhance visual information for VQA, demonstrating improved performance and robustness on the GQA dataset.
- 02:22:00 — Honglu Zhou: What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
- This presentation introduces C-VQA, a new dataset designed to benchmark the counterfactual reasoning abilities of multi-modal large language models, revealing that current state-of-the-art models struggle significantly with counterfactual questions.
- 03:00:00 — Soumya Jahagirdar: Understanding Video Scenes through Text: Insights from Text-based Video Question Answering
- This talk explores the capabilities of text-based video question answering models on NewsVideoQA and M4-ViteVQA datasets, highlighting that most questions can be answered from single frames and rely heavily on textual information, with BERT-QA showing strong performance.
- 04:56:00 — Mobeen Ahmad: MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering
- This presentation introduces Multi-Modal Temporal Fusion (MMTF), a novel approach for commonsense video question answering that learns a unified representation by expanding text tokens along the video’s temporal axis, achieving improved performance on counterfactual and predictive VQA tasks.
- 05:53:00 — Chongyan Chen: VQA Therapy: Exploring Answer Differences by Grounding Answers
- This talk presents VQA Therapy, a new dataset that visually grounds unique answers to visual questions, and benchmarks existing VQA models on tasks of identifying and locating answer groundings, revealing challenges in recognizing multiple valid interpretations.
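The "similarity maximization" idea mentioned for SelfGraphVQA can be illustrated with a minimal sketch. The listing above does not specify the exact objective, so the code below assumes a generic negative-cosine-similarity loss between two embedded views of the same sample; the function names and the toy embeddings are hypothetical and are not taken from the talk.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_maximization_loss(view_a, view_b):
    """Mean negative cosine similarity over paired embeddings.

    Minimizing this value pulls the two views of each sample together.
    This is an assumed, generic objective, not the loss used in the talk.
    """
    sims = [cosine_similarity(a, b) for a, b in zip(view_a, view_b)]
    return -sum(sims) / len(sims)


# Hypothetical embeddings of two scene graphs under two augmentations.
view_a = [[1.0, 0.0], [0.0, 2.0]]
view_b = [[2.0, 0.0], [1.0, 0.0]]
loss = similarity_maximization_loss(view_a, view_b)
```

In this toy case the first pair is perfectly aligned and the second pair is orthogonal, so the loss sits halfway between its best value (-1) and the value for unrelated views (0).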
Key Takeaways
- Scene graphs, especially when combined with self-supervised learning, can significantly improve VQA performance and robustness, even surpassing human baselines on certain datasets.
- Current state-of-the-art multi-modal language models still struggle with counterfactual reasoning, highlighting a critical area for future research and the need for specialized datasets like C-VQA.
- Text-based video QA datasets often lack questions requiring multi-frame temporal reasoning, with many answers derivable from single frames or textual cues, suggesting a potential bias in current benchmarks.
- Fine-grained multi-modal temporal fusion is crucial for commonsense video QA, enabling models to learn unified representations from text and video that support complex reasoning tasks like prediction and counterfactual analysis.
- The existence of multiple valid answer groundings for a single visual question poses a significant challenge for VQA models, and new datasets and benchmarking tasks are needed to develop models that can recognize and leverage this diversity for more robust and trustworthy AI.
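As a rough illustration of what "expanding text tokens along the video's temporal axis" could look like, the sketch below repeats the question tokens once per frame and joins them with that frame's visual tokens. The summary does not describe how MMTF actually fuses the two streams, so the per-frame concatenation, the function name, and the toy shapes are all assumptions made for illustration.

```python
def expand_text_along_time(text_tokens, video_frames):
    """Repeat the text tokens for every frame and join them with that frame's tokens.

    text_tokens:  list of L token vectors
    video_frames: list of T frames, each a list of P visual token vectors
    returns:      list of T sequences, each holding L + P vectors

    Hypothetical helper: per-frame concatenation is an assumed fusion layout.
    """
    fused = []
    for frame_tokens in video_frames:
        # Copy the text tokens so every time step owns its own sequence.
        text_copy = [list(tok) for tok in text_tokens]
        frame_copy = [list(tok) for tok in frame_tokens]
        fused.append(text_copy + frame_copy)
    return fused


# Toy inputs: L = 3 text tokens, T = 4 frames, P = 2 visual tokens, D = 2.
text_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
video_frames = [[[1.0, 1.0], [2.0, 2.0]] for _ in range(4)]
fused = expand_text_along_time(text_tokens, video_frames)
```

Each of the T output sequences now pairs the full question with one time step, which is one way a downstream model could attend to text and video jointly at every frame.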
Methods / Models / Datasets Mentioned
SelfGraphVQA · GQA dataset · BERT · SelfSim · MCAN · VILBERT · C-VQA dataset · ViperGPT · VisProg · InstructBLIP · LLaVA · BLIP2 · GPT-3.5 Turbo · NewsVideoQA dataset · M4-ViteVQA dataset · BERT-QA · T5-ViteVQA · OCR-aware SINGULARITY · MMTF (Multi-Modal Temporal Fusion) · Causal-VidQA · NEXT-QA · AGQA-2.0 · MSVD-QA · HME · CoMem · HCRN · HGA · B2A · VQA-T · Co-Mem · HAIR · MASN · IGV · MHN · VGT · HQGA · VQA Therapy dataset · ViLT · mPLUG-Owl · VizWiz-VQA
Topics
Visual Question Answering (VQA) · Scene Graphs · Self-Supervised Learning · Graph Neural Networks (GNN) · Counterfactual Reasoning · Multi-modal Language Models · Video Question Answering · Temporal Reasoning · Text-based VQA · Answer Grounding