Vision and Language Algorithmic Reasoning Work & SMART-101 Challenge Awards

Event: ICCV 2023 1st Workshop on Vision and Language Algorithmic Reasoning · Duration: 19 min

Abstract

This workshop session presents the awards for the ICCV 2023 Vision and Language Algorithmic Reasoning (VLAR) workshop, including best paper awards sponsored by Math Kangaroo USA and the SMART-101 Challenge Winner Award sponsored by Mitsubishi Electric Research Labs. The winning team of the SMART-101 Challenge provides a detailed presentation of their solution, outlining their multimodal approach to solving complex visual question answering puzzles. The session concludes with closing remarks, highlighting the significance of algorithmic reasoning for generalization in AI and announcing future workshop events.

Speakers

Anoop Cherian — Mitsubishi Electric Research Labs
Xiangyu Wu — Nanjing University of Science and Technology
Honglu Zhou — Nanjing University of Science and Technology

Talks (3)

00:00:00 — Anoop Cherian: Vision and Language Algorithmic Reasoning Work & SMART-101 Challenge Awards
- Presentation of Best Paper Awards and the SMART-101 Challenge Award, highlighting the sponsors and winning teams.
00:09:32 — Xiangyu Wu: Solution For SMART-101 Challenge (ICCV 2023)
- The winning team presents their solution for the SMART-101 Challenge, detailing their method, architecture, and results on the private test set.
00:16:45 — Anoop Cherian: Closing Remarks
- Concluding remarks for the workshop, emphasizing the importance of algorithmic reasoning, future events, and opportunities for researchers.

Key Takeaways

Best Paper Awards went to work on unifying textual explanations for vision-language tasks, iterative robust visual grounding, and self-supervised graph neural networks for scene-based question answering.
The SMART-101 Challenge requires models to solve elementary mathematical and logical puzzles from images and text, demanding out-of-distribution generalization.
The winning SMART-101 solution used a divide-and-conquer approach that combined a large language model (Llama-2) for question-type prediction, object detection (YOLOv7), OCR (PaddleOCR), and BLIP-2 with visual adapters.
The winning team reported large gains on text-only puzzles but smaller gains on vision-language puzzles, which suggests that multimodal reasoning remains the harder problem.
The workshop emphasized the growing importance of algorithmic reasoning for achieving robust generalization in AI, particularly in scenarios requiring complex logical and mathematical inference beyond simple pattern recognition.
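
The divide-and-conquer idea in the takeaways above (predict the puzzle type first, then hand the puzzle to a specialized branch) can be illustrated with a minimal Python sketch. This is not the winning team's code. The `Puzzle` structure, the branch labels, the keyword heuristic standing in for the Llama-2 question-type predictor, and the placeholder solvers are all assumptions made so the example runs without any model. The summary also does not state which components (YOLOv7, PaddleOCR, BLIP-2) serve which branch, so the branches are left as stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical branch labels; the summary only distinguishes
# text-only puzzles from vision-language puzzles.
TEXT_ONLY = "text_only"
VISION_LANGUAGE = "vision_language"


@dataclass
class Puzzle:
    """A multiple-choice puzzle (illustrative structure, not the dataset schema)."""
    question: str
    options: List[str]
    image_path: str


Solver = Callable[[Puzzle], str]


def predict_puzzle_type(puzzle: Puzzle) -> str:
    """Stand-in for the question-type predictor.

    The summary says an LLM (Llama-2) fills this role; a keyword
    heuristic is used here only so the sketch runs without a model.
    """
    visual_cues = ("picture", "image", "shown", "figure", "diagram")
    question = puzzle.question.lower()
    if any(cue in question for cue in visual_cues):
        return VISION_LANGUAGE
    return TEXT_ONLY


def solve(puzzle: Puzzle, solvers: Dict[str, Solver],
          classify: Callable[[Puzzle], str] = predict_puzzle_type) -> str:
    """Divide and conquer: classify the puzzle, then dispatch to one solver."""
    return solvers[classify(puzzle)](puzzle)


def text_branch(puzzle: Puzzle) -> str:
    # Placeholder solver: a real branch would call the actual models here.
    return f"[text branch] {puzzle.question}"


def vision_branch(puzzle: Puzzle) -> str:
    # Placeholder solver: a real branch would call the actual models here.
    return f"[vision-language branch] {puzzle.question}"


SOLVERS: Dict[str, Solver] = {
    TEXT_ONLY: text_branch,
    VISION_LANGUAGE: vision_branch,
}

if __name__ == "__main__":
    arithmetic = Puzzle("What is 3 + 4?", ["5", "6", "7", "8", "9"], "puzzle_a.png")
    counting = Puzzle("How many circles are shown in the picture?",
                      ["1", "2", "3", "4", "5"], "puzzle_b.png")
    print(solve(arithmetic, SOLVERS))
    print(solve(counting, SOLVERS))
```

Passing the classifier and the solvers in as arguments keeps the routing step separate from the models, so either side can be replaced (for example, swapping the heuristic for an LLM call) without touching the dispatch logic.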

Methods / Models / Datasets Mentioned

Uni-NLX
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision
SelfGraphVQA
BLIP-2
BLIP-2 Flan-T5-XXL
YOLOv7
OCR
PaddleOCR
Llama-2
AdaptFormer

Topics

Vision and Language · Algorithmic Reasoning · SMART-101 Challenge · Best Paper Awards · Multimodal AI · Visual Question Answering · Generalization · Mathematical Reasoning · Out-of-Distribution Generalization · AI Workshops
