VISION-AND-LANGUAGE ALGORITHMIC REASONING (VLAR)

Event: International Conference on Computer Vision 2023 Workshop · Duration: 12 min

Abstract

This workshop introduction surveys recent progress in computer vision, natural language processing, and multimodal models, with a focus on vision-and-language algorithmic reasoning. The speaker illustrates open challenges with examples such as a bird-on-fence math problem and city map navigation, showing that large language models such as ChatGPT-4 and Bard can perform complex reasoning yet often struggle with precise spatial understanding, numerical accuracy, and appropriate confidence calibration. The discussion argues for models that can construct and reason about world models, manage uncertainty, and combine modular components for language and vision, rather than relying on a single large pre-trained model.

Speakers

Joshua B. Tenenbaum — Massachusetts Institute of Technology (MIT)
Anoop Cherian — Mitsubishi Electric Research Labs (MERL)
Kevin A. Smith — Massachusetts Institute of Technology (MIT)

Talks (1)

00:00:00 — Joshua B. Tenenbaum: Introduction to Vision-and-Language Algorithmic Reasoning (VLAR)
- An introductory talk highlighting the current state and challenges of vision-and-language algorithmic reasoning, using examples to demonstrate the limitations of current large language models in tasks requiring precise spatial and numerical understanding.

Key Takeaways

Current large multimodal models demonstrate impressive language capabilities but often lack precise algorithmic reasoning, especially in tasks involving spatial and numerical understanding.
Models like ChatGPT-4 can generate plausible reasoning steps but may fail on geometric or numerical details, leading to incorrect answers despite appearing to ‘understand’ the problem.
Overconfidence and lack of self-correction are significant issues, as models may insist on incorrect answers or provide flawed justifications when challenged.
Future research should focus on developing models that can construct and reason about explicit world models, manage uncertainty, and integrate modular components for vision and language, rather than relying solely on end-to-end pre-trained models.
Combining language models with symbolic representations and physics simulators (e.g., the Probabilistic Language of Thought, PLoT) offers a promising direction for achieving more robust and human-like algorithmic reasoning.

Methods / Models / Datasets Mentioned

ChatGPT-4
Bard
SMART-101 dataset
Probabilistic Language of Thought (PLoT)

Topics

Vision-and-Language Reasoning · Algorithmic Reasoning · Multimodal Models · Large Language Models (LLMs) · Cognitive AI · Spatial Reasoning · Numerical Reasoning · Intuitive Physics · World Models · Uncertainty

Notes

Open for commentary — connections to other work, critiques, follow-up reading.