Multimodal Algorithmic Reasoning Workshop & SMART-101 Challenge Awards

Event: CVPR 2024 Workshop · Duration: 209 min

Abstract

This video records the talks and award presentations from the Multimodal Algorithmic Reasoning Workshop and SMART-101 Challenge at CVPR 2024. Speakers cover abstraction in humans and machines, spatial representations in multimodal AI systems, autonomous evaluation and refinement of digital agents, and recent advances in vision foundation models. One talk introduces FunSearch, a method that pairs large language models with program search to discover new mathematical knowledge and efficient algorithms. The workshop also announces the Best Paper Award and the SMART-101 Challenge winners.

Speakers

Anoop Cherian

Talks (9)

00:00:00 — Anoop Cherian: Multimodal Algorithmic Reasoning Workshop & SMART-101 Challenge Awards
- Welcome and introduction to the Multimodal Algorithmic Reasoning Workshop and SMART-101 Challenge Awards.
00:29:00 — Mark Ho: Abstraction in Humans and Machines
- This talk explores the concept of abstraction in human cognition and its implications for machine learning, particularly in the context of general-purpose cognition and multimodal representations.
01:07:00 — Scott O. Murray: Spatial Representations in Multimodal AI Systems
- This talk investigates the ability of multimodal AI systems, specifically GPT-4, to form and transform spatial representations, drawing parallels with human visual and cognitive neuroscience.
01:14:44 — Dobrik Georgiev: The Deep Equilibrium Algorithmic Reasoner
- This talk introduces the Deep Equilibrium Algorithmic Reasoner (DEAR), a novel approach that combines deep equilibrium models with algorithmic reasoning to solve complex visio-linguistic puzzles.
01:19:48 — Jiayi Pan: Autonomous Evaluation and Refinement of Digital Agents
- This talk presents a framework for autonomous evaluation and refinement of digital agents, leveraging large language models (LLMs) to improve agent performance without human intervention.
01:59:59 — Lijuan Wang: Recent Advances in Vision Foundation Models
- This talk provides an overview of recent advances in vision foundation models, highlighting the evolution from CLIP to large multimodal models (LMMs) and diffusion models, and their potential as world simulators.
02:37:32 — Emilien Dupont: FunSearch: Mathematical discoveries from program search with LLMs
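- This talk introduces FunSearch, a method that pairs large language models (LLMs) with program search: an LLM proposes candidate programs, an automated evaluator scores them, and the best candidates seed further proposals, with the aim of discovering new mathematical knowledge and efficient algorithms.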
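
To make the idea of program search concrete, here is a minimal sketch of a propose, evaluate, and keep-the-best loop on a toy online bin-packing task. It is not the FunSearch implementation: the LLM is replaced by a hypothetical `propose_variant` stub that samples from a fixed pool of function bodies, and the names `SEED_PROGRAM`, `CANDIDATE_BODIES`, `evaluate`, and `program_search` are all assumptions made for illustration.

```python
import random

# Toy problem: online bin packing. A candidate "program" is Python source
# defining priority(item, free), which scores a bin with `free` remaining
# capacity; each item goes to the best-scoring bin that can hold it.
SEED_PROGRAM = "def priority(item, free):\n    return 0.0\n"

# Hypothetical stand-in for the LLM: a fixed pool of function bodies.
CANDIDATE_BODIES = [
    "return 0.0",
    "return -free",
    "return free",
    "return -(free - item)",
    "return -(free - item) ** 2",
]


def propose_variant(parent_source, rng):
    """Stub for an LLM call. A real system would prompt with parent_source;
    this stub ignores it and samples a body at random."""
    body = rng.choice(CANDIDATE_BODIES)
    return f"def priority(item, free):\n    {body}\n"


def evaluate(source, items, capacity=10):
    """Run a candidate on one instance. Score is -(bins used); higher is better."""
    namespace = {}
    exec(source, namespace)  # acceptable only because these sources are trusted
    priority = namespace["priority"]
    bins = []  # remaining free capacity of each open bin
    for item in items:
        fits = [i for i, free in enumerate(bins) if free >= item]
        if fits:
            best = max(fits, key=lambda i: priority(item, bins[i]))
            bins[best] -= item
        else:
            bins.append(capacity - item)
    return -len(bins)


def program_search(items, rounds=30, seed=0):
    """Repeatedly propose a child from a top-scoring parent and keep its score."""
    rng = random.Random(seed)
    population = [(evaluate(SEED_PROGRAM, items), SEED_PROGRAM)]
    for _ in range(rounds):
        population.sort(key=lambda p: p[0], reverse=True)
        _, parent = rng.choice(population[:3])  # sample among the current best
        child = propose_variant(parent, rng)
        try:
            score = evaluate(child, items)
        except Exception:
            continue  # discard candidates that crash
        population.append((score, child))
    return max(population, key=lambda p: p[0])


if __name__ == "__main__":
    items = [4, 8, 1, 4, 2, 1, 7, 3, 6, 5]
    best_score, best_source = program_search(items)
    print("bins used:", -best_score)
    print(best_source)
```

The evaluator is the part that keeps the search honest: every proposal is executed and scored, so a bad suggestion from the proposer costs only one evaluation. The published system is considerably more elaborate than this loop.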

Key Takeaways

The workshop highlights the growing importance of multimodal AI systems in achieving general-purpose cognition, drawing parallels between human abstraction and machine learning capabilities.
Spatial reasoning remains a significant challenge for current multimodal AI models such as GPT-4, which struggle with spatial transformations and viewer-centered perspectives. The talk takes this as evidence that new computational approaches are needed.
FunSearch, which combines LLMs with program search, is reported to discover new mathematical knowledge and efficient algorithms, in some cases outperforming state-of-the-art computational solvers.
Large multimodal models (LMMs) are progressing rapidly: open-source and proprietary models keep improving on benchmarks such as MM-Vet and show emergent capabilities in visual pointing, spot-the-difference tasks, and interleaved image-text sequences.
Suggested directions for future research include grounding LMMs, visual prompting, and multimodal agents, with particular emphasis on developing world simulators and on addressing the reasoning and generalization limits of current LLMs.

Methods / Models / Datasets Mentioned

GPT-4
FunSearch
CLIP
LMMs
GPT-4V
Diffusion Model
DALL-E 3
Sora
GIT
Flamingo
LLaVA
MM-Vet
MMMU

Topics

Multimodal AI · Algorithmic Reasoning · Abstraction · Spatial Representations · Vision Foundation Models · Large Language Models (LLMs) · Program Search · Mathematical Discoveries · Autonomous Agents · SMART-101 Challenge
