Multimodal Foundational Models: MM Video Understanding & Vision-Language Guided Robotics

Event: CVPR 2024 Workshop · Duration: 128 min

Abstract

The CVPR 2024 Workshop on Multimodal Foundational Models examined the intersection of multimodal (MM) video understanding and vision-language guided robotics. Speakers highlighted the limitations of current multimodal models in embodied understanding and physical reasoning, particularly in complex real-world scenarios. The discussions emphasized learning from procedural videos as a way to train agents that can perform intricate tasks and predict state transitions. The research presented covered dense video captioning, modular reasoning for video question answering, and approaches to improving human-robot interaction through multimodal interfaces. A recurring theme was the value of large datasets, from instructional videos to movies, for building AI’s ability to comprehend and interact with the physical world across temporal and semantic scales.

Speakers

Ivan Laptev — Inria, ENS, PSL
Cordelia Schmid — Inria, ENS, PSL
Juho Kim — KAIST School of Computing
Dima Damen — University of Bristol

Talks (4)

00:00:00 — Ivan Laptev: Embodied agents and learning from procedural videos
- Discusses the limitations of current multimodal models (like GPT-4o) in embodied understanding and physical reasoning, proposing that learning from procedural videos is key to developing embodied agents that can perform complex tasks and predict state changes in the real world.
00:34:35 — Cordelia Schmid: Multimodal foundational models: MM video understanding & vision-language guided robotics
- Explores multimodal video representation for dense video captioning and video question answering, emphasizing the importance of large-scale cross-modal supervision and modular reasoning for vision-language guided robotics.
00:58:11 — Juho Kim: Enhancing Human Interaction with Procedural Videos
- Explores interaction-centric research to improve how people interact with procedural videos, focusing on learnersourcing for collective knowledge construction, multimodal user interaction, task representation, and capturing tacit knowledge.
01:30:15 — Dima Damen: Procedural Videos from the Fine to the Coarse
- Discusses the need for fine-grained understanding, context, and global understanding in procedural videos, presenting work on skill determination, action modifiers (learning from adverbs), Ego-Exo4D annotations, repetition counting, cross-modal video retrieval, and audio-visual action recognition.

Key Takeaways

Current multimodal models, while powerful, often lack deep embodied understanding and physical reasoning crucial for real-world robotic applications.
Learning from procedural videos, especially through large-scale cross-modal supervision, is a promising avenue for developing intelligent agents that can understand and execute complex tasks.
Modular reasoning and multi-stage approaches are essential for tackling challenging video question answering tasks that require long temporal context and grounded execution.
Human-AI collaboration, particularly through learnersourcing and multimodal interfaces, can enhance both human learning from videos and AI’s ability to capture tacit knowledge and improve interaction.
Future research needs to focus on generating more diverse and long-range datasets, improving efficiency in processing long videos, and developing models that can effectively represent and infer tacit knowledge for robust real-world interaction.

Methods / Models / Datasets Mentioned

GPT-4o
ChangeIt dataset
GenHowTo
ViViDex
DexYCB
HowTo100M
Vid2Seq
CLIP
How2QA
MovieQA
ActivityNet-QA
EgoSchema
CinePile
NExT-QA
TVQA
LVU
MovieChat
TimeChat
SFD
MoReVQA
JCEF
ViperGPT
Crowdy
RubySlippers
ExpressEdit
BLIP-2
InternVideo
SAM
SURCH
SoftVideo
RepCount
Countix
UCFRep
ConTra
TIM
EPIC-KITCHENS
Mformer-HR
MoViNet-A6
MeMViT
Omnivore
LaViLa (TSF-L)
AVION (ViT-L)
TBN
MBT
MTCN
M&M
MLP

Topics

Multimodal AI · Video Understanding · Vision-Language Guided Robotics · Embodied Agents · Procedural Videos · Dense Video Captioning · Video Question Answering · Human-Robot Interaction · Tacit Knowledge

Notes

Open for commentary — connections to other work, critiques, follow-up reading.