Multimodal Foundational Models: MM Video Understanding & Vision-Language Guided Robotics
Event: CVPR 2024 Workshop · Duration: 128 min
Abstract
The CVPR 2024 Workshop on Multimodal Foundational Models examined the intersection of multimodal (MM) video understanding and vision-language guided robotics. Speakers highlighted how current large language models fall short of true embodied understanding and physical reasoning, particularly in complex real-world scenarios. The discussions emphasized learning from procedural videos as a way to train agents that can perform intricate tasks and predict state transitions. The research presented covered dense video captioning, modular reasoning for video question answering, and approaches to enhancing human-robot interaction through multimodal interfaces. A recurring theme was the importance of large datasets, from instructional videos to movies, for building AI’s ability to comprehend and interact with the physical world across temporal and semantic scales.
Speakers
- Ivan Laptev — Inria, ENS, PSL
- Cordelia Schmid — Inria, ENS, PSL
- Juho Kim — KAIST School of Computing
- Dima Damen — University of Bristol
Talks (4)
- 00:00:00 — Ivan Laptev: Embodied agents and learning from procedural videos
- Discusses the limitations of current multimodal models (like GPT-4o) in embodied understanding and physical reasoning, proposing that learning from procedural videos is key to developing embodied agents that can perform complex tasks and predict state changes in the real world.
- 00:34:35 — Cordelia Schmid: Multimodal foundational models: MM video understanding & vision-language guided robotics
- Explores multimodal video representation for dense video captioning and video question answering, emphasizing the importance of large-scale cross-modal supervision and modular reasoning for vision-language guided robotics.
- 00:58:11 — Juho Kim: Enhancing Human Interaction with Procedural Videos
- Explores interaction-centric research to improve how people interact with procedural videos, focusing on learnersourcing for collective knowledge construction, multimodal user interaction, task representation, and capturing tacit knowledge.
- 01:30:15 — Dima Damen: Procedural Videos from the Fine to the Coarse
- Discusses the need for fine-grained understanding, context, and global understanding in procedural videos, presenting work on skill determination, action modifiers (learning from adverbs), Ego-Exo4D annotations, repetition counting, cross-modal video retrieval, and audio-visual action recognition.
Key Takeaways
- Current multimodal models, while powerful, often lack deep embodied understanding and physical reasoning crucial for real-world robotic applications.
- Learning from procedural videos, especially through large-scale cross-modal supervision, is a promising avenue for developing intelligent agents that can understand and execute complex tasks.
- Modular reasoning and multi-stage approaches are essential for tackling challenging video question answering tasks that require long temporal context and grounded execution.
- Human-AI collaboration, particularly through learnersourcing and multimodal interfaces, can enhance both human learning from videos and AI’s ability to capture tacit knowledge and improve interaction.
- Future research needs to focus on generating more diverse and long-range datasets, improving efficiency in processing long videos, and developing models that can effectively represent and infer tacit knowledge for robust real-world interaction.
Methods / Models / Datasets Mentioned
GPT-4o · ChangeIt dataset · GenHowTo · ViViDex · DexYCB · HowTo100M · Vid2Seq · CLIP · How2QA · MovieQA · ActivityNet-QA · EgoSchema · CinePile · NEXT-QA · TVQA · LVU · MovieChat · TimeChat · SFD · MoRevQA · JCEF · ViperGPT · Crowdy · RubySlippers · ExpressEdit · BLIP2 · InternVideo · SAM · SURCH · SoftVideo · RepCount · Countix · UCFRep · ConTra · TIM · EPIC-KITCHENS · MFormer-HR · MoViNet-A6 · MeMViT · Omnivore · LaViLa (TSF-L) · AVION (ViT-L) · TBN · MBT · MTCN · M&M · MLP
Topics
Multimodal AI · Video Understanding · Vision-Language Guided Robotics · Embodied Agents · Procedural Videos · Dense Video Captioning · Video Question Answering · Human-Robot Interaction · Tacit Knowledge
Notes
Open for commentary — connections to other work, critiques, follow-up reading.