Unsolved problems in video understanding

Event: CVPR 2023 · Duration: 41 min

Abstract

The presentation begins by reviewing the progress computer vision has made over the last decade across Recognition, Reconstruction, and Reorganization, especially for short-term video understanding (1-3 seconds), and showcases strong results in 3D human reconstruction and atomic action recognition. The talk then turns to the ‘unsolved problem’ of long-range video understanding and argues that token-based large language models (LLMs) alone are insufficient for it. The speaker introduces the EgoSchema dataset, a diagnostic benchmark for evaluating understanding of complex, long-duration egocentric videos, and demonstrates a substantial gap between human performance and state-of-the-art models. The talk concludes by advocating an approach that integrates visual understanding (movement) with linguistic understanding (goals and intentionality), while acknowledging how much greater the data complexity of the 4D visual world is than that of text.

Speakers

Jitendra Malik — UC Berkeley, FAIR @ Meta

Talks (1)

00:00:00 — Jitendra Malik: Unsolved problems in video understanding
- This talk covers the current state of short-term video understanding through the ‘3Rs of Vision’ (Recognition, Reconstruction, Reorganization), then turns to the challenges and future directions for long-range video understanding, with emphasis on the interplay between visual data and language models.

Key Takeaways

Significant progress has been made in short-term video understanding (1-3 seconds) for 3D human reconstruction, tracking, and atomic action recognition, achieving performance levels comparable to object detection.
Long-range video understanding, which involves complex human behaviors, goals, and intentionality over minutes or hours, remains a largely unsolved problem.
The EgoSchema dataset, which pairs 3-minute egocentric video clips with manually curated question-answer pairs, exposes a large gap between human accuracy (76%) and state-of-the-art vision-language models (<35%).
Simply scaling token-based Large Language Models (LLMs) is unlikely to be the complete answer for long-range video understanding. The data complexity is fundamentally different (visual data has 500-1000x more ‘tokens’ than text for the same narrative), and tokens cannot fully capture the essence of the 4D world.
A holistic approach to video understanding requires integrating both visual data (for movement and spatiotemporal details) and language (for high-level goals, intentionality, and plans), recognizing the distinct roles of exteroception and proprioception in building mental models of the world.
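
The 500-1000x figure in the takeaways above is a ratio of token counts, and a rough back-of-the-envelope version can be sketched in Python. Everything here is an assumption made for illustration, not the speaker's own calculation: the function names are hypothetical, and the frame rate, patch grid (224x224 frames cut into 16x16 patches), narration speed, and tokens-per-word factor are placeholder values. Only the 3-minute clip length comes from the EgoSchema description.

```python
def visual_tokens(duration_s: float, fps: float, patches_per_frame: int) -> int:
    """Patch tokens produced if every sampled frame is split into patches."""
    return round(duration_s * fps) * patches_per_frame


def text_tokens(duration_s: float, words_per_minute: float, tokens_per_word: float) -> int:
    """Text tokens in a spoken-style narration covering the same clip."""
    return round(duration_s / 60 * words_per_minute * tokens_per_word)


def token_ratio(duration_s: float, fps: float, patches_per_frame: int,
                words_per_minute: float, tokens_per_word: float) -> float:
    """How many visual tokens there are per text token for one clip."""
    v = visual_tokens(duration_s, fps, patches_per_frame)
    t = text_tokens(duration_s, words_per_minute, tokens_per_word)
    return v / t


# Assumed, illustrative inputs (not taken from the talk).
clip_s = 180                        # 3-minute clip, as in EgoSchema
patches = (224 // 16) ** 2          # 14 x 14 = 196 patches per frame
ratio = token_ratio(clip_s, fps=10, patches_per_frame=patches,
                    words_per_minute=150, tokens_per_word=1.3)
print(f"visual/text token ratio: about {ratio:.0f}x")
```

The result is very sensitive to the assumed frame rate: dropping `fps` from 10 to 1 cuts the ratio by a factor of ten. The sketch therefore shows where the order of magnitude comes from and does not reproduce the talk's number.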

Methods / Models / Datasets Mentioned

Transformers
Hiera
LART
SlowFast
ACAR-Net
MViTv2-L
VideoMAE
FrozenBiLM
VIOLET
mPLUG-Owl
InternVideo
GPT-3
GPT-4
Llama-2

Topics

Video Understanding · 3D Human Reconstruction · Action Recognition · Temporal Certificates · Long-range Video Understanding · Egocentric Video · EgoSchema Dataset · Vision-Language Models · Mental Models · Exteroception · Proprioception
