Multi-stage reasoning for video understanding & scene generation

Event: CVPR 2024 · Duration: 26 min

Abstract

The presentation covers two areas: video question answering (VideoQA) with explicit reasoning, and 3D scene synthesis using LLM-generated Blender code. For VideoQA, it proposes MoReVQA, a multi-stage modular reasoning model that decomposes the task into event parsing, grounding, and reasoning stages, with a memory shared between the stages. This design is more interpretable than end-to-end models and consistently improves accuracy over both end-to-end models and simpler baselines on several video datasets. For 3D scene generation, SceneCraft is an LLM agent that synthesizes 3D scenes from text descriptions by generating Blender code through an iterative planning and self-refinement loop. The method gives fine-grained control over scene elements and their relationships, outperforms the BlenderGPT baseline, and shows potential for guiding video generation.
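
The self-refinement loop described for SceneCraft can be illustrated with a minimal sketch. This is not the SceneCraft implementation: the Constraint class, refine_scene, and toy_proposer are hypothetical names, the "LLM" is a deterministic stand-in, and a numeric height check replaces rendering the scene and critiquing the result. Only the propose, check, and re-propose control flow is the point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float, float]
Scene = Dict[str, Position]


@dataclass
class Constraint:
    """Hypothetical spatial relation: `upper` must sit above `lower`."""
    upper: str
    lower: str

    def violated(self, scene: Scene) -> bool:
        return scene[self.upper][2] <= scene[self.lower][2]


def refine_scene(
    propose: Callable[[Scene, List[str]], Scene],
    constraints: List[Constraint],
    initial: Scene,
    max_rounds: int = 5,
) -> Tuple[Scene, int]:
    """Check the scene, feed violations back to the proposer, and repeat."""
    scene = dict(initial)
    for round_idx in range(max_rounds):
        feedback = [
            f"{c.upper} must be above {c.lower}"
            for c in constraints
            if c.violated(scene)
        ]
        if not feedback:
            return scene, round_idx  # all constraints satisfied
        scene = propose(scene, feedback)
    return scene, max_rounds  # budget exhausted, return best effort


def toy_proposer(scene: Scene, feedback: List[str]) -> Scene:
    """Stand-in for the LLM (assumption): raise each offending object by 1."""
    updated = dict(scene)
    for message in feedback:
        name = message.split(" must be above ")[0]
        x, y, z = updated[name]
        updated[name] = (x, y, z + 1.0)
    return updated


initial_scene = {"table": (0.0, 0.0, 1.0), "lamp": (0.0, 0.0, 0.0)}
final_scene, rounds_used = refine_scene(
    toy_proposer, [Constraint("lamp", "table")], initial_scene
)
```

In a real agent, the proposer would emit a Blender script and the feedback would come from inspecting the rendered scene. The loop structure and the round budget would stay the same.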

Speakers

  • Cordelia Schmid — Google DeepMind

Talks (1)

  • 00:00:00 — Cordelia Schmid: Multi-stage reasoning for video understanding & scene generation
    • This talk explores multi-stage reasoning for video understanding and 3D scene generation, leveraging LLMs with tools and memory for improved interpretability and performance.

Key Takeaways

  • Modular, multi-stage reasoning models (like MoReVQA) offer significant advantages in interpretability and performance for complex tasks like VideoQA compared to black-box end-to-end models.
  • A simple baseline (JCEF) that captions every frame and uses an LLM for reasoning can surprisingly outperform more complex program-generating models like ViperGPT in VideoQA.
  • LLM agents can effectively synthesize high-quality 3D scenes from text descriptions by generating and iteratively refining Blender code, providing fine-grained control over scene elements and spatial relationships.
  • Using generated 3D scenes as fine-grained control signals can significantly improve text-to-video generation, as shown by FVD scores on the Sintel movie dataset.
  • Future work involves leveraging video understanding to learn stories, generate scripts, and then synthesize 3D scenes and videos, creating a comprehensive pipeline for content generation.
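
The first takeaway, a pipeline split into event parsing, grounding, and reasoning stages that communicate through a shared memory, can be sketched roughly as follows. This is an illustrative toy, not the MoReVQA code: the SharedMemory layout and all function names are assumptions, and keyword matching over hand-written captions stands in for the LLM and vision models. The trace field shows where the interpretability comes from, since each stage records what it decided.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SharedMemory:
    """State every stage can read and extend (hypothetical layout)."""
    question: str
    events: List[str] = field(default_factory=list)
    frame_ids: List[int] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)
    answer: str = ""


def parse_events(memory: SharedMemory) -> None:
    """Stage 1: pull candidate event keywords out of the question."""
    stopwords = {"what", "does", "the", "do", "after", "a", "is"}
    words = memory.question.lower().rstrip("?").split()
    memory.events = [w for w in words if w not in stopwords]
    memory.trace.append(f"events={memory.events}")


def ground_events(memory: SharedMemory, captions: Dict[int, str]) -> None:
    """Stage 2: keep frames whose caption mentions any parsed event."""
    memory.frame_ids = [
        idx
        for idx, text in sorted(captions.items())
        if any(event in text.lower() for event in memory.events)
    ]
    memory.trace.append(f"frames={memory.frame_ids}")


def reason(memory: SharedMemory, captions: Dict[int, str]) -> None:
    """Stage 3: answer from grounded frames only (toy rule, not an LLM)."""
    if memory.frame_ids:
        memory.answer = captions[memory.frame_ids[-1]]
    else:
        memory.answer = "unknown"
    memory.trace.append(f"answer={memory.answer}")


def answer_question(question: str, captions: Dict[int, str]) -> SharedMemory:
    """Run the three stages in order over one shared memory object."""
    memory = SharedMemory(question=question)
    parse_events(memory)
    ground_events(memory, captions)
    reason(memory, captions)
    return memory


captions = {
    0: "a man opens a door",
    1: "a dog runs outside",
    2: "the dog catches a ball",
}
result = answer_question("What does the dog do after running?", captions)
```

Because every stage writes to the same memory object, a wrong answer can be traced back to the stage that introduced the error by reading result.trace.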

Methods / Models / Datasets Mentioned

  • PaLI-{X,3}
  • BLIP
  • Flamingo
  • Gemini
  • ViperGPT
  • VisProg
  • CodeVQA
  • JCEF
  • MoReVQA
  • NExT-QA
  • iVQA
  • EgoSchema
  • ActivityNet-QA
  • SceneCraft
  • BlenderGPT
  • CLIP SIM
  • FVD
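
For readers unfamiliar with the last metric in this list: FVD (Fréchet Video Distance) is commonly defined as the Fréchet distance between two Gaussians fitted to features of real and generated videos, with the features taken from a pretrained video network. The symbols below are notation chosen for this note, not taken from the talk: the mean and covariance of the real-video features are written with subscript r, and those of the generated-video features with subscript g. Lower values indicate generated videos whose feature statistics are closer to the real ones.

```latex
\mathrm{FVD}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
      - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```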

Topics

Video Question Answering (VideoQA) · Multi-stage reasoning · Modular models · LLM tool-use · Memory augmentation · 3D scene generation · Blender code generation · Self-refinement loop · Video generation control · Interpretability

