Video Foundation Models: From Black Boxes to Controllable Representations

Event: CVPR 2024 Workshop · Duration: 387 min

Abstract

Tali Dekel opens with the rapid progress of generative AI for video, showing examples of AI-generated clips. She then identifies three limitations of current video foundation models: high computational cost, access limited to large industry players, and insufficient fine-grained control over generated content. As a flexible and accessible alternative, she introduces single-video models and demonstrates them on tasks such as object removal and style transfer. The core proposal is to combine the two approaches: distill the priors learned by large-scale “black box” models and integrate them into tailored single-video models, enabling more controllable and efficient video synthesis.

Two talks follow. Tim Brooks (OpenAI) presents Sora, a video generation model, demonstrating realistic and stylized videos with consistent elements and discussing its long-term potential for world simulation. Diyi Yang (Stanford University) then introduces Design2Code, a project aimed at automating front-end engineering by translating visual designs into code. She covers the challenges of building a robust benchmark, proposes new evaluation metrics, and analyzes how various multimodal models perform on the task.

The next segment consists of a series of oral presentations and a keynote. The orals cover advances in AI for content creation, including visual style prompting in diffusion models, 3D shape synthesis and editing, and methods for immunizing AI models against malicious adaptation. The keynote addresses the philosophical question of whether computers can create art, tracing the historical interplay between technology and artistic expression. The segment concludes with a presentation on multi-modal generative AI using foundation models for creating avatars, objects, and scenes.

Robin Rombach then covers the progress and future of diffusion models for content creation. He highlights the rapid advances in image and video generation with examples from recent work such as Sora and RunwayML’s Gen-3-alpha. The talk emphasizes scaling laws and efficient training algorithms, introducing Flow Matching and Rectified Flow as promising techniques. It also explores distillation methods for faster inference and architectural considerations for multimodal training, and closes with the potential of these models to change content creation.

The final segment is a panel discussion titled “Surviving (and Thriving) in GenAI Industry”. Robin Rombach (Stability AI), Jingwan (Cynthia) Lu (Adobe), and Rohit Girdhar (Meta) share their impressions of the rapid advances in generative AI, including both the excitement and the challenges of the field. Key themes are the critical role of data quality and quantity, ethical questions around copyright and artist attribution, and the difficulty of turning AI research into product applications. The panelists also discuss essential skills for people entering the field and the evolving landscape of AI development.

Speakers

Tali Dekel — Weizmann Institute of Science
Tim Brooks — OpenAI
Diyi Yang — Stanford University
Jaeseok Jeong — Yonsei University
Nam Anh Dinh — TTIC
Amber Yijia Zheng — Purdue University
Alper Canberk — Columbia University
Aaron Hertzmann — Adobe Research
Ziwei Liu — Nanyang Technological University
Robin Rombach — Stability AI
Jingwan (Cynthia) Lu — Head of Applied Research, GenAI, Adobe (Firefly) Modeling, Adobe
Rohit Girdhar — Research Scientist, GenAI, Meta

Talks (11)

00:00:00 — Tali Dekel: Video Foundation Models: From Black Boxes to Controllable Representations
- This talk explores the current state and limitations of video foundation models, introduces single-video models as an alternative, and proposes combining their strengths by distilling learned priors from large-scale models for fine-grained control and efficient video editing.
01:17:28 — Tim Brooks: Sora: Video Generation Models as World Simulators
- Tim Brooks from OpenAI presents Sora, a video generation model, showcasing its capabilities in generating realistic and stylized videos, maintaining consistent characters and 3D properties, and discussing its potential for artistic tools and future world simulation.
01:41:40 — Diyi Yang: Design2Code: How Far Are We From Automating Front-End Engineering?
- Diyi Yang from Stanford University introduces Design2Code, a project focused on automating front-end engineering by converting visual designs into functional code, detailing benchmark creation, evaluation metrics, and model performance.
02:35:16 — Jaeseok Jeong: Visual Style Prompting with Swapping Self-Attention
- Presents a method for visual style prompting in text-to-image diffusion models using swapping self-attention, allowing for better style reflection and content preservation compared to existing methods.
02:40:16 — Nam Anh Dinh: LoopDraw: a Loop-Based Autoregressive Model for Shape Synthesis and Editing
- Introduces LoopDraw, a novel loop-based autoregressive model for 3D shape synthesis and editing, enabling intuitive and non-local geometric effects through loop manipulation.
02:45:16 — Amber Yijia Zheng: Towards Safer AI Content Creation by Immunizing Text-to-image Models
- Addresses the issue of harmful concept re-learning in open-sourced AI models by proposing IMMA, a method to immunize text-to-image models against malicious adaptation, making it harder to re-learn erased concepts.
02:50:16 — Alper Canberk: EraseDraw: Learning to Draw Step-by-Step by Erasing Objects from Images
- Introduces EraseDraw, a method for iterative image generation by learning to insert objects through an ‘erasing’ paradigm, leveraging autonomous data generation and beam search for complex scene creation.
02:55:26 — Aaron Hertzmann: Can Computers Create Art?
- Explores the historical relationship between technology and art, arguing that AI is a tool for human artists, not an artist itself, and that new technologies often lead to new ways of making and understanding art.
03:18:01 — Ziwei Liu: Multi-Modal Generative AI with Foundation Models
- Discusses the rapid advancements in AI-generated content, particularly in multi-modal generative AI with foundation models, focusing on applications in avatars, objects, and scenes, and highlighting the potential for AI to democratize creativity.
03:52:32 — Robin Rombach: Diffusion, Distillation, Done?
- This talk explores the current state and future directions of diffusion models, focusing on scaling, efficient training, and distillation techniques for high-resolution image synthesis and multimodal content creation.
05:10:20 — Panel (Robin Rombach, Jingwan (Cynthia) Lu, Rohit Girdhar): Surviving (and Thriving) in GenAI Industry
- A panel discussion on the rapid progress of GenAI, the role of data, ethical considerations like copyright, the challenges of moving research to product, and critical skills for AI professionals.
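
As a rough illustration of the swapping self-attention idea from Jaeseok Jeong's talk listed above, the sketch below computes attention with the queries of the image being generated and the keys and values of a reference style image. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the dictionary layout with "q", "k", and "v" entries are hypothetical, features are plain Python lists, and the choice of which layers to swap is not modeled.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def swapped_self_attention(content, reference):
    """Hypothetical helper: keep the content queries, but attend over the
    reference image's keys and values instead of the content's own."""
    return attention(content["q"], reference["k"], reference["v"])
```

Because every output row is a convex combination of the reference values, the result has one row per content query (the layout follows the content) while its entries are drawn from the reference features (the appearance follows the style image).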

Key Takeaways

Current video foundation models face challenges related to high computational costs, limited accessibility, and lack of fine-grained control, despite impressive generative capabilities.
Single-video models offer flexibility and accessibility, allowing for specific video editing tasks by overfitting a neural framework to a single test video.
Combining the strengths of universal video foundation models (space-time priors) and single-video models (accessibility, explicit control) is a promising direction for advanced video synthesis.
Distilling learned priors from large-scale “black box” models and incorporating them into lightweight, specialized models can enable more controllable and efficient video generation and editing.
Sora demonstrates impressive capabilities in video generation, including consistent characters, 3D understanding, and diverse styles, with potential for world simulation through scalable transformer architectures.
Automating front-end engineering via Design2Code involves converting visual designs into code, requiring robust benchmarks and nuanced evaluation metrics beyond simple code similarity.
Multimodal models like GPT-4o show strong performance in design-to-code tasks, with fine-tuning and advanced prompting techniques further enhancing their capabilities.
Human judgment in evaluating AI-generated web pages can be influenced by factors beyond direct code similarity, highlighting the need for comprehensive evaluation frameworks that consider visual fidelity, layout, and user experience.
AI models can achieve precise visual style transfer and content preservation through techniques like swapping self-attention.
Novel 3D shape representations, such as loop-based models, enable intuitive and large-scale geometric editing and synthesis.
Protecting AI models from malicious adaptation is crucial, and methods like IMMA can immunize models against re-learning harmful concepts.
AI is a powerful technological tool that transforms how art is made and understood, rather than being an artist itself, leading to new art forms and creative possibilities.
Diffusion models have made significant progress in generating high-quality images and videos, with recent examples showcasing impressive capabilities.
Scaling laws are crucial for improving model performance, and efficient training algorithms are necessary to manage the computational cost.
Flow Matching and Rectified Flow offer simpler and more efficient alternatives to traditional diffusion formalisms, particularly when combined with optimized sampling strategies.
Multimodal architectures, like MMDIT, are being developed to integrate different data types (e.g., text and image) for more versatile content creation.
The GenAI field is experiencing rapid progress, creating both excitement and challenges for researchers and industry.
Data quality and responsible data sourcing are paramount, especially when moving from research to product, with initiatives like the Content Authenticity Initiative (CAI) aiming to address provenance.
Ethical considerations, particularly regarding copyright infringement and the unique styles of artists, require active research and industry collaboration to develop mitigation strategies like royalty sharing and content provenance tracking.
Critical skills for success in GenAI include creativity, strong engineering fundamentals, teamwork, and effective communication, alongside a deep understanding of data processing and system-level optimization.
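
To make the Flow Matching and Rectified Flow takeaway above more concrete, here is a minimal sketch of the straight-line formulation with a toy velocity function standing in for a neural network. The assumptions are mine, not the talk's: noise is placed at t = 0 and data at t = 1, timesteps are drawn from a logit-normal distribution (listed under methods below), samples are short Python lists, and all function names are hypothetical.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw a timestep in (0, 1) by passing a Gaussian sample through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def interpolate(x0, x1, t):
    """Straight path from noise x0 to data x1, plus its constant velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t, target = interpolate(x0, x1, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_sample(predict_velocity, x0, steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

In training, `predict_velocity` would be a learned model and the loss would be averaged over many noise, data, and timestep draws. The sketch also shows why straight paths suit few-step sampling: when the velocity is constant along the path, a handful of Euler steps already lands on the endpoint.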

Methods / Models / Datasets Mentioned

Autoregressive models
CLIP
CLIP score
CSS
CUDA
ChatGPT
Claude 3 Opus
CogAgent-18B
Control-A-Video
ControlNet
CrossDIT
DALL-E
DB-LoRA
DDIM Inversion
DDPM
Deepseek-VL-7B
Diffusion models
DreamBooth
EDM
Emu Video (arXiv'23)
FashionEngine
Firefly Design Model
Firefly Image 3 Model
Firefly Vector Model
Flow Matching
GANs
GPT-4V
GPT-4o
GPT-3
Gemini 1.0 Pro Vision
Gen-3-alpha
Gen1
HTML
HyperHuman
IMMA
IP-Adapter
Idefics2-8B
Illustrated Instructions (CVPR'24)
ImageBind (CVPR'23)
InstructPix2Pix
JavaScript
LLAMA-3 (2024, WIP)
LLaVA 1.6-7B
LaVila (CVPR'23)
Layered Neural Atlases
LoRA
Logit-Normal Distributions
MMDIT Block
MagicBrush
MasaCtrl
MoCA (arXiv'23)
MultiDiffusion
NeRF
Network Fusion for Content Creation with Conditional INNs
Omnivore (CVPR'22)
Plug & Play Diffusion Features
Poisson reconstruction
PrimDiffusion
Rectified Flow
SDEdit
SMM Features
Scalable Diffusion Models with Transformers
SceneScape
Sora
Stable Diffusion
StructLDM
Structure Reference
Style Reference
StyleAligned
StyleDrop
Text-to-image models
Text-to-video models
Text2LIVE
TokenFlow
Tune-a-Video
UVIIT
VQGAN
WebSight VLM-8B
ZeroScope
sketch-rnn

Topics

3D Shape Synthesis · AI Safety · AI career skills · AI for Content Creation · Art and Technology · Content Creation · Copyright in AI · Data in AI · Design2Code · Diffusion Models · Distillation · Ethical AI · Evaluation metrics · Flow Matching · Foundation Models · Front-end engineering automation · GenAI progress · Generative AI · Image Editing · Image Generation · Large Language Models (LLMs) · Motion Transfer · Multimodal AI · Multimodal Large Language Models (MLLMs) · Multimodal Training · Rectified Flow · Research to product transition · Scaling Laws · Single-Video Models · Text-to-Image Diffusion Models · Text-to-Video · Video Editing · Video Foundation Models · Video Generation · World simulation
