Diffusion-based Video Generative Models

Event: CVPR 2024 Tutorial · Duration: 169 min

Abstract

This tutorial surveys diffusion-based video generative models, covering their fundamental principles, generation techniques, and evaluation. It first explains the forward (noising) and reverse (denoising) processes of DDPM and DDIM, then turns to efficient generation with Latent Diffusion and the role of CLIP in multimodal generation. It next reviews the landscape of video generation models, from pioneering works to recent advances in long-video and multimodal generation. It closes with the challenges of evaluating these models, examining quantitative and qualitative metrics and discussing fairness, toxicity, and content moderation strategies for responsible AI development.

Speakers

Mike Shou — National University of Singapore (NUS)
Deepti Ghadiyaram — Google

Talks (4)

00:00:00 — Mike Shou: Outline of “Tutorial: Video Diffusion Models”
- Introduces the tutorial’s structure, covering fundamentals, video generation, video editing, and evaluation & safety, with an emphasis on making the initial sections accessible.
00:50:00 — Mike Shou: Fundamentals of Diffusion Models
- Explains Denoising Diffusion Probabilistic Models (DDPM) and their forward and reverse processes, introduces Denoising Diffusion Implicit Models (DDIM) for faster generation, and discusses DDIM Inversion, CLIP for bridging vision and language, and Latent Diffusion for efficiency.
01:24:45 — Mike Shou: Video Generation
- Provides a comprehensive overview of the video generation landscape, detailing various models and techniques including 3D convolutions, cascaded generation, and different approaches for long and multimodal video generation.
03:40:10 — Deepti Ghadiyaram: Evaluation & Safety
- Discusses challenges in evaluating generative models, covering quantitative metrics like Inception Score, FVD, and CLIPScore, qualitative aspects such as visual quality and realism, and critical issues like fairness, toxicity, and content moderation strategies.
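
As a rough illustration of the DDPM forward process and the DDIM update named in the fundamentals talk, the sketch below uses plain Python lists in place of image or video tensors. It is not the speakers' code. The function names are hypothetical, the linear schedule (1000 steps, beta from 1e-4 to 0.02) is an assumed default, and the true noise stands in for a trained noise-prediction network.

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_noise(x0, alpha_bar_t, eps):
    """Closed-form forward process: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_target):
    """One deterministic DDIM update (eta = 0) from noise level t to a target level."""
    a_t, b_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - b_t * e) / a_t for x, e in zip(x_t, eps_hat)]
    a_s, b_s = math.sqrt(alpha_bar_target), math.sqrt(1.0 - alpha_bar_target)
    # Re-noise that estimate to the target level using the same predicted noise.
    return [a_s * x0 + b_s * e for x0, e in zip(x0_pred, eps_hat)]

rng = random.Random(0)
alpha_bars = linear_alpha_bars()
x0 = [0.5, -1.0, 0.25]                      # toy "clean sample"
eps = [rng.gauss(0.0, 1.0) for _ in x0]     # true noise, used as an oracle prediction
x_t = forward_noise(x0, alpha_bars[499], eps)
# Stepping to alpha_bar = 1 (no noise) with the oracle noise recovers x0.
x0_rec = ddim_step(x_t, eps, alpha_bars[499], 1.0)
```

Because the update is deterministic, calling `ddim_step` with a noisier target level gives the DDIM Inversion direction, under the usual approximation that the predicted noise changes little between neighboring steps.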

Key Takeaways

Diffusion models are advancing rapidly: DDIM speeds up sampling, and Latent Diffusion runs the diffusion process in a compressed latent space, making high-quality image and video generation more efficient.
Video generation is a complex task, often tackled through cascaded approaches, 3D convolutions, and specialized architectures to maintain temporal consistency and realism.
Evaluating generative models is challenging due to the lack of standardized, robust metrics and the subjective nature of ‘quality,’ necessitating a holistic approach beyond traditional quantitative scores.
Addressing ethical concerns like fairness, bias, and content moderation is crucial, as generative AI systems can perpetuate stereotypes or create harmful content, requiring proactive mitigation strategies.
The field is actively developing solutions for these challenges, including new benchmarks like EvalCrafter, techniques for ‘safe diffusion,’ and methods to erase unwanted concepts from models, indicating a strong focus on responsible innovation.
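
To make one of the quantitative scores above concrete, here is a minimal sketch of CLIPScore as defined in the original CLIPScore paper: a rescaled cosine similarity between an image embedding and a text embedding, clipped at zero, with weight w = 2.5. The embeddings are toy vectors rather than outputs of a real CLIP model, the function names are hypothetical, and averaging per-frame scores is only one assumed way to extend the metric to video.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore: w * max(cos(image, text), 0)."""
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)

def video_clip_score(frame_embs, text_emb):
    """Assumed video extension: mean of the per-frame scores."""
    return sum(clip_score(f, text_emb) for f in frame_embs) / len(frame_embs)

# Toy embeddings standing in for CLIP outputs (illustrative values only).
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
score = video_clip_score(frames, text)
```

A score like this measures text-video alignment only. It says nothing about temporal consistency or visual quality, which is one reason a single number is not enough.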

Methods / Models / Datasets Mentioned

DDPM
DDIM
CLIP
Latent Diffusion
Stable Diffusion
LoRA
DreamBooth
ControlNet
Make-A-Video
Imagen Video
Align Your Latents
EvalCrafter
Show-1
VideoCrafter
ModelScopeT2V
Lumiere
DiT
GenTron
W.A.L.T.
Snap Video
Sora
NUWA-XL
VideoPoet
Inception Score
FVD
CLIPScore
DFT
Safe Diffusion
Groot
MAGVIT-v2
SoundStream
GLIGEN
MCDiff
AADiff
MM-Diffusion
CoDi
LFDM
Generative Dynamics
MinD-Video
LVDM
VideoGen
VidRD
NExT-GPT
Generative Disco
VideoFusion
Latent-Shift
AnimateDiff
DSDN
MagicVideo
SimDA
VideoDirectorGPT
LLM-Grounded VDM
VisorGPT
DirecT2V
Free-Bloom
Dysen-VDM
Inception-v3
ImageNet
CIFAR-10
WebVid-10M
HD-VILA-100M
Vimeo25M
UCF-101
MSR-VTT
FlintstonesHD
SynthID
RingID
HELM
Text2Video-Zero
PYoCo
MotionCtrl

Topics

Diffusion Models · Video Generation · DDPM · DDIM · Latent Diffusion · CLIP · Generative AI Evaluation · Fairness in AI · Content Moderation · Digital Watermarks
