Diffusion-based Video Generative Models

Event: CVPR 2024 Tutorial · Duration: 169 min

Abstract

This tutorial provides a comprehensive overview of diffusion-based video generative models, covering their fundamental principles, advanced generation techniques, and critical evaluation aspects. It begins by explaining the core concepts of DDPM and DDIM, highlighting their forward and reverse processes, and then delves into efficient generation methods like Latent Diffusion and the role of CLIP in multimodal generation. The tutorial then explores the diverse landscape of video generation models, from pioneering works to recent advancements in long video and multimodal generation. Finally, it addresses the significant challenges in evaluating these models, examining various quantitative and qualitative metrics, and discussing crucial considerations such as fairness, toxicity, and content moderation strategies to ensure responsible AI development.
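As a rough illustration of the DDPM forward process and a deterministic DDIM reverse step named in the abstract, here is a minimal pure-Python sketch. The linear beta schedule, its endpoint values, and all function names are assumptions chosen for illustration, not taken from the tutorial. A real model would replace the known noise `eps` with a network's prediction.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (endpoint values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, eps)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


def ddim_step(x_t, eps_pred, t, t_prev, alpha_bars):
    """One deterministic DDIM update (eta = 0) from step t to t_prev.

    t_prev = -1 denotes the fully denoised sample (alpha_bar = 1).
    """
    a_t = alpha_bars[t]
    a_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the earlier step with the same noise direction.
    return [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
x_t, eps = forward_noise(x0, 500, alpha_bars, rng)
# With the true noise as the "prediction", one jump recovers x0.
recovered = ddim_step(x_t, eps, 500, -1, alpha_bars)
```

Because the DDIM update is deterministic, it can skip many steps at once. That is the source of the faster sampling attributed to DDIM, and it is also what makes an inversion of the process possible.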

Speakers

  • Mike Shou — National University of Singapore (NUS)
  • Deepti Ghadiyaram — Google

Talks (4)

  • 00:00:00 — Mike Shou: Outline of “Tutorial: Video Diffusion Models”
    • Introduces the tutorial’s structure, covering fundamentals, video generation, video editing, and evaluation & safety, with an emphasis on making the initial sections accessible.
  • 00:50:00 — Mike Shou: Fundamentals of Diffusion Models
    • Explains Denoising Diffusion Probabilistic Models (DDPM) with its forward and reverse processes, introduces Denoising Diffusion Implicit Models (DDIM) for faster generation, discusses DDIM Inversion, CLIP for bridging vision and language, and Latent Diffusion for efficiency.
  • 01:24:45 — Mike Shou: Video Generation
    • Provides a comprehensive overview of the video generation landscape, detailing various models and techniques including 3D convolutions, cascaded generation, and different approaches for long and multimodal video generation.
  • 03:40:10 — Deepti Ghadiyaram: Evaluation & Safety
    • Discusses challenges in evaluating generative models, covering quantitative metrics like Inception Score, FVD, and CLIPScore, qualitative aspects such as visual quality and realism, and critical issues like fairness, toxicity, and content moderation strategies.
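Two of the quantitative metrics named in the last talk can be sketched in a few lines. The snippet below is a simplified illustration, not the tutorial's code. It assumes embeddings and features have already been extracted (by CLIP for CLIPScore, and by a video feature network for FVD) and are passed in as plain lists. The `weight=2.5` rescaling follows the convention commonly cited for CLIPScore and should be treated as an assumption. The Fréchet distance here assumes diagonal covariance, whereas FVD as usually computed uses full covariance matrices.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(text_emb, frame_embs, weight=2.5):
    """Frame-averaged CLIPScore-style value: weight * max(cos, 0)."""
    scores = [weight * max(cosine_similarity(text_emb, f), 0.0)
              for f in frame_embs]
    return sum(scores) / len(scores)


def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature
    sets, under the simplifying assumption of diagonal covariance."""
    def mean_std(feats):
        n, dim = len(feats), len(feats[0])
        mu = [sum(f[i] for f in feats) / n for i in range(dim)]
        sd = [math.sqrt(sum((f[i] - mu[i]) ** 2 for f in feats) / n)
              for i in range(dim)]
        return mu, sd

    mu_a, sd_a = mean_std(feats_a)
    mu_b, sd_b = mean_std(feats_b)
    return sum((ma - mb) ** 2 + (sa - sb) ** 2
               for ma, mb, sa, sb in zip(mu_a, mu_b, sd_a, sd_b))


# Toy inputs standing in for real embeddings and features (hypothetical).
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
score = clip_score(text, frames)

real = [[0.0, 0.0], [2.0, 2.0]]
fake = [[1.0, 0.0], [3.0, 2.0]]
distance = frechet_distance_diag(real, fake)
```

Both quantities depend entirely on the feature extractor behind them, which is one reason such scores can disagree with human judgments of quality.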

Key Takeaways

  • Diffusion models are rapidly advancing, with DDIM and Latent Diffusion enabling more efficient and high-quality image and video generation.
  • Video generation is a complex task, often tackled through cascaded approaches, 3D convolutions, and specialized architectures to maintain temporal consistency and realism.
  • Evaluating generative models is challenging due to the lack of standardized, robust metrics and the subjective nature of ‘quality,’ necessitating a holistic approach beyond traditional quantitative scores.
  • Addressing ethical concerns like fairness, bias, and content moderation is crucial, as generative AI systems can perpetuate stereotypes or create harmful content, requiring proactive mitigation strategies.
  • The field is actively developing solutions for these challenges, including new benchmarks like EvalCrafter, techniques for ‘safe diffusion,’ and methods to erase unwanted concepts from models, indicating a strong focus on responsible innovation.
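The efficiency gain credited to Latent Diffusion in the first takeaway can be made concrete with simple arithmetic. The sketch below assumes a Stable Diffusion-style autoencoder with 8x spatial downsampling and 4 latent channels, applied per frame with no temporal compression. The clip size and helper names are illustrative assumptions.

```python
def num_values(height, width, channels, frames=1):
    """Number of scalar values in a (frames, height, width, channels) tensor."""
    return height * width * channels * frames


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Per-frame latent shape under the assumed autoencoder settings."""
    return height // downsample, width // downsample, latent_channels


# A 16-frame 512x512 RGB clip in pixel space versus latent space.
pixel_values = num_values(512, 512, 3, frames=16)
lat_h, lat_w, lat_c = latent_shape(512, 512)
latent_values = num_values(lat_h, lat_w, lat_c, frames=16)
ratio = pixel_values / latent_values
```

Under these assumptions the denoiser operates on 48 times fewer values than it would in pixel space, which is where most of the compute saving comes from.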

Methods / Models / Datasets Mentioned

  • DDPM
  • DDIM
  • CLIP
  • Latent Diffusion
  • Stable Diffusion
  • LoRA
  • DreamBooth
  • ControlNet
  • Make-A-Video
  • Imagen Video
  • Align your Latents
  • EvalCrafter
  • Show-1
  • VideoCrafter
  • ModelScopeT2V
  • Lumiere
  • DiT
  • GenTron
  • W.A.L.T.
  • Snap Video
  • Sora
  • NUWA-XL
  • VideoPoet
  • Inception Score
  • FVD
  • CLIPScore
  • DFT
  • Safe Diffusion
  • Groot
  • MAGVIT-v2
  • SoundStream
  • GLIGEN
  • MCDiff
  • AADiff
  • MM-Diffusion
  • CoDi
  • LFDM
  • Generative Dynamics
  • MinD-Video
  • LVDM
  • VideoGen
  • VidRD
  • NExT-GPT
  • Generative Disco
  • VideoFusion
  • Latent-Shift
  • AnimateDiff
  • DSDN
  • MagicVideo
  • SimDA
  • VideoDirectorGPT
  • LLM-Grounded VDM
  • VisorGPT
  • DirecT2V
  • Free-Bloom
  • Dysen-VDM
  • Inception-v3
  • ImageNet
  • CIFAR-10
  • WebVid-10M
  • HD-VILA-100M
  • Vimeo25M
  • UCF-101
  • MSR-VTT
  • FlintstonesHD
  • SynthID
  • RingID
  • HELM
  • Text2Video-Zero
  • PYoCo
  • MotionCtrl

Topics

Diffusion Models · Video Generation · DDPM · DDIM · Latent Diffusion · CLIP · Generative AI Evaluation · Fairness in AI · Content Moderation · Digital Watermarks

