Visual Generative Modeling: What’s After Diffusion?

Event: CVPR 2025 Workshop on Visual Generative Modeling: What’s After Diffusion? · Duration: 180 min

Abstract

This workshop session explores the landscape of generative modeling beyond traditional diffusion models, focusing on new paradigms for efficient and high-quality generation. Speakers from leading AI companies and research institutions discuss the limitations of current autoregressive and diffusion models, particularly concerning inference speed, training stability, and the ability to handle complex data distributions. Novel approaches like Inductive Moment Matching, scalable Normalizing Flows, and jump-based flow models are presented, alongside practical adaptations of diffusion models for tasks such as material control, novel view synthesis, and 4D generation. The session emphasizes the importance of an inference-first perspective and highlights the potential for models that can efficiently handle diverse data types and modalities.

Speakers

Jiaming Song — Luma AI
Jiatao Gu — Apple, UPenn
Varun Jampani — Stability AI
Arash Vahdat — NVIDIA
Ricky T. Q. Chen — Meta
Liang-Chieh (Jay) Chen — Google

Talks (6)

00:14:00 — Jiaming Song: New Pre-training Paradigms from an Inference-First Perspective
- This talk discusses new pre-training paradigms for generative models, focusing on an inference-first perspective, highlighting the limitations of current AR and diffusion models and proposing Inductive Moment Matching (IMM) as a solution for efficient and high-quality generation.
00:30:20 — Jiatao Gu: Scalable Normalizing Flows for Visual Generation
- This talk explores scalable normalizing flows for visual generation, comparing them to autoregressive and diffusion models, and highlighting their potential to address the trilemma of generative models (training stability, high quality samples, efficient inference) by offering direct invertible mappings.
01:06:00 — Varun Jampani: Diffusion Dialed In: Light and Heavy Adaptations of Diffusion Models for Complex Vision Tasks
- This talk presents various adaptations of diffusion models for complex vision tasks, including material transfer and control (Alchemist, ZeST, MARBLE), novel view synthesis (Stable Virtual Camera), and 4D generation (Stable Video 4D), showcasing their versatility and scalability.
01:31:58 — Arash Vahdat: What’s Wrong with Diffusion?
- This talk critiques diffusion models by examining their limitations in sampling speed, training efficiency, and ability to model heavy-tailed distributions, proposing solutions like Denoising Diffusion GANs (DDG) and f-Distill to accelerate sampling and improve modeling of complex data.
02:02:40 — Ricky T. Q. Chen: Unlocking Discontinuities in Flow Models: Jumps, Control Flow, Insertions, Deletions, etc
- This talk introduces a novel framework for flow models that incorporates ‘jumps’ to handle discontinuities, enabling more flexible and efficient generation of discrete data like text and code, and demonstrating its application in image captioning and code generation.
02:37:00 — Liang-Chieh (Jay) Chen: Beyond Latent Diffusion: A Journey Toward Efficient Generative Models
- This talk discusses advancements beyond latent diffusion, focusing on efficient generative models through compact 1D tokenizers (TiTok), randomized autoregressive models (RAR), and next-X prediction (xAR), demonstrating improved generation quality and speed for image and text-to-image tasks.

Key Takeaways

Current generative models, particularly AR and diffusion, face a trilemma between training stability, sample quality, and inference efficiency, necessitating new paradigms.
An ‘inference-first’ perspective can guide the development of more efficient and scalable generative algorithms by treating inference-time performance as a design constraint to settle before choosing the training objective.
Normalizing Flows, which are invertible by construction, offer a promising alternative to diffusion models once they are made scalable, potentially achieving high-quality samples with stable training and efficient inference.
Integrating ‘jumps’ into flow models allows for handling discontinuities in discrete data generation, opening new avenues for tasks like text and code generation with improved flexibility.
Combining elements from different generative model families (e.g., autoregressive and diffusion) and leveraging techniques like compact tokenization and randomized prediction can lead to significant improvements in generation quality and speed.
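
As an illustration of the direct invertible mapping that normalizing flows rely on, here is a minimal sketch of a single affine coupling layer in plain Python. It is a textbook construction, not the architecture presented in the talk. The helpers `toy_scale` and `toy_shift` are hypothetical stand-ins for learned networks, and the input is assumed to have even length.

```python
import math

def coupling_forward(x, scale_fn, shift_fn):
    """Affine coupling: pass the first half through, transform the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_fn(x1), shift_fn(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    log_det = sum(s)  # log |det Jacobian| of the mapping x -> y
    return x1 + y2, log_det

def coupling_inverse(y, scale_fn, shift_fn):
    """Exact inverse: the untouched half lets us recompute the same s and t."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale_fn(y1), shift_fn(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2

# Hypothetical stand-ins for learned conditioner networks (assumption).
def toy_scale(x1):
    return [math.tanh(v) for v in x1]

def toy_shift(x1):
    return [0.5 * v for v in x1]

x = [0.3, -1.2, 2.0, 0.7]  # even length assumed
y, log_det = coupling_forward(x, toy_scale, toy_shift)
x_rec = coupling_inverse(y, toy_scale, toy_shift)
```

Because the inverse is available in closed form, sampling takes one pass per layer, and the log-determinant gives an exact likelihood for training. Both properties are what the trilemma framing above refers to.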

Methods / Models / Datasets Mentioned

Inductive Moment Matching (IMM)
Discrete Autoregressive (AR) Models
Discrete Diffusion
Continuous Diffusion
Vision Language Models (VLM)
Interleaved Models
Dream Machine
Chameleon
Show-o
Transfusion
Unified Multimodal Discrete Diffusion
BAGEL (Mixture of Transformers)
VAEs (Variational Autoencoders)
Normalizing Flows
GANs (Generative Adversarial Networks)
Diffusion Distillation
DDIM (Denoising Diffusion Implicit Models)
Euler sampler
Flow Matching
Consistency Trajectory Models
Shortcut Models
Flow Map Matching
Mean Flows for One-step Generative Modeling
Maximum Mean Discrepancy (MMD)
RKHS (Reproducing Kernel Hilbert Space)
Denoising Diffusion GANs (DDG)
f-Distill
Scalable Normalizing Flows
Autoregressive Models (AR)
Diffusion Models
VQ-VAE (Vector Quantized Variational Autoencoder)
Stable Diffusion
Alchemist (Material Transfer and Control)
ZeST (Zero-shot Material Transfer)
MARBLE (Material Recomposition and Blending)
Stable Virtual Camera (Novel View Synthesis)
Stable Video 4D (4D Generation)
ControlNet
IP-Adapter
Jump Markov Process
Continuous-time Markov Chains (CTMC)
Discrete Flow Matching
TiTok (Transformer-based 1-Dimensional Tokenizer)
RAR (Randomized Autoregressive)
xAR (Next-X Prediction)
Masked Autoencoders (MAE)
MaskGIT
DiT (Diffusion Transformer)
ViT (Vision Transformer)
VQ-GAN
GPT (Generative Pre-trained Transformer)
CLIP (Contrastive Language-Image Pre-training)
Meta MovieGen
Adjoint Matching
Adjoint Sampling
LSGM (Latent Score-based Generative Model)
LDM (Latent Diffusion Models)
EQ-VAE (Equivariant VAE)
VF Loss (Vision Foundation Loss)
D3PM
CTDD
LlamaGen
DART
RFdiffusion
FrameFlow
FoldFlow
Protpardelle
ProteinGenerator
MultiFlow
BlobGEN-Vid
Truncated Consistency Models (TCM)
Flow Matching with General Discrete Paths
Simplified and Generalized Masked Diffusion for Discrete Data
Simple and Effective Masked Diffusion Language Models
Language-Guided Image Tokenization for Generation (TextTok)
FlexTok
LARP
UCF101

Topics

Generative Models · Diffusion Models · Autoregressive Models · Normalizing Flows · Inference Efficiency · Training Stability · Multi-modal Generation · 4D Generation · Discrete Data Generation · Text-to-Image Generation

Notes

Open for commentary — connections to other work, critiques, follow-up reading.