Visual Generative Modeling: What’s After Diffusion?

Event: CVPR 2025 Workshop on Visual Generative Modeling: What’s After Diffusion? · Duration: 180 min

Abstract

This workshop session explores the landscape of generative modeling beyond traditional diffusion models, focusing on new paradigms for efficient and high-quality generation. Speakers from leading AI companies and research institutions discuss the limitations of current autoregressive and diffusion models, particularly concerning inference speed, training stability, and the ability to handle complex data distributions. Novel approaches like Inductive Moment Matching, scalable Normalizing Flows, and jump-based flow models are presented, alongside practical adaptations of diffusion models for tasks such as material control, novel view synthesis, and 4D generation. The session emphasizes the importance of an inference-first perspective and highlights the potential for models that can efficiently handle diverse data types and modalities.

Speakers

  • Jiaming Song — Luma AI
  • Jiatao Gu — Apple, UPenn
  • Varun Jampani — Stability AI
  • Arash Vahdat — NVIDIA
  • Ricky T. Q. Chen — Meta
  • Liang-Chieh (Jay) Chen — Google

Talks (6)

  • 00:14:00 Jiaming Song: New Pre-training Paradigms from an Inference-First Perspective
    • This talk discusses new pre-training paradigms for generative models, focusing on an inference-first perspective, highlighting the limitations of current AR and diffusion models and proposing Inductive Moment Matching (IMM) as a solution for efficient and high-quality generation.
  • 00:30:20 Jiatao Gu: Scalable Normalizing Flows for Visual Generation
    • This talk explores scalable normalizing flows for visual generation, comparing them to autoregressive and diffusion models, and highlighting their potential to address the trilemma of generative models (training stability, high quality samples, efficient inference) by offering direct invertible mappings.
  • 01:06:00 Varun Jampani: Diffusion Dialed In: Light and Heavy Adaptations of Diffusion Models for Complex Vision Tasks
    • This talk presents various adaptations of diffusion models for complex vision tasks, including material transfer and control (Alchemist, ZeST, MARBLE), novel view synthesis (Stable Virtual Camera), and 4D generation (Stable Video 4D), showcasing their versatility and scalability.
  • 01:31:58 Arash Vahdat: What’s Wrong with Diffusion?
    • This talk critiques diffusion models by examining their limitations in sampling speed, training efficiency, and ability to model heavy-tailed distributions, proposing solutions like Denoising Diffusion GANs (DDG) and f-Distill to accelerate sampling and improve modeling of complex data.
  • 02:02:40 Ricky T. Q. Chen: Unlocking Discontinuities in Flow Models: Jumps, Control Flow, Insertions, Deletions, etc
    • This talk introduces a novel framework for flow models that incorporates ‘jumps’ to handle discontinuities, enabling more flexible and efficient generation of discrete data like text and code, and demonstrating its application in image captioning and code generation.
  • 02:37:00 Liang-Chieh (Jay) Chen: Beyond Latent Diffusion: A Journey Toward Efficient Generative Models
    • This talk discusses advancements beyond latent diffusion, focusing on efficient generative models through compact 1D tokenizers (TiTok), randomized autoregressive models (RAR), and next-X prediction (xAR), demonstrating improved generation quality and speed for image and text-to-image tasks.
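
As a minimal sketch of the "direct invertible mappings" mentioned in the normalizing flows talk summary, the toy example below uses a 1-D affine map with a standard normal base. It is not the architecture presented in the talk: the class name `AffineFlow` and its `shift` and `scale` parameters are hypothetical, chosen only to show the change-of-variables log-likelihood and exact inversion that flows rely on.

```python
import math


class AffineFlow:
    """Toy 1-D flow z = (x - shift) / scale with a standard normal base.

    Hypothetical illustration only; real flows stack many learned layers.
    """

    def __init__(self, shift: float, scale: float) -> None:
        if scale <= 0:
            raise ValueError("scale must be positive for an invertible map")
        self.shift = shift
        self.scale = scale

    def forward(self, x: float) -> tuple[float, float]:
        """Map data x to latent z and return log|dz/dx|."""
        z = (x - self.shift) / self.scale
        log_det = -math.log(self.scale)
        return z, log_det

    def inverse(self, z: float) -> float:
        """Map latent z back to data space (one pass, no iterative solver)."""
        return z * self.scale + self.shift

    def log_prob(self, x: float) -> float:
        """Exact log-density via change of variables."""
        z, log_det = self.forward(x)
        base_log_prob = -0.5 * (z * z + math.log(2.0 * math.pi))
        return base_log_prob + log_det


flow = AffineFlow(shift=2.0, scale=3.0)
x = 5.0
z, _ = flow.forward(x)
x_reconstructed = flow.inverse(z)  # round trip recovers x up to float error
log_density = flow.log_prob(x)     # equals the log-pdf of N(2, 3^2) at x
```

Sampling in this sketch is a single call to `inverse` on a draw from the base distribution, which is the property the talk summary contrasts with multi-step diffusion sampling.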

Key Takeaways

  • Current generative models, particularly AR and diffusion, face a trilemma between training stability, sample quality, and inference efficiency, necessitating new paradigms.
  • An ‘inference-first’ perspective can guide the development of more efficient and scalable generative algorithms by treating inference-time efficiency as the primary design goal and choosing the training method to fit it.
  • Normalizing Flows, which are invertible by construction, offer a promising alternative to diffusion models once made scalable, potentially achieving high-quality samples with stable training and efficient inference.
  • Integrating ‘jumps’ into flow models allows for handling discontinuities in discrete data generation, opening new avenues for tasks like text and code generation with improved flexibility.
  • Combining elements from different generative model families (e.g., autoregressive and diffusion) and leveraging techniques like compact tokenization and randomized prediction can lead to significant improvements in generation quality and speed.
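
To make the notion of ‘jumps’ concrete, here is a minimal sketch of simulating a continuous-time Markov chain over a small discrete state space: wait an exponentially distributed holding time, then jump to a new state with probability proportional to its rate. This is the generic textbook procedure, not the framework presented in the talk; the function name `simulate_ctmc` and the example rate matrix are assumptions made for illustration.

```python
import random


def simulate_ctmc(rates, start, t_end, rng):
    """Simulate one path of a continuous-time Markov chain.

    rates[i][j] is the jump rate from state i to state j (diagonal ignored).
    Returns a list of (time, state) pairs, beginning with (0.0, start).
    """
    t, state = 0.0, start
    path = [(t, state)]
    while True:
        outgoing = [(j, r) for j, r in enumerate(rates[state])
                    if j != state and r > 0]
        total_rate = sum(r for _, r in outgoing)
        if total_rate == 0:
            break  # absorbing state: no further jumps
        t += rng.expovariate(total_rate)  # exponential holding time
        if t >= t_end:
            break
        # Pick the jump target with probability proportional to its rate.
        u = rng.random() * total_rate
        cumulative = 0.0
        for j, r in outgoing:
            cumulative += r
            if u < cumulative:
                state = j
                break
        else:
            state = outgoing[-1][0]  # guard against float round-off
        path.append((t, state))
    return path


# Hypothetical 3-state chain: 0 -> 1 -> 2, where state 2 is absorbing.
example_rates = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
]
example_path = simulate_ctmc(example_rates, start=0, t_end=1e9,
                             rng=random.Random(0))
```

Unlike an ODE-driven flow, the state here changes only at discrete jump times, which is why such processes suit token-level data.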

Methods / Models / Datasets Mentioned

  • Inductive Moment Matching (IMM)
  • Discrete Autoregressive (AR) Models
  • Discrete Diffusion
  • Continuous Diffusion
  • Vision Language Models (VLM)
  • Interleaved Models
  • Dream Machine
  • Chameleon
  • Show-o
  • Transfusion
  • Unified Multimodal Discrete Diffusion
  • BAGEL (Mixture of Transformers)
  • VAEs (Variational Autoencoders)
  • Normalizing Flows
  • GANs (Generative Adversarial Networks)
  • Diffusion Distillation
  • DDIM (Denoising Diffusion Implicit Models)
  • Euler sampler
  • Flow Matching
  • Consistency Trajectory Models
  • Shortcut Models
  • Flow Map Matching
  • Mean Flows for One-step Generative Modeling
  • Maximum Mean Discrepancy (MMD)
  • RKHS (Reproducing Kernel Hilbert Space)
  • Denoising Diffusion GANs (DDG)
  • f-Distill
  • Scalable Normalizing Flows
  • Autoregressive Models (AR)
  • Diffusion Models
  • VQ-VAE (Vector Quantized Variational Autoencoder)
  • Stable Diffusion
  • Alchemist (Material Transfer and Control)
  • ZeST (Zero-shot Material Transfer)
  • MARBLE (Material Recomposition and Blending)
  • Stable Virtual Camera (Novel View Synthesis)
  • Stable Video 4D (4D Generation)
  • ControlNet
  • IP-Adapter
  • Jump Markov Process
  • Continuous-time Markov Chains (CTMC)
  • Discrete Flow Matching
  • TiTok (Transformer-based 1-Dimensional Tokenizer)
  • RAR (Randomized Autoregressive)
  • xAR (Next-X Prediction)
  • Masked Autoencoders (MAE)
  • MaskGIT
  • DiT (Diffusion Transformer)
  • ViT (Vision Transformer)
  • VQ-GAN
  • GPT (Generative Pre-trained Transformer)
  • CLIP (Contrastive Language-Image Pre-training)
  • Meta MovieGen
  • Adjoint Matching
  • Adjoint Sampling
  • LSGM (Latent Score-based Generative Model)
  • LDM (Latent Diffusion Models)
  • EQ-VAE (Equivariant VAE)
  • VF Loss (Vision Foundation Loss)
  • D3PM
  • CTDD
  • LlamaGen
  • DART
  • RFdiffusion
  • FrameFlow
  • FoldFlow
  • Protpardelle
  • ProteinGenerator
  • MultiFlow
  • BlobGEN-Vid
  • Truncated Consistency Models (TCM)
  • Flow Matching with General Discrete Paths
  • Simplified and Generalized Masked Diffusion for Discrete Data
  • Simple and Effective Masked Diffusion Language Models
  • Language-Guided Image Tokenization for Generation (TexTok)
  • FlexTok
  • LARP
  • UCF101
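
For readers unfamiliar with the Maximum Mean Discrepancy (MMD) entry above, the sketch below computes the standard biased estimate of squared MMD between two sets of 1-D samples using a Gaussian kernel. It illustrates the general statistic only and makes no claim about how any method in this list uses it; the function names `rbf_kernel` and `mmd_squared` and the default `bandwidth` are assumptions.

```python
import math


def rbf_kernel(x: float, y: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel, a common choice of RKHS kernel."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""

    def mean_kernel(a, b) -> float:
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


samples_a = [0.0, 1.0, 2.0]
samples_b = [5.0, 6.0, 7.0]
same = mmd_squared(samples_a, samples_a)     # identical samples give zero
shifted = mmd_squared(samples_a, samples_b)  # well-separated samples give a positive value
```

The estimate shrinks toward zero as the two sample sets come to resemble each other, which is what makes it usable as a distribution-matching signal.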

Topics

Generative Models · Diffusion Models · Autoregressive Models · Normalizing Flows · Inference Efficiency · Training Stability · Multi-modal Generation · 4D Generation · Discrete Data Generation · Text-to-Image Generation


Notes

Open for commentary — connections to other work, critiques, follow-up reading.