Visual Generative Modeling: What’s After Diffusion?

Event: CVPR 2025 Workshop · Duration: 222 min

Abstract

This workshop explores the landscape of visual generative modeling beyond current diffusion models. Speakers examine the limitations of existing diffusion-based approaches, such as slow sampling and limited controllability, and revisit classic generative models like GANs and normalizing flows from a modern perspective. Key discussions cover methods for efficient and controllable generation, including one-step sampling, 4D object synthesis, and the integration of physical intrinsics. The workshop also examines language as a visual format, end-to-end generative modeling, and multi-modal and 3D scene generation.

Speakers

Tianhong Li — MIT
William T. Freeman — MIT and Google DeepMind
Jiajun Wu — Stanford University
Jun-Yan Zhu — Carnegie Mellon University
Phillip Isola — MIT
Kaiming He — MIT
Zhengyang Geng — Carnegie Mellon University
Robin Rombach — Black Forest Labs

Talks (7)

00:40:40 — William T. Freeman: After Diffusion Models
- Discusses limitations of current diffusion models and explores alternative generative modeling approaches like conventional graphics, genetic algorithms, and perceptually-based heuristics.
01:08:50 — Jiajun Wu: Controllable, Intuitive Generation of 4D Objects and Scenes
- Explores methods for controllable and intuitive generation of 4D objects and scenes, moving beyond 2D pixels to physical object intrinsics like shape, material, motion, and light.
01:24:30 — Jun-Yan Zhu: Still Training GANs in 2025?
- Addresses the continued relevance of GANs, highlighting their challenges and recent advancements in training strategies, including differentiable augmentation and vision-aided discriminators.
01:51:00 — Phillip Isola: Language as a Visual Format
- Challenges the pixel-centric view of visual formats, proposing language as a first-class way to convey visual information and introducing a cycle-consistency reward model for image-text alignment.
02:20:20 — Kaiming He: Towards End-to-End Generative Modeling
- Discusses the historical shift to end-to-end training in recognition models and observes a similar pattern emerging in generative models, exploring flow matching and neural ODEs as paths towards end-to-end generative modeling.
03:00:20 — Zhengyang Geng: Mean Flows for One-step Generative Modeling
- Introduces ‘Mean Flows’ as a method for one-step generative modeling, building upon flow matching to learn an average velocity field that maps noise to data in a single step.
03:02:30 — Robin Rombach: What’s after Diffusion?
- Explores future directions beyond vanilla latent diffusion models, focusing on architectural innovations, novel training strategies, in-context image synthesis, multi-modality, and 3D scene generation.
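
The Mean Flows entry above describes learning an average velocity field that carries noise to data in one step. As a rough illustration of that definition only (not the training method presented in the talk), the sketch below uses a toy 1D Gaussian data distribution whose flow-matching velocity is known in closed form. It integrates the ODE with many small Euler steps, averages the instantaneous velocity along each trajectory, and checks that a single step with that average lands on the same point. The constants `M` and `S`, the function names, and the time convention (t = 1 is noise, t = 0 is data) are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy setup: data x ~ N(M, S^2), noise eps ~ N(0, 1),
# linear path z_t = (1 - t) * x + t * eps (assumed convention: t=1 noise, t=0 data).
M, S = 2.0, 0.5


def velocity(z, t):
    """Closed-form marginal velocity E[eps - x | z_t = z] for the Gaussian toy case."""
    mean_t = (1.0 - t) * M
    var_t = (1.0 - t) ** 2 * S**2 + t**2
    slope = (t - (1.0 - t) * S**2) / var_t
    return -M + slope * (z - mean_t)


def integrate(z1, steps=1000):
    """Euler-integrate dz/dt = velocity from t=1 to t=0.

    Returns the endpoint z0 and the average velocity seen along each trajectory.
    """
    z = np.array(z1, dtype=float)
    dt = 1.0 / steps
    v_sum = np.zeros_like(z)
    for k in range(steps):
        t = 1.0 - k * dt
        v = velocity(z, t)
        v_sum += v
        z = z - dt * v  # step backward in time, toward the data end
    return z, v_sum / steps


rng = np.random.default_rng(0)
z1 = rng.standard_normal(5)          # noise samples at t = 1
z0_multi, u_avg = integrate(z1)      # many-step sampling
z0_one = z1 - (1.0 - 0.0) * u_avg    # one step using the average velocity
```

In this toy case the exact flow map is `z0 = M + S * z1`, so both `z0_multi` and `z0_one` should sit close to that line. Here the average velocity is obtained by integrating first, which defeats the purpose in practice; as the talk summary puts it, the point of Mean Flows is to learn the average velocity field so that sampling needs only the single step.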

Key Takeaways

The workshop highlights a shift in generative modeling research towards addressing the limitations of current diffusion models, particularly concerning speed, control, and interpretability.
Interest is growing in alternative or complementary generative paradigms, including classical models like GANs and normalizing flows as well as flow matching and neural ODE formulations, as routes to more efficient and controllable generation.
End-to-end generative modeling is gaining traction: the aim is to fold the separate stages of generation (e.g., from text to 3D scenes) into unified, differentiable frameworks, potentially leveraging physical intrinsics and multi-modal inputs.
Language is being re-evaluated as a powerful visual format, with research focusing on improving image-text alignment and using detailed textual descriptions to convey rich visual information, offering an alternative to pixel-based representations.
The community is actively working on scaling up generative models, improving their training stability, and exploring their application in complex scenarios like in-context image synthesis, multi-modal generation (audio, video, text), and real-time 3D scene creation.

Methods / Models / Datasets Mentioned

Diffusion Models
Conventional Graphics
Genetic Algorithms
Simple Explainable Algorithms
Perceptually-based Heuristics
Neural Networks
Normalizing Flows
Autoregressive Models
Consistency Models
Image-to-Image Translation
Generative Adversarial Networks (GANs)
StyleGAN2-ADA
Projected GANs
Perceptual Discriminators
CLIP
DINO
Mean Flows
Neural Ordinary Differential Equations (ODEs)
Flow Matching
DDPM
REPA
REPA-E
FLUX.1 Kontext
LADD
MMAudio
Genie 2
World Labs
SpAItial
JetFormer
Diffusion Autoencoders
Scalable Image Tokenizers
CycleReward
DPO
Pix2Pix Turbo
CycleGAN
CycleGAN-turbo
VideoJAM
ReDi
DiT/SiT
GPT-Image-1
Gen-4
HiDream E1
BAGEL
SDXL
DALL-E 2
DALL-E 3
SBERT
DreamSim
VQAScore
PickScore
ImageReward
HPSv2
CLIPScore
SD 1.5 Teacher
2-step LCM-LoRA
InstaFlow-0.9B
Diffusion2GAN
iCT-XL/2
Shortcut-XL/2
IMM-XL/2
MeanFlow-XL/2
ADM
LDM
SimDiff
DiT-XL/2
SiT-XL/2+REPA
BigGAN
GigaGAN
StyleGAN-T
AR w/ VQGAN
MaskGIT
VAR-d30
MAR-H
Aurora
R3GAN
PhysDreamer

Topics

Visual Generative Modeling · Diffusion Models · Flow Matching · End-to-End Generative Modeling · 4D Object and Scene Generation · Language as a Visual Format · GANs · Multi-modality · 3D Scene Generation · Latent Space Optimization
