Visual Generative Modeling: What’s After Diffusion?
Event: CVPR 2025 Workshop · Duration: 222 min
Abstract
This workshop surveys visual generative modeling beyond current diffusion models. Speakers examine the limitations of diffusion-based approaches, such as slow sampling and limited control, and revisit classic generative models like GANs and normalizing flows from a modern perspective. Discussions cover methods for efficient and controllable generation, including one-step sampling, 4D object synthesis, and the integration of physical intrinsics. The workshop also examines language as a visual format, the pursuit of end-to-end generative modeling, and multi-modal and 3D scene generation.
Speakers
- Tianhong Li — MIT
- William T. Freeman — MIT and Google DeepMind
- Jiajun Wu — Stanford University
- Jun-Yan Zhu — Carnegie Mellon University
- Phillip Isola — MIT
- Kaiming He — MIT
- Zhengyang Geng — Carnegie Mellon University
- Robin Rombach — Black Forest Labs
Talks (7)
- 00:40:40 — William T. Freeman: After Diffusion Models
  - Discusses limitations of current diffusion models and explores alternative generative modeling approaches like conventional graphics, genetic algorithms, and perceptually-based heuristics.
- 01:08:50 — Jiajun Wu: Controllable, Intuitive Generation of 4D Objects and Scenes
  - Explores methods for controllable and intuitive generation of 4D objects and scenes, moving beyond 2D pixels to physical object intrinsics like shape, material, motion, and light.
- 01:24:30 — Jun-Yan Zhu: Still Training GANs in 2025?
  - Addresses the continued relevance of GANs, highlighting their challenges and recent advancements in training strategies, including differentiable augmentation and vision-aided discriminators.
- 01:51:00 — Phillip Isola: Language as a Visual Format
  - Challenges the pixel-centric view of visual formats, proposing language as a first-class way to convey visual information and introducing a cycle-consistency reward model for image-text alignment.
- 02:20:20 — Kaiming He: Towards End-to-End Generative Modeling
  - Discusses the historical shift to end-to-end training in recognition models and observes a similar pattern emerging in generative models, exploring flow matching and neural ODEs as paths towards end-to-end generative modeling.
- 03:00:20 — Zhengyang Geng: Mean Flows for One-step Generative Modeling
  - Introduces ‘Mean Flows’ as a method for one-step generative modeling, building upon flow matching to learn an average velocity field that maps noise to data in a single step.
- 03:02:30 — Robin Rombach: What’s after Diffusion?
  - Explores future directions beyond vanilla latent diffusion models, focusing on architectural innovations, novel training strategies, in-context image synthesis, multi-modality, and 3D scene generation.
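The "average velocity" idea in the Mean Flows talk can be illustrated on a toy problem. The sketch below is not the speakers' method or code. It assumes a 1D Gaussian data distribution (hypothetical `MU`, `S`), the linear path z_t = (1 - t) * x + t * eps with t = 1 as pure noise, and it replaces the learned network with a numerical ODE integration that acts as an oracle for the average velocity u(z_t, r, t) = (z_t - z_r) / (t - r). All function names are illustrative.

```python
# Toy illustration (assumption: 1D Gaussian data, not from the workshop).
MU, S = 2.0, 0.5  # hypothetical data distribution N(MU, S^2)


def velocity(z: float, t: float) -> float:
    """Closed-form marginal velocity E[eps - x | z_t = z] for the
    path z_t = (1 - t) * x + t * eps, x ~ N(MU, S^2), eps ~ N(0, 1)."""
    var_t = (1.0 - t) ** 2 * S ** 2 + t ** 2
    slope = (t - (1.0 - t) * S ** 2) / var_t
    mean_t = (1.0 - t) * MU
    return -MU + slope * (z - mean_t)


def average_velocity(z_t: float, r: float, t: float, n_steps: int = 2000) -> float:
    """Oracle for u(z_t, r, t): integrate dz/dt = velocity(z, t) from t
    down to r with midpoint steps, then return displacement / elapsed time.
    A Mean Flows style model would instead predict this with one network call."""
    h = (r - t) / n_steps  # negative step when r < t
    z, tau = z_t, t
    for _ in range(n_steps):
        z_mid = z + 0.5 * h * velocity(z, tau)
        z = z + h * velocity(z_mid, tau + 0.5 * h)
        tau += h
    return (z_t - z) / (t - r)


def one_step_sample(eps: float) -> float:
    """Map noise (t = 1) to data (t = 0) in one jump: x = eps - u(eps, 0, 1)."""
    return eps - average_velocity(eps, 0.0, 1.0)
```

In this toy setting the exact noise-to-data map is x = MU + S * eps, so `one_step_sample(0.7)` should be close to 2.35. The expensive integration happens inside the oracle here; the point of learning the average velocity directly is that generation then costs a single function evaluation.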
Key Takeaways
- The workshop highlights a shift in generative modeling research towards addressing the limitations of current diffusion models, particularly concerning speed, control, and interpretability.
- There’s a growing interest in exploring alternative or complementary generative paradigms, including classical models like GANs and normalizing flows, as well as novel approaches like flow matching and neural ODEs, to achieve more efficient and controllable generation.
- The concept of ‘end-to-end’ generative modeling is gaining traction, aiming to integrate various stages of generation (e.g., from text to 3D scenes) into unified, differentiable frameworks, potentially leveraging physical intrinsics and multi-modal inputs.
- Language is being re-evaluated as a powerful visual format, with research focusing on improving image-text alignment and using detailed textual descriptions to convey rich visual information, offering an alternative to pixel-based representations.
- The community is actively working on scaling up generative models, improving their training stability, and exploring their application in complex scenarios like in-context image synthesis, multi-modal generation (audio, video, text), and real-time 3D scene creation.
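For readers unfamiliar with the flow matching and neural ODE setup referenced in the takeaways, one common formulation is sketched below. The notation (linear interpolation between data x and noise ε, with t = 1 as pure noise) is an assumption for illustration; this summary does not record the exact conventions used in the talks.

```latex
\begin{aligned}
z_t &= (1 - t)\,x + t\,\epsilon, \qquad x \sim p_{\text{data}},\ \epsilon \sim \mathcal{N}(0, I),\ t \in [0, 1] \\
\mathcal{L}_{\text{FM}}(\theta) &= \mathbb{E}_{t,\,x,\,\epsilon}\,\bigl\| v_\theta(z_t, t) - (\epsilon - x) \bigr\|^2 \\
\frac{dz_t}{dt} &= v_\theta(z_t, t), \qquad \text{integrated from } t = 1 \text{ (noise) to } t = 0 \text{ (data)}
\end{aligned}
```

Sampling under this formulation requires a numerical ODE solver, so generation cost grows with the number of solver steps. That cost is the sampling-speed limitation that few-step and one-step methods aim to reduce.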
Methods / Models / Datasets Mentioned
Diffusion Models · Conventional Graphics · Genetic Algorithms · Simple Explainable Algorithms · Perceptually-based Heuristics · Neural Networks · Normalizing Flows · Autoregressive Models · Consistency Models · Image-to-Image Translation · Generative Adversarial Networks (GANs) · StyleGAN2-ADA · Projected GANs · Perceptual Discriminators · CLIP · DINO · Mean Flows · Neural Ordinary Differential Equations (ODEs) · Flow Matching · DDPM · REPA · REPA-E · FLUX.1 Kontext · LADD · MMaudio · Genie 2 · World Labs · SpAtial · JetFormer · Diffusion Autoencoders · Scalable Image Tokenizers · CycleReward · DPO · Pix2Pix Turbo · CycleGAN · CycleGAN-turbo · VideoJam · ReDI · DIT/SIT · GPT-Image-1 · Gen-4 · HiDream E1 · BAGEL · SDXL · DALL-E 2 · DALL-E 3 · SBERT · DreamSim · VQA Score · PickScore · ImageReward · HPSv2 · CLIPScore · SD 1.5 Teacher · 2-step LCM-LoRA · InstaFlow-0.9B · Diffusion2GAN · ICT-XL/2 · Shortcut-XL/2 · iMM-XL/2 · MeanFlow-XL/2 · ADM · LDM · SimDiff · DiT-XL/2 · SIT-XL/2+REPA · BigGAN · GigaGAN · StyleGAN-T · AR w/ VQGAN · MaskGIT · ViT-d30 · MAR-H · Aurora · R3GAN · PhysDreamer
Topics
Visual Generative Modeling · Diffusion Models · Flow Matching · End-to-End Generative Modeling · 4D Object and Scene Generation · Language as a Visual Format · GANs · Multi-modality · 3D Scene Generation · Latent Space Optimization
Notes
Open for commentary — connections to other work, critiques, follow-up reading.