GenAI Media Generation Challenge Workshop @ CVPR
Event: CVPR 2024 Workshop · Duration: 264 min
Abstract
The MAGIC workshop at CVPR 2024 brought together leading researchers to discuss the latest advancements and challenges in generative AI for media creation. The workshop featured presentations on diverse topics, including improving FID/FVD metrics, historical and future perspectives on video generation, ethical considerations in text-to-image models, and novel approaches to controllable media generation. Key themes included the need for robust benchmarking, efficient multi-step diffusion models, and the integration of large language models for enhanced control. The panel discussion further explored the future of multimodal generative AI, the importance of data attribution, and the dynamic relationship between academic research and industrial applications in this rapidly evolving field.
Speakers
- Jun-Yan Zhu — CMU
- Richard Zhang — Adobe
- Ishan Misra — Meta, GenAI
- Kevin Chih-Yao Ma — Meta, GenAI
- Yuanzhen Li — Google
- Tim Salimans — Google DeepMind
- Akio Kodaira — Berkeley
- Chenfeng Xu — Berkeley
- Manuel Brack — DFKI AI
- Bingliang Zhang — The Chinese University of Hong Kong
- Junhao Zhuang — Tencent PCG
- Zhaoyang Zhang — The Chinese University of Hong Kong
- Yuxuan Bian — The Chinese University of Hong Kong
- Qiang Xu — The Chinese University of Hong Kong
Talks (12)
- 00:00:00 — Kevin Chih-Yao Ma: MAGIC Workshop Introduction
- Introduction to the MAGIC workshop, its goals, and the challenges in generative media benchmarking and evaluation.
- 01:59:59 — Jun-Yan Zhu: Known Issues with FID and FVD
- Discussion of known issues and limitations of FID and FVD metrics in evaluating generative models, including Gaussian assumptions, sample size requirements, and sensitivity to image processing details.
- 04:17:50 — Sergey Tulyakov: Video Generation: Past, Present, and a New Hope
- A historical overview of video generation, from early human attempts to modern diffusion models, highlighting the rapid progress and future potential of the field.
- 05:51:59 — Richard Zhang: Incentivizing Opt-in & Enabling Opt-out for Text-to-Image Models
- Exploration of methods to customize and control text-to-image diffusion models, including adding/removing concepts, addressing problematic compositions, and leveraging data attribution for ethical AI.
- 07:01:59 — Yuanzhen Li: Controllable Media Generation
- Overview of methods for controllable media generation, including subject, style, shape, material, and composition control, with examples from DreamBooth, StyleDrop, and Alchemist.
- 08:51:59 — Tim Salimans: Multistep Distillation of Diffusion Models via Moment Matching
- Introduction to Moment Matching distillation as a method to speed up diffusion models, achieving faster sampling while maintaining high quality, and outperforming teacher models in FID scores.
- 09:59:59 — Akio Kodaira: StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
- Introduction of StreamDiffusion, a pipeline-level solution for real-time interactive image generation, utilizing stream batch processing and future frame attention to achieve high throughput and temporal consistency.
- 11:16:59 — Manuel Brack: LEdits++ - Limitless Image Editing
- Introduction to LEdits++, a zero-shot image editing approach that allows for versatile and precise edits without fine-tuning, leveraging perfect inversion and semantic grounding for high-quality results.
- 12:21:59 — Bingliang Zhang: Instruction-Guided Image Editing
- Presentation of a novel instruction-guided image editing method that combines large language models with diffusion models to achieve precise and versatile edits, outperforming previous solutions in visual quality and instruction following.
- 13:32:00 — Junhao Zhuang: Media Generation Challenge Results
- Presentation of the MAGIC T2I Benchmark 4k, detailing its construction, evaluation methodology, and the results of the challenge, highlighting the performance of top models and the remaining gap with commercial solutions.
- 13:44:00 — Kevin Chih-Yao Ma: Panel Discussion
- Panel discussion on the future of generative models, focusing on multimodal integration, data attribution, and the interplay between academic research and industrial applications.
- 15:22:00 — Kevin Chih-Yao Ma: Closing Remarks
- Closing remarks for the workshop, summarizing key insights from the panel discussion and outlining future plans for the MAGIC challenge, including technical reports, leaderboard updates, and open-sourcing annotations.
Key Takeaways
- Diffusion models are becoming increasingly multimodal, integrating various data types like images, video, audio, and language for more efficient and comprehensive representations of the world.
- The field of generative AI is rapidly evolving, with new models and techniques emerging constantly, making it challenging to establish universal benchmarks and evaluation frameworks.
- There’s a growing need for robust and standardized evaluation metrics for generative models, moving beyond traditional FID/FVD to address issues like Gaussian assumptions, sample size, and sensitivity to image processing details.
- Ethical considerations, including data attribution, copyright, and the ability for creators to opt-in or opt-out of model training, are becoming central to the development and deployment of generative AI.
- Future research directions include developing more efficient multi-step diffusion models, exploring controllable media generation across various attributes, and leveraging self-supervised learning for improved temporal consistency in video generation.
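The FID concerns raised above (the Gaussian assumption and the dependence on sample size) are easier to see with the formula written out. FID fits a Gaussian to each feature set and reports the squared Fréchet distance between the two fits: ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)). The sketch below is a minimal illustration, not the workshop's evaluation code. It assumes random vectors as a stand-in for the Inception features a real FID pipeline would extract, and the names `frechet_distance` and `sample` are hypothetical.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each input is an (n_samples, n_features) array. Only the mean and
    covariance of each set are used, which is the Gaussian assumption.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Symmetric square root of cov_a via its eigendecomposition.
    vals, vecs = np.linalg.eigh(cov_a)
    sqrt_a = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of the symmetric matrix sqrt_a @ cov_b @ sqrt_a.
    inner = np.linalg.eigvalsh(sqrt_a @ cov_b @ sqrt_a)
    tr_sqrt = np.sqrt(np.clip(inner, 0.0, None)).sum()

    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Stand-in "features": both sets come from the SAME distribution, so the
# true distance is zero and anything measured is estimation error.
rng = np.random.default_rng(0)
dim = 16  # assumed toy dimension; Inception pool features are much larger


def sample(n):
    return rng.normal(size=(n, dim))


small_n = frechet_distance(sample(50), sample(50))
large_n = frechet_distance(sample(5000), sample(5000))
print(f"distance with 50 samples per set:   {small_n:.4f}")
print(f"distance with 5000 samples per set: {large_n:.4f}")
```

Because both sets are drawn from one distribution, the nonzero values are pure finite-sample error, and the error shrinks as the sample count grows. This is one reason scores computed with different sample counts are hard to compare.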
Methods / Models / Datasets Mentioned
GANs · DC-GANs · BigGAN · VQ-VAE · DF-GAN · GigaGAN · DRAW · PixelCNN · Image GPT · DDPM · LDM · Midjourney · EmuEdit · Emu · Stable Diffusion · Stable Diffusion 3 · Imagen · Parti · Muse · Flow Matching · Dall-E 2 · Dall-E 3 · MNIST · CIFAR · CUB/MS-COCO · T2I CompBench · TIFA · Drawbench · ImageGen · GlyphControl Text Benchmark · TextDiffuser MARIOEval · EmuEdit · PartiPrompts · DesignBench · CLIP · FID-CLIP · I3D · VDM++ · RIN · MultiStep-CD · CTM · DMD · Diff-Instruct · PerFlow · UFO-Gen · SwiftBrush · InstaFlow-1.7B · DreamBooth · HyperDreamBooth · StyleDrop · ZipLoRA · Muse · Lumiere · RealFill · DreamBooth3D · Alchemist · DreamFusion · Flag-DiT · Lumina-T2X · Lumina-Next · Next-DiT · Lumina-Next-SFT · LEdits++ · InstructPix2Pix · SmartEdit · Infinifit · BrushNet · Grounded-SAM · Tasvir · Lumina · StreamDiffusion · DALLE3 · Stable Diffusion 3.0 · LAION Dataset · Custom Diffusion · Textual Inversion · Dreambooth · HyperDreamBooth · BLIPDiffusion · SUTI · E4T-Diffusion · IP-Adapter · AnyDoor · FastComposer · ZipLoRA · SVDDiff · P+ · Cones 2 · VQGAN · MaskGIT · SDXL · Emu Video · DINO · MoCo · ALADIN · SSCD
Topics
Generative AI · Media Generation · FID/FVD Metrics · Video Generation · Text-to-Image Models · Controllable Generation · Diffusion Models · Multi-step Distillation · Data Attribution · Ethical AI