GenAI Media Generation Challenge Workshop @ CVPR

Event: CVPR 2024 Workshop · Duration: 264 min

Abstract

The MAGIC workshop at CVPR 2024 brought together researchers from academia and industry to discuss recent advances and open challenges in generative AI for media creation. Presentations covered known issues with the FID/FVD metrics, historical and future perspectives on video generation, opt-in and opt-out mechanisms for text-to-image models, and new approaches to controllable media generation. Key themes included the need for robust benchmarking, faster sampling through multistep distillation of diffusion models, and the integration of large language models for finer control. The panel discussion explored the future of multimodal generative AI, the importance of data attribution, and the relationship between academic research and industrial applications in this rapidly evolving field.

Speakers

Jun-Yan Zhu — CMU
Richard Zhang — Adobe
Ishan Misra — Meta, GenAI
Kevin Chih-Yao Ma — Meta, GenAI
Yuanzhen Li — Google
Tim Salimans — Google DeepMind
Akio Kodaira — Berkeley
Chenfeng Xu — Berkeley
Manuel Brack — DFKI AI
Bingliang Zhang — The Chinese University of Hong Kong
Junhao Zhuang — Tencent PCG
Zhaoyang Zhang — The Chinese University of Hong Kong
Yuxuan Bian — The Chinese University of Hong Kong
Qiang Xu — The Chinese University of Hong Kong

Talks (12)

00:00:00 — Kevin Chih-Yao Ma: MAGIC Workshop Introduction
- Introduction to the MAGIC workshop, its goals, and the challenges in generative media benchmarking and evaluation.
01:59:59 — Jun-Yan Zhu: Known Issues with FID and FVD
- Discussion of known issues and limitations of FID and FVD metrics in evaluating generative models, including Gaussian assumptions, sample size requirements, and sensitivity to image processing details.
04:17:50 — Sergey Tulyakov: Video Generation: Past, Present, and a New Hope
- A historical overview of video generation, from early human attempts to modern diffusion models, highlighting the rapid progress and future potential of the field.
05:51:59 — Richard Zhang: Incentivizing Opt-in & Enabling Opt-out for Text-to-Image Models
- Exploration of methods to customize and control text-to-image diffusion models, including adding/removing concepts, addressing problematic compositions, and leveraging data attribution for ethical AI.
07:01:59 — Yuanzhen Li: Controllable Media Generation
- Overview of methods for controllable media generation, including subject, style, shape, material, and composition control, with examples from DreamBooth, StyleDrop, and Alchemist.
08:51:59 — Tim Salimans: Multistep Distillation of Diffusion Models via Moment Matching
- Introduction to Moment Matching distillation as a method to speed up diffusion models, achieving faster sampling while maintaining high quality, and outperforming teacher models in FID scores.
09:59:59 — Akio Kodaira: StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
- Introduction of StreamDiffusion, a pipeline-level solution for real-time interactive image generation, utilizing stream batch processing and future frame attention to achieve high throughput and temporal consistency.
11:16:59 — Manuel Brack: LEdits++ - Limitless Image Editing
- Introduction to LEdits++, a zero-shot image editing approach that allows for versatile and precise edits without fine-tuning, leveraging perfect inversion and semantic grounding for high-quality results.
12:21:59 — Bingliang Zhang: Instruction-Guided Image Editing
- Presentation of a novel instruction-guided image editing method that combines large language models with diffusion models to achieve precise and versatile edits, outperforming previous solutions in visual quality and instruction following.
13:32:00 — Junhao Zhuang: Media Generation Challenge Results
- Presentation of the MAGIC T2I Benchmark 4k, detailing its construction, evaluation methodology, and the results of the challenge, highlighting the performance of top models and the remaining gap with commercial solutions.
13:44:00 — Kevin Chih-Yao Ma: Panel Discussion
- Panel discussion on the future of generative models, focusing on multimodal integration, data attribution, and the interplay between academic research and industrial applications.
15:22:00 — Kevin Chih-Yao Ma: Closing Remarks
- Closing remarks for the workshop, summarizing key insights from the panel discussion and outlining future plans for the MAGIC challenge, including technical reports, leaderboard updates, and open-sourcing annotations.

Key Takeaways

Diffusion models are becoming increasingly multimodal, integrating various data types like images, video, audio, and language for more efficient and comprehensive representations of the world.
The field of generative AI is rapidly evolving, with new models and techniques emerging constantly, making it challenging to establish universal benchmarks and evaluation frameworks.
There’s a growing need for robust and standardized evaluation metrics for generative models, moving beyond traditional FID/FVD to address issues like Gaussian assumptions, sample size, and sensitivity to image processing details.
Ethical considerations, including data attribution, copyright, and the ability for creators to opt-in or opt-out of model training, are becoming central to the development and deployment of generative AI.
Future research directions include developing more efficient multi-step diffusion models, exploring controllable media generation across various attributes, and leveraging self-supervised learning for improved temporal consistency in video generation.
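The sample-size issue noted above for FID can be illustrated with a small sketch. It is not code from the workshop: it uses only NumPy, draws toy Gaussian "features" in place of Inception features, and the function name `frechet_distance` and the feature size `dim` are illustrative assumptions. Both sets come from the same distribution, so the true distance is zero, yet the estimate is biased upward when few samples are used.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which avoids an explicit matrix square root.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
dim = 16  # toy size (assumption); standard FID uses 2048-dim Inception features
for n in (50, 500, 5000):
    a = rng.standard_normal((n, dim))
    b = rng.standard_normal((n, dim))
    # Same underlying distribution, so any nonzero value is estimation bias.
    print(n, round(frechet_distance(a, b), 4))
```

The printed values should shrink toward zero as `n` grows, which is why FID scores computed with different sample counts are not directly comparable.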

Methods / Models / Datasets Mentioned

GANs
DCGAN
BigGAN
VQ-VAE
DF-GAN
GigaGAN
DRAW
PixelCNN
Image GPT
DDPM
LDM
Midjourney
EmuEdit
Emu
Stable Diffusion
Stable Diffusion 3
Imagen
Parti
Muse
Flow Matching
DALL-E 2
DALL-E 3
MNIST
CIFAR
CUB/MS-COCO
T2I-CompBench
TIFA
DrawBench
ImageGen
GlyphControl Text Benchmark
TextDiffuser MARIOEval
PartiPrompts
DesignBench
CLIP
FID-CLIP
I3D
VDM++
RIN
MultiStep-CD
CTM
DMD
Diff-Instruct
PeRFlow
UFO-Gen
SwiftBrush
InstaFlow-1.7B
DreamBooth
HyperDreamBooth
StyleDrop
ZipLoRA
Lumiere
RealFill
DreamBooth3D
Alchemist
DreamFusion
Flag-DiT
Lumina-T2X
Lumina-Next
Next-DiT
Lumina-Next-SFT
LEdits++
InstructPix2Pix
SmartEdit
Infinifit
BrushNet
Grounded-SAM
Tasvir
Lumina
StreamDiffusion
LAION Dataset
Custom Diffusion
Textual Inversion
BLIP-Diffusion
SUTI
E4T-Diffusion
IP-Adapter
AnyDoor
FastComposer
SVDiff
P+
Cones 2
VQGAN
MaskGIT
SDXL
Emu Video
DINO
MoCo
ALADIN
SSCD

Topics

Generative AI · Media Generation · FID/FVD Metrics · Video Generation · Text-to-Image Models · Controllable Generation · Diffusion Models · Multi-step Distillation · Data Attribution · Ethical AI

Notes

Open for commentary — connections to other work, critiques, follow-up reading.