CVPR 2024 Workshop

Event: CVPR 2024 Workshop · Duration: 177 min

Abstract

This workshop covers recent advances in generative AI for computer vision, centered on the SyntaGen Challenge and the broader implications of synthetic data. Presentations cover methods for generating high-quality synthetic datasets for semantic segmentation, benchmarking the robustness of models trained on synthetic images, and extracting intrinsic knowledge from generative models. A significant portion of the workshop addresses text-to-image and text-to-video models: techniques for fine-grained control over generated content, more consistent video editing, and learning visual representations from non-visual data modalities such as math, code, and language.

Speakers

Minh-Tuan Huynh — Ho Chi Minh City University of Science, Vietnam
Felix Stillger — Bergische Universität Wuppertal
Krishnakant Singh — TU Darmstadt, hessian.AI
Xiaodan Du — Toyota Technological Institute at Chicago, Adobe
David J Fleet — Google DeepMind, University of Toronto, Vector Institute
Jia-Bin Huang — University of Maryland College Park
Phillip Isola — MIT

Talks (7)

00:37:30 — Minh-Tuan Huynh: Synthetic Is All You Need For Semantic Segmentation
- This talk presents the Teddy Bear Team’s solution for the SyntaGen Challenge, focusing on generating synthetic datasets for semantic segmentation using CLIP Interrogator and Stable Diffusion 1.5, enhanced with a multi-label classifier to filter out redundant masks.
00:57:40 — Felix Stillger: Principal Component Clustering for Semantic Segmentation in Synthetic Data Generation
- This presentation introduces a method for semantic segmentation in synthetic data generation using Principal Component Clustering (PCC) applied to self-attention maps from Stable Diffusion, combined with Open Vocabulary Attention Maps for class assignment.
01:41:50 — Krishnakant Singh: Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images
- This talk investigates the robustness of models trained on synthetic data (synthetic clones) across various measures, comparing them to models trained on real data and highlighting limitations in adversarial robustness, common corruptions, OOD detection, and calibration.
02:28:50 — Xiaodan Du: Intrinsic LoRA: A Generalist Approach for Discovering Knowledge in Generative Models
- This presentation introduces Intrinsic-LoRA (I-LoRA), a novel LoRA-based method capable of efficiently extracting various scene intrinsics (like surface normals, depth, albedo, shading) across a broad spectrum of generative models, demonstrating its high efficiency and the correlation between extracted intrinsics’ quality and the generative model’s visual quality.
03:00:00 — David J Fleet: Promising Generative Data Augmentation
- This talk explores the potential of generative models for data augmentation, highlighting the historical progression from early image generation to modern diffusion models, and demonstrating how synthetic data can improve performance on downstream tasks like ImageNet classification, especially when combined with real data.
03:40:00 — Jia-Bin Huang: Zero-Shot Aligned Text-to-Image Synthesis
- This presentation introduces a zero-shot aligned text-to-image synthesis method that leverages large language models (LLMs) for spatial layout reasoning and attention refocusing to generate images with precise object placement and compositional control. It demonstrates the method on complex scenes and also reports high consistency in video editing tasks.
04:00:00 — Phillip Isola: Learning Vision without Visual Data
- This talk explores “learning vision without visual data”: training models on non-visual data such as mathematical equations, code, and natural language. It shows that these models can learn meaningful visual representations and perform well on tasks like ImageNet classification and human drawing recognition. The talk relates this to a “Platonic Representation Hypothesis,” in which underlying visual structure can be universally encoded across different forms of data.
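
The Intrinsic-LoRA talk above builds on low-rank adaptation (LoRA). As a rough illustration of that general idea only, and not the authors' implementation, the Python sketch below wraps a frozen weight matrix W with a trainable low-rank update (alpha / r) * B * A. The class name `LoRALinear`, the `alpha` and `rank` parameters, and the initialization (small random A, zero B) are assumptions chosen to follow common LoRA convention; how I-LoRA attaches adapters to a particular generative model is not described in this summary.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration: names and defaults are assumptions,
    not taken from the I-LoRA talk.
    """

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weights, never updated
        # A starts small and random, B starts at zero, so the adapted
        # layer initially reproduces the base layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Count adapter parameters (entries of A and B only)."""
        return (len(self.A) * len(self.A[0])
                + len(self.B) * len(self.B[0]))
```

Only A and B would be trained, which is why such adapters are cheap: for a d_out × d_in layer the adapter holds rank × (d_in + d_out) parameters instead of d_out × d_in.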

Key Takeaways

Synthetic data generated by advanced models like Stable Diffusion can achieve competitive performance in semantic segmentation, especially when augmented with classification models to refine masks.
The quality of scene intrinsics (e.g., depth, normals) extracted from a generative model correlates with the visual quality of that model, suggesting that models which generate better images also capture scene structure more accurately.
While synthetic data offers significant advantages, models trained solely on it may be less robust to adversarial perturbations and common corruptions, and weaker at out-of-distribution detection, than models trained on real data.
Text-to-video models are rapidly advancing, enabling the generation of complex, consistent, and controllable dynamic scenes, moving beyond simple image generation to create plausible video content.
Visual representations can be learned from non-visual data modalities like mathematical equations, code, and natural language, suggesting that underlying visual structure might be universally encoded across different forms of data.
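
Calibration is one of the robustness measures named in the synthetic-clones benchmarking talk. A common way to quantify it is the expected calibration error (ECE); the summary does not say which metric or binning the speakers used, so the Python sketch below is only the standard textbook definition. The function name, the default of 10 equal-width bins, and the half-open bin edges are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted top-class probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    Bins are [k/n_bins, (k+1)/n_bins), with 1.0 placed in the last bin
    (an assumption; other edge conventions exist).
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model whose confidence matches its accuracy in every bin scores 0; a model that is always fully confident and always wrong scores 1. Comparing this value for a model trained on synthetic images against one trained on real images is one way to make the calibration comparison concrete.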

Methods / Models / Datasets Mentioned

CLIP Interrogator
Stable Diffusion 1.5
DeepLabv3
CLIP-ES
Principal Component Clustering (PCC)
Open Vocabulary Attention Maps (OVAM)
Intrinsic-LoRA (I-LoRA)
ResNet-50
VQ-VAE-2
BigGAN-deep
DALL-E 3
Midjourney
Emu (Meta)
Imagen (Google)
Firefly (Adobe)
SDXL (Stability.ai)
Sora (OpenAI)
Instant3D
Hertz (CVPR 2024)
MultiDiffusion (ICML 2023)
TokenFlow (ICLR 2024)
SceneScape (NeurIPS 2023)
DDIM Inversion
ResNet Block
SMM (Spatial Marginal Mean) Features
Exemplar VAE
VAE with Gaussian Prior
VAE with VampPrior
kNN classifier
DeiT-S
DeiT-B
DeiT-L
R-50
R-101
R-152
DINO-v2
SynCLR
SynCLIP
StyleGAN v2
StyleGAN-XL
VQGAN
Pix2Video
Control-A-Video
Gen-1
Tune-A-Video
Layered Neural Atlases
Omnimatte
CLIP text encoder
ResNet
Temporal ResNet
Spatial-attention
Cross-attention
Temporal-attention
Transformer
Objaverse
MVImgNet
ZeroVerse
3DGS (3D Gaussian Splatting)
LRM (Large Reconstruction Model)
PFLRM (Pose-Free Large Reconstruction Model)
DMVD (Denoising Multi-View Diffusion)
ZeroScope

Topics

Generative AI · Synthetic Data Generation · Semantic Segmentation · Model Robustness · Intrinsic Feature Extraction · Text-to-Image Synthesis · Text-to-Video Models · Vision Learning without Visual Data · AI Pipeline Optimization · Data Augmentation
