CVPR 2024 Workshop

Event: CVPR 2024 Workshop · Duration: 177 min

Abstract

This workshop covers recent advances in generative AI for computer vision, centered on the SyntaGen Challenge and the broader role of synthetic data. Presentations cover methods for generating high-quality synthetic datasets for semantic segmentation, benchmarking the robustness of models trained on synthetic images, and extracting scene intrinsics from generative models. A significant portion of the workshop addresses text-to-image and text-to-video models: techniques for fine-grained control over generated content, improved consistency in video editing, and learning visual representations from non-visual data modalities such as math, code, and language.

Speakers

  • Minh-Tuan Huynh — Ho Chi Minh City University of Science, Vietnam
  • Felix Stillger — Bergische Universität Wuppertal
  • Krishnakant Singh — TU Darmstadt, hessian.AI
  • Xiaodan Du — Toyota Technological Institute at Chicago, Adobe
  • David J Fleet — Google DeepMind, University of Toronto, Vector Institute
  • Jia-Bin Huang — University of Maryland College Park
  • Phillip Isola — MIT

Talks (7)

  • 00:37:30 Minh-Tuan Huynh: Synthetic Is All You Need For Semantic Segmentation
    • This talk presents the Teddy Bear Team’s solution for the SyntaGen Challenge, focusing on generating synthetic datasets for semantic segmentation using CLIP Interrogator and Stable Diffusion 1.5, enhanced with a multi-label classifier to filter out redundant masks.
  • 00:57:40 Felix Stillger: Principal Component Clustering for Semantic Segmentation in Synthetic Data Generation
    • This presentation introduces a method for semantic segmentation in synthetic data generation using Principal Component Clustering (PCC) applied to self-attention maps from Stable Diffusion, combined with Open Vocabulary Attention Maps for class assignment.
  • 01:41:50 Krishnakant Singh: Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images
    • This talk investigates the robustness of models trained on synthetic data (synthetic clones) across various measures, comparing them to models trained on real data and highlighting limitations in adversarial robustness, common corruptions, OOD detection, and calibration.
  • 02:28:50 Xiaodan Du: Intrinsic LoRA: A Generalist Approach for Discovering Knowledge in Generative Models
    • This presentation introduces Intrinsic-LoRA (I-LoRA), a LoRA-based method that efficiently extracts scene intrinsics (surface normals, depth, albedo, shading) from a broad range of generative models. It also shows that the quality of the extracted intrinsics correlates with the visual quality of the generative model.
  • 03:00:00 David J Fleet: Promising Generative Data Augmentation
    • This talk explores the potential of generative models for data augmentation, highlighting the historical progression from early image generation to modern diffusion models, and demonstrating how synthetic data can improve performance on downstream tasks like ImageNet classification, especially when combined with real data.
  • 03:40:00 Jia-Bin Huang: Zero-Shot Aligned Text-to-Image Synthesis
    • This presentation introduces a zero-shot aligned text-to-image synthesis method that uses large language models (LLMs) for spatial layout reasoning and attention refocusing to generate images with precise object placement and compositional control. The talk demonstrates the method on complex scenes and reports high consistency in video editing tasks.
  • 04:00:00 Phillip Isola: Learning Vision without Visual Data
    • This talk explores “learning vision without visual data”: training models on non-visual data such as mathematical equations, code, and natural language. It demonstrates that these models can learn meaningful visual representations and perform well on tasks like ImageNet classification and human drawing recognition. The results are framed as support for a “Platonic Representation Hypothesis,” under which underlying visual structure can be universally encoded across different forms of data.
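
The I-LoRA talk above builds on low-rank adaptation (LoRA). As a rough illustration of why such methods are parameter-efficient, the sketch below shows the standard LoRA update, W + (alpha / r) · B · A, in plain Python. It is not the authors' implementation: the layer sizes, the rank, the `alpha` value, and all function names are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight of a LoRA-adapted layer: W + (alpha / r) * B @ A."""
    r = len(a)                # rank: A has shape (r, d_in)
    delta = matmul(b, a)      # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_counts(d_out, d_in, r):
    """Parameters in the frozen full matrix vs. the trainable low-rank factors."""
    return d_out * d_in, r * (d_out + d_in)


# Hypothetical toy sizes; real generative models use far larger layers.
d_out, d_in, r, alpha = 8, 6, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, small random init
b = [[0.0] * r for _ in range(d_out)]                                # trainable, zero init

w_eff = lora_weight(w, a, b, alpha)
full, low_rank = lora_param_counts(d_out, d_in, r)
print(f"full matrix params: {full}, LoRA params: {low_rank}")
```

With B initialized to zero, the adapted layer starts out identical to the frozen base layer, and only the two small factors are trained.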

Key Takeaways

  • Synthetic data generated by advanced models like Stable Diffusion can achieve competitive performance in semantic segmentation, especially when augmented with classification models to refine masks.
  • The quality of scene intrinsics (e.g., depth, normals) extracted from a generative model correlates with the visual quality of that model's outputs, suggesting that stronger generators capture scene structure more faithfully.
  • While synthetic data offers significant advantages, models trained solely on it may be less robust to adversarial perturbations and common corruptions, and weaker at out-of-distribution detection, than models trained on real data.
  • Text-to-video models are rapidly advancing, enabling the generation of complex, consistent, and controllable dynamic scenes, moving beyond simple image generation to create plausible video content.
  • Visual representations can be learned from non-visual data modalities like mathematical equations, code, and natural language, suggesting that underlying visual structure might be universally encoded across different forms of data.
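
Calibration, one of the robustness measures named in the talk summaries, is commonly quantified with expected calibration error (ECE). The sketch below is a minimal plain-Python version using equal-width confidence bins. Whether the benchmark in the talk used this exact metric or binning is an assumption, and the sample confidences are made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin [k / n_bins, (k + 1) / n_bins);
    # a confidence of exactly 1.0 goes into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Made-up predictions: an overconfident model (high confidence, half of them wrong).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [True, True, False, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

A lower ECE means predicted confidences track actual accuracy more closely. Comparing this value between a model trained on synthetic images and one trained on real images is one way to make the calibration comparison concrete.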

Methods / Models / Datasets Mentioned

  • CLIP Interrogator
  • Stable Diffusion 1.5
  • DeepLabv3
  • CLIP-ES
  • Principal Component Clustering (PCC)
  • Open Vocabulary Attention Maps (OVAM)
  • Intrinsic-LoRA (I-LoRA)
  • ResNet-50
  • VQ-VAE-2
  • BigGAN-deep
  • DALL-E 3
  • Midjourney
  • Emu (Meta)
  • Imagen (Google)
  • Firefly (Adobe)
  • SDXL (Stability.ai)
  • Sora (OpenAI)
  • Instant3D
  • Hertz (CVPR 2024)
  • MultiDiffusion (ICML 2023)
  • TokenFlow (ICLR 2024)
  • SceneScape (NeurIPS 2023)
  • DDIM Inversion
  • ResNet Block
  • SMM (Spatial Marginal Mean) Features
  • Exemplar VAE
  • VAE with Gaussian Prior
  • VAE with VampPrior
  • kNN classifier
  • DeiT-S
  • DeiT-B
  • DeiT-L
  • R-50
  • R-101
  • R-152
  • DINO-v2
  • SynCLR
  • SynCLIP
  • StyleGAN v2
  • StyleGAN-XL
  • VQGAN
  • Pix2Video
  • Control-A-Video
  • Gen-1
  • Tune-A-Video
  • Layered Neural Atlases
  • Omnimatte
  • CLIP text encoder
  • ResNet
  • Temporal ResNet
  • Spatial-attention
  • Cross-attention
  • Temporal-attention
  • Transformer
  • Objaverse
  • MVImgNet
  • ZeroVerse
  • 3DGS (3D Gaussian Splatting)
  • LRM (Large Reconstruction Model)
  • PFLRM (Pose-Free Large Reconstruction Model)
  • DMVD (Denoising Multi-View Diffusion)
  • ZeroScope

Topics

Generative AI · Synthetic Data Generation · Semantic Segmentation · Model Robustness · Intrinsic Feature Extraction · Text-to-Image Synthesis · Text-to-Video Models · Vision Learning without Visual Data · AI Pipeline Optimization · Data Augmentation

