CVPR 2024 Workshop: Multimodal Foundation Models

Event: CVPR 2024 Workshop · Duration: 252 min

Abstract

This workshop focuses on the critical topic of evaluating multimodal foundation models, addressing their capabilities, limitations, and the future of their development. Speakers present novel benchmarks and metrics to assess compositional reasoning, context-sensitive visual understanding, and the robustness of models under various perturbations. Discussions highlight the shift towards data-centric evaluation, the challenges of multilingual and multimodal settings, and the importance of safety and trustworthiness in generative AI. The overarching goal is to foster a deeper understanding of how to build and evaluate more capable and reliable AI systems.

Speakers

  • Ludwig Schmidt — University of Washington
  • Ranjay Krishna — University of Washington
  • Seoi Jeong — UCLA
  • Zhiqiu Lin — Carnegie Mellon University & Meta
  • Leonid Karlinsky — MIT-IBM Lab & IBM Research
  • Hanwang Zhang — Nanyang Technological University
  • Sadeep Jayasumana — Google Research
  • Bo Li — UChicago/UIUC & Virtue AI
  • Jungo Kasai — Kotoba Technologies

Talks (9)

  • 00:00:00 — Ludwig Schmidt: DataComp: Evaluating Training Sets for Multimodal Models
    • Discusses the importance of evaluation in AI progress, the shift towards data-centric ML, and introduces LAION-5B and DataComp benchmarks for multimodal training sets.
  • 00:03:50 — Ranjay Krishna: The past, present, and future of Vision-Language Evaluation
    • Explores the evolution of vision-language evaluation, highlighting the limitations of current metrics and proposing new methods for assessing compositional reasoning in models.
  • 01:29:00 — Seoi Jeong: CONTEXTUAL: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
    • Introduces CONTEXTUAL, a new benchmark for evaluating context-sensitive text-rich visual reasoning in large multimodal models, highlighting limitations of existing benchmarks.
  • 01:35:25 — Zhiqiu Lin: Evaluating and Improving Compositional Text-to-Visual Generation
    • Presents a benchmark and methods for evaluating and improving compositional text-to-visual generation, focusing on fine-grained control and addressing randomness in inpainting.
  • 01:51:50 — Leonid Karlinsky: Analyzing and improving compositional reasoning in multi-modal foundation models
    • Discusses methods for teaching vision-language models compositional reasoning using synthetic data and instruction tuning, highlighting improvements in structured concept understanding and comparison abilities.
  • 02:22:00 — Hanwang Zhang: Road to L5 MM Generalist
    • Proposes a hierarchical framework for evaluating multimodal large language models (MM-LLMs) based on task unification and synergy across comprehension, generation, and language tasks, aiming for a Level 5 generalist.
  • 03:10:00 — Sadeep Jayasumana: Rethinking FID: Towards a Better Evaluation Metric for Image Generation
    • Critiques the Fréchet Inception Distance (FID) metric for image generation, highlighting its reliance on Gaussian assumptions, and proposes a new metric (CMMD) based on CLIP embeddings and the maximum mean discrepancy (MMD) for more robust evaluation.
  • 03:33:00 — Bo Li: Risk Assessment, Safety Enhancement, and Guardrails for Generative AI
    • Addresses the critical need for safety and trustworthiness in generative AI, proposing a comprehensive framework for risk assessment, safety enhancement, and guardrail development for foundation models.
  • 03:51:00 — Jungo Kasai: Dramatic Five Years of AI and NLP Evaluation and the Future of Foundation Models
    • Reflects on the rapid advancements in AI and NLP over the past five years, emphasizing the challenges and future directions in evaluating large language models, particularly in multilingual and multimodal contexts.
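
The CMMD idea from the "Rethinking FID" talk (CLIP embeddings compared with MMD under a Gaussian RBF kernel) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `rbf_kernel` and `mmd_squared`, the bandwidth `sigma=10.0`, and the use of the unbiased estimator are assumptions made for illustration, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(a * a, axis=1)[:, None]
        + np.sum(b * b, axis=1)[None, :]
        - 2.0 * (a @ b.T)
    )
    # Clip tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd_squared(x: np.ndarray, y: np.ndarray, sigma: float = 10.0) -> float:
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d).

    Requires m >= 2 and n >= 2. The bandwidth sigma is an illustrative choice.
    """
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # Within-set terms exclude the diagonal (each point paired with itself).
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_xy = k_xy.mean()
    return float(term_xx + term_yy - 2.0 * term_xy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for embeddings of real and generated images.
    real = rng.normal(size=(200, 16))
    generated_close = rng.normal(size=(200, 16))
    generated_far = rng.normal(size=(200, 16)) + 1.0
    print("same distribution:   ", mmd_squared(real, generated_close))
    print("shifted distribution:", mmd_squared(real, generated_far))
```

Unlike FID, this estimate needs no mean or covariance fit, so it makes no Gaussian assumption about the embedding distribution. In practice the inputs would be embeddings from a CLIP image encoder rather than random vectors.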

Key Takeaways

  • Evaluation is crucial for AI progress: it drives innovation and makes advances measurable.
  • There’s a growing shift towards data-centric machine learning, emphasizing the quality and curation of training data.
  • Multimodal foundation models, especially in vision and language, require robust evaluation methods that go beyond traditional benchmarks.
  • Compositional reasoning remains a significant challenge for current models, even the most advanced ones, across various tasks.
  • Developing trustworthy and safe generative AI systems necessitates comprehensive risk assessment, safety enhancement, and effective guardrails.

Methods / Models / Datasets Mentioned

  • LAION-5B
  • DataComp
  • CLIP
  • DALL-E
  • GPT-1
  • GPT-2
  • GPT-3
  • GPT-4
  • ResNets
  • SSD512
  • Faster R-CNN
  • D-RFCN + SNIP
  • Mask R-CNN
  • NAS-FPN
  • DetectoRS
  • DyHead
  • Swin-L
  • FocalNet-H (DINO)
  • OpenCLIP
  • Stable Diffusion
  • DFN (Data Filtering Networks)
  • T-MARS
  • CLIPA-v2
  • Flamingo
  • GPT-4o
  • InternVL-Chat-v1.2-34B
  • Qwen-VL-Max
  • Qwen-VL-Plus
  • GeminiProVision
  • GPT4V_20240409
  • LLaVA-NeXT-34B
  • XComposer2
  • BLIP2
  • LLaVA 1.5-7b
  • InstructBLIP
  • LLaMA-Adapter-V2
  • LLaVA
  • COCO
  • CLEVR
  • SAM (Segmentation)
  • VQGAN
  • CutMix
  • VAR (Visual Autoregressive Modeling)
  • Mixup
  • GPT-time for Vision
  • ELMo (Peters et al., ACL 2017, NAACL 2018)
  • BERT
  • SeamlessM4T v2 (Meta)
  • NLP Pipeline (Larsson et al., 2017)
  • LayoutBench
  • ReCo
  • LDM
  • GLIGEN
  • ControlNet
  • DETR
  • CLIPScore
  • BLIPv2Score
  • ImageReward
  • PickScore
  • HPSv2
  • VQ2
  • Davidsonian Scene Graph (DSG)
  • VQAScore
  • DAC (Dense and Aligned Captions)
  • ConStruct-VL
  • Comparison Visual Instruction Tuning (CVIT)
  • ConMe (Rethinking Evaluation of Compositional Reasoning for Modern VLMs)
  • FID (Fréchet Inception Distance)
  • CMMD (CLIP-MMD)
  • KID (Kernel Inception Distance)
  • MMD (Maximum Mean Discrepancy)
  • Inception Embeddings
  • CLIP Embeddings
  • Gaussian RBF kernel
  • Muse
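
As background for the FID entry above, the standard definition makes the Gaussian assumption explicit. The notation is introduced here for illustration and is not taken from the talks: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception embeddings for real and generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Only the first two moments of each embedding set enter the formula, so it is exact only when both sets are Gaussian. MMD-based metrics such as KID and CMMD instead compare the two sets through a kernel and require no such fit.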

Topics

Multimodal Foundation Models · Evaluation Metrics · Compositional Reasoning · Data-centric AI · Multilinguality · Generative AI Safety · Benchmarking · Vision-Language Models

