CVPR 2024 Workshop: Multimodal Foundation Models
Event: CVPR 2024 Workshop · Duration: 252 min
Abstract
This workshop focuses on the critical topic of evaluating multimodal foundation models, addressing their capabilities, limitations, and the future of their development. Speakers present novel benchmarks and metrics to assess compositional reasoning, context-sensitive visual understanding, and the robustness of models under various perturbations. Discussions highlight the shift towards data-centric evaluation, the challenges of multilingual and multimodal settings, and the importance of safety and trustworthiness in generative AI. The overarching goal is to foster a deeper understanding of how to build and evaluate more capable and reliable AI systems.
Speakers
- Ludwig Schmidt — University of Washington
- Ranjay Krishna — University of Washington
- Seoi Jeong — UCLA
- Zhiqiu Lin — Carnegie Mellon University & Meta
- Leonid Karlinsky — MIT-IBM Lab & IBM Research
- Hanwang Zhang — Nanyang Technological University
- Sadeep Jayasumana — Google Research
- Bo Li — UChicago/UIUC & Virtue AI
- Jungo Kasai — Kotoba Technologies
Talks (9)
- 00:00:00 — Ludwig Schmidt: DataComp: Evaluating Training Sets for Multimodal Models
- Discusses the importance of evaluation in AI progress and the shift towards data-centric ML, and introduces the LAION-5B dataset and the DataComp benchmark for multimodal training sets.
- 00:03:50 — Ranjay Krishna: The past, present, and future of Vision-Language Evaluation
- Explores the evolution of vision-language evaluation, highlighting the limitations of current metrics and proposing new methods for assessing compositional reasoning in models.
- 01:29:00 — Seoi Jeong: CONTEXTUAL: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
- Introduces CONTEXTUAL, a new benchmark for evaluating context-sensitive text-rich visual reasoning in large multimodal models, highlighting limitations of existing benchmarks.
- 01:35:25 — Zhiqiu Lin: Evaluating and Improving Compositional Text-to-Visual Generation
- Presents a benchmark and methods for evaluating and improving compositional text-to-visual generation, focusing on fine-grained control and addressing randomness in inpainting.
- 01:51:50 — Leonid Karlinsky: Analyzing and improving compositional reasoning in multi-modal foundation models
- Discusses methods for teaching vision-language models compositional reasoning using synthetic data and instruction tuning, highlighting improvements in structured concept understanding and comparison abilities.
- 02:22:00 — Hanwang Zhang: Road to L5 MM Generalist
- Proposes a hierarchical framework for evaluating multimodal large language models (MM-LLMs) based on task unification and synergy across comprehension, generation, and language tasks, aiming for a Level 5 generalist.
- 03:10:00 — Sadeep Jayasumana: Rethinking FID: Towards a Better Evaluation Metric for Image Generation
- Critiques the Fréchet Inception Distance (FID) metric for image generation, highlighting its reliance on Gaussian assumptions, and proposes a new metric (CMMD) that applies the maximum mean discrepancy (MMD) to CLIP embeddings for more robust evaluation.
- 03:33:00 — Bo Li: Risk Assessment, Safety Enhancement, and Guardrails for Generative AI
- Addresses the critical need for safety and trustworthiness in generative AI, proposing a comprehensive framework for risk assessment, safety enhancement, and guardrail development for foundation models.
- 03:51:00 — Jungo Kasai: Dramatic Five Years of AI and NLP Evaluation and the Future of Foundation Models
- Reflects on the rapid advancements in AI and NLP over the past five years, emphasizing the challenges and future directions in evaluating large language models, particularly in multilingual and multimodal contexts.
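The CMMD idea from the "Rethinking FID" talk above (an MMD distance computed over CLIP embeddings, with a Gaussian RBF kernel) can be illustrated with a minimal sketch. This is not the speaker's implementation: the function names, the bandwidth `sigma`, the choice of the unbiased estimator, and the random toy vectors standing in for real CLIP embeddings are all assumptions made for illustration.

```python
import math
import random


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Unbiased estimate of the squared MMD between two sets of vectors."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample set needs at least two vectors")
    # Within-set averages skip the i == j pairs (unbiased estimator).
    k_xx = sum(rbf_kernel(xs[i], xs[j], sigma)
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(rbf_kernel(ys[i], ys[j], sigma)
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    # Cross-set average uses every (x, y) pair.
    k_xy = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def toy_embeddings(rng, count, dim, shift):
    """Hypothetical stand-in for image embeddings: Gaussian vectors around `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
reference = toy_embeddings(rng, 50, 4, shift=0.0)
similar = toy_embeddings(rng, 50, 4, shift=0.0)
shifted = toy_embeddings(rng, 50, 4, shift=2.0)

print("same distribution:   ", mmd_squared(reference, similar))
print("shifted distribution:", mmd_squared(reference, shifted))
```

The estimate compares the two sample sets directly through kernel averages, so it makes no Gaussian assumption about how the embeddings are distributed, which is the limitation of FID that the talk summary highlights. In a real evaluation the toy vectors would be replaced by embeddings of reference and generated images, and the bandwidth would need to be chosen for that embedding space.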
Key Takeaways
- Evaluation is crucial for AI progress: it drives innovation and makes advances measurable.
- There’s a growing shift towards data-centric machine learning, emphasizing the quality and curation of training data.
- Multimodal foundation models, especially in vision and language, require robust evaluation methods that go beyond traditional benchmarks.
- Compositional reasoning remains a significant challenge for current models, even the most advanced ones, across various tasks.
- Developing trustworthy and safe generative AI systems necessitates comprehensive risk assessment, safety enhancement, and effective guardrails.
Methods / Models / Datasets Mentioned
LAION-5B · DataComp · CLIP · DALL-E · GPT-1 · GPT-2 · GPT-3 · GPT-4 · ResNets · SSD512 · Faster R-CNN · D-RFCN + SNIP · Mask R-CNN · NAS-FPN · DetectoRS · DyHead · Swin-L · FocalNet-H (DINO) · OpenCLIP · Stable Diffusion · DFN (Data Filtering Networks) · T-MARS · CLIPA-v2 · Flamingo · GPT-4o · InternVL-Chat-v1.2-34B · Qwen-VL-Max · Qwen-VL-Plus · GeminiProVision · GPT4V_20240409 · LLAVA-NEXT-34B · XComposer2 · BLIP2 · LLaVA 1.5-7b · InstructBLIP · LLaMa-Adapter-V2 · LLaVA · COCO · CLEVR · SAM (Segmentation) · VQGAN · CutMix · VAR (Vector Autoregression) · Mixup · GPT-time for Vision · ELMo (Peters et al., ACL 2017, NAACL 2018) · BERT · SeamlessM4T v2 (Meta) · NLP Pipeline (Larsson et al., 2017) · LayoutBench · ReCo · LDM · GLIGEN · ControlNet · DETR · CLIPScore · BLIP2v2Score · ImageReward · PickScore · HPSv2 · VQ2 · Davidsonian · VQAScore · DAC (Dense and Aligned Captions) · ConStruct-VL · Comparison Visual Instruction Tuning (CVIT) · ConMe (Rethinking Evaluation of Compositional Reasoning for Modern VLMs) · FID (Fréchet Inception Distance) · CMMD (CLIP-MMD) · KID (Kernel Inception Distance) · MMD (Maximum Mean Discrepancy) · Inception Embeddings · CLIP Embeddings · Gaussian RBF kernel · Muse
Topics
Multimodal Foundation Models · Evaluation Metrics · Compositional Reasoning · Data-centric AI · Multilinguality · Generative AI Safety · Benchmarking · Vision-Language Models
Notes
Open for commentary — connections to other work, critiques, follow-up reading.