CVPR 2024 Workshop: Multimodal Foundation Models

Event: CVPR 2024 Workshop · Duration: 252 min

Abstract

This workshop focuses on evaluating multimodal foundation models: their capabilities, their limitations, and the direction of their future development. Speakers present new benchmarks and metrics for assessing compositional reasoning, context-sensitive visual understanding, and robustness under perturbations. Discussions highlight the shift towards data-centric evaluation, the challenges of multilingual and multimodal settings, and the importance of safety and trustworthiness in generative AI. The overarching goal is a deeper understanding of how to build and evaluate more capable and reliable AI systems.

Speakers

Ludwig Schmidt — University of Washington
Ranjay Krishna — University of Washington
Seoi Jeong — UCLA
Zhiqiu Lin — Carnegie Mellon University & Meta
Leonid Karlinsky — MIT-IBM Lab & IBM Research
Hanwang Zhang — Nanyang Technological University
Sadeep Jayasumana — Google Research
Bo Li — UChicago/UIUC & Virtue AI
Jungo Kasai — Kotoba Technologies

Talks (9)

00:00:00 — Ludwig Schmidt: DataComp: Evaluating Training Sets for Multimodal Models
- Discusses the importance of evaluation in AI progress and the shift towards data-centric ML, and introduces the LAION-5B dataset and the DataComp benchmark for multimodal training sets.
00:03:50 — Ranjay Krishna: The past, present, and future of Vision-Language Evaluation
- Explores the evolution of vision-language evaluation, highlighting the limitations of current metrics and proposing new methods for assessing compositional reasoning in models.
01:29:00 — Seoi Jeong: CONTEXTUAL: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
- Introduces CONTEXTUAL, a new benchmark for evaluating context-sensitive text-rich visual reasoning in large multimodal models, highlighting limitations of existing benchmarks.
01:35:25 — Zhiqiu Lin: Evaluating and Improving Compositional Text-to-Visual Generation
- Presents a benchmark and methods for evaluating and improving compositional text-to-visual generation, focusing on fine-grained control and addressing randomness in inpainting.
01:51:50 — Leonid Karlinsky: Analyzing and improving compositional reasoning in multi-modal foundation models
- Discusses methods for teaching vision-language models compositional reasoning using synthetic data and instruction tuning, highlighting improvements in structured concept understanding and comparison abilities.
02:22:00 — Hanwang Zhang: Road to L5 MM Generalist
- Proposes a hierarchical framework for evaluating multimodal large language models (MM-LLMs) based on task unification and synergy across comprehension, generation, and language tasks, aiming for a Level 5 generalist.
03:10:00 — Sadeep Jayasumana: Rethinking FID: Towards a Better Evaluation Metric for Image Generation
- Critiques the Fréchet Inception Distance (FID) metric for image generation, highlighting its reliance on a Gaussian assumption for the embedding distribution, and proposes a new metric, CMMD, which computes the maximum mean discrepancy (MMD) over CLIP embeddings for more robust evaluation.
03:33:00 — Bo Li: Risk Assessment, Safety Enhancement, and Guardrails for Generative AI
- Addresses the critical need for safety and trustworthiness in generative AI, proposing a comprehensive framework for risk assessment, safety enhancement, and guardrail development for foundation models.
03:51:00 — Jungo Kasai: Dramatic Five Years of AI and NLP Evaluation and the Future of Foundation Models
- Reflects on the rapid advancements in AI and NLP over the past five years, emphasizing the challenges and future directions in evaluating large language models, particularly in multilingual and multimodal contexts.
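
The contrast drawn in the Rethinking FID talk, a Fréchet distance that relies on a Gaussian assumption versus an MMD with a Gaussian RBF kernel, can be illustrated with a toy sketch. This is not the CMMD implementation: it uses synthetic 2-D points in place of Inception or CLIP embeddings, and the function names, the bandwidth `sigma=0.5`, the sample size, and the cluster layout are all assumptions chosen for illustration.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two sample sets (rows are samples)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_x @ cov_y (real and non-negative up to rounding).
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def mmd_squared(x, y, sigma=0.5):
    """Biased (V-statistic) estimate of the squared MMD; makes no Gaussian assumption."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


rng = np.random.default_rng(0)
n = 1000  # illustrative sample size

# Set A: a standard 2-D Gaussian (zero mean, identity covariance).
reference = rng.standard_normal((n, 2))

# Set B: four tight clusters near (±0.95, ±0.95). The noise scale is chosen so
# each coordinate still has zero mean and unit variance, matching set A's
# first two moments even though the shape is clearly not Gaussian.
s = 0.95
centers = s * rng.choice([-1.0, 1.0], size=(n, 2))
clustered = centers + np.sqrt(1.0 - s ** 2) * rng.standard_normal((n, 2))

fd = frechet_distance(reference, clustered)
mmd2 = mmd_squared(reference, clustered)
print(f"Fréchet distance: {fd:.4f}")
print(f"MMD^2 (RBF, sigma=0.5): {mmd2:.4f}")
```

In this toy setup the two sets share approximately the same mean and covariance, so the Fréchet distance stays close to zero while the MMD estimate is clearly positive. A moment-matching metric cannot see the difference in shape, whereas a kernel-based one can.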

Key Takeaways

Evaluation is crucial for AI progress: it drives innovation and makes advances measurable.
There’s a growing shift towards data-centric machine learning, emphasizing the quality and curation of training data.
Multimodal foundation models, especially in vision and language, require robust evaluation methods that go beyond traditional benchmarks.
Compositional reasoning remains a significant challenge for current models, even the most advanced ones, across various tasks.
Developing trustworthy and safe generative AI systems necessitates comprehensive risk assessment, safety enhancement, and effective guardrails.

Methods / Models / Datasets Mentioned

LAION-5B
DataComp
CLIP
DALL-E
GPT-1
GPT-2
GPT-3
GPT-4
ResNets
SSD512
Faster R-CNN
D-RFCN + SNIP
Mask R-CNN
NAS-FPN
DetectoRS
DyHead
Swin-L
FocalNet-H (DINO)
OpenCLIP
Stable Diffusion
DFN (Data Filtering Networks)
T-MARS
CLIPA-v2
Flamingo
GPT-4o
InternVL-Chat-v1.2-34B
Qwen-VL-Max
Qwen-VL-Plus
GeminiProVision
GPT4V_20240409
LLaVA-NeXT-34B
XComposer2
BLIP2
LLaVA 1.5-7b
InstructBLIP
LLaMA-Adapter-V2
LLaVA
COCO
CLEVR
SAM (Segment Anything Model)
VQGAN
CutMix
VAR (Visual Autoregressive Modeling)
Mixup
GPT-time for Vision
ELMo (Peters et al., ACL 2017, NAACL 2018)
BERT
SeamlessM4T v2 (Meta)
NLP Pipeline (Larsson et al., 2017)
LayoutBench
ReCo
LDM
GLIGEN
ControlNet
DETR
CLIPScore
BLIPv2Score
ImageReward
PickScore
HPSv2
VQ2
Davidsonian Scene Graph (DSG)
VQAScore
DAC (Dense and Aligned Captions)
ConStruct-VL
Comparison Visual Instruction Tuning (CVIT)
ConMe (Rethinking Evaluation of Compositional Reasoning for Modern VLMs)
FID (Fréchet Inception Distance)
CMMD (CLIP-MMD)
KID (Kernel Inception Distance)
MMD (Maximum Mean Discrepancy)
Inception Embeddings
CLIP Embeddings
Gaussian RBF kernel
Muse

Topics

Multimodal Foundation Models · Evaluation Metrics · Compositional Reasoning · Data-centric AI · Multilinguality · Generative AI Safety · Benchmarking · Vision-Language Models

Notes

Open for commentary — connections to other work, critiques, follow-up reading.