7th Multi-modal Learning Workshop

Event: CVPR 2024 Workshop · Duration: 237 min

Abstract

The 7th Multi-modal Learning Workshop at CVPR 2024 featured five talks spanning robust multimodal learning, video-language modeling, medical report generation, audio description, and 3D perception. Cees Snoek discussed multimodal learning under challenging visual conditions, emphasizing robustness to domain, modality, and resource shifts. Bhiman Kumar Baghel presented a method that uses generative language models for weakly supervised sentence component analysis in video-language joint learning. Ankan Deria introduced InVERGe, an intelligent visual encoder for bridging modalities in medical report generation. Gül Varol presented the AutoAD Trilogy, which focuses on audio description generation for movies. Finally, Laura Leal-Taixé discussed open-world 3D segmentation and tracking, highlighting the use of 2D foundation models for lidar pseudo-labeling.

Speakers

  • Cees Snoek — University of Amsterdam
  • Bhiman Kumar Baghel — Carnegie Mellon University
  • Ankan Deria — Jio Institute
  • Laura Leal-Taixé — NVIDIA
  • Gül Varol — École des Ponts ParisTech, France

Talks (5)

  • 00:00:00 — Cees Snoek: Multimodal Learning Under Visually Challenging Conditions
    • This talk discusses challenges and solutions in multimodal learning, particularly focusing on robustness to domain shift, modality shift, and resource shift under visually challenging conditions.
  • 02:28:19 — Bhiman Kumar Baghel: Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning
    • This presentation introduces a novel approach to leverage large language models for generating weak labels, which are then used in a contrastive learning framework to improve the performance of video-language models in tasks like moment retrieval and text-video retrieval.
  • 02:42:48 — Ankan Deria: InVERGe: Intelligent Visual Encoder for Bridging Modalities in Report Generation
    • This talk introduces InVERGe, a framework designed to generate medical reports from chest X-ray images by bridging visual and textual modalities using a self-supervised joint-embedding predictive architecture and a cross-modal query-fusion layer.
  • 03:02:49 — Gül Varol: AutoAD Trilogy: Audio Description Generation for Movies
    • This presentation introduces the AutoAD Trilogy, a series of works focused on generating audio descriptions for movies, addressing challenges related to data scarcity, multimodal input integration, and ensuring story-relevant, expressive, and coherent descriptions.
  • 03:38:57 — Laura Leal-Taixé: Open-world Segmentation and Tracking in 3D
    • This talk introduces a novel approach for open-world 3D segmentation and tracking in autonomous driving, leveraging 2D foundation models to generate pseudo-labels for 3D lidar data, and demonstrating improved performance and generalization capabilities.
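
The contrastive learning framework mentioned in the Baghel talk summary above can be illustrated with a generic InfoNCE-style loss over paired video and text embeddings. This is a minimal sketch under stated assumptions, not the speaker's method: the summary does not say which loss is used or how the LLM-generated weak labels enter it, and the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    # Assumes a nonzero vector; a zero vector would divide by zero.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(video_embs: Sequence[Sequence[float]],
             text_embs: Sequence[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Video-to-text InfoNCE loss: the i-th text is the positive for the
    i-th video, and every other text in the batch acts as a negative."""
    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    total = 0.0
    for i, v in enumerate(videos):
        # Cosine similarities scaled by the temperature.
        logits = [dot(v, t) / temperature for t in texts]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching text.
        total += log_denominator - logits[i]
    return total / len(videos)


# Toy batch (hypothetical data): matched pairs vs. deliberately swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(videos, [[0.0, 1.0], [1.0, 0.0]])
```

With matched pairs the loss is close to zero, and with swapped pairs it is large, which is the behavior a contrastive objective is meant to reward.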

Key Takeaways

  • Multimodal learning is crucial for robust AI systems, especially under challenging visual conditions like low light or adverse weather, where combining modalities like audio and visual data can significantly improve performance.
  • Bias in AI models, particularly in vision-language tasks like meme explanation generation, is a significant concern. Explicitly providing metadata and leveraging large language models can help identify and mitigate these biases, leading to more reliable and fair AI systems.
  • Generating synthetic instructions using large language models can effectively augment training data for vision-and-language navigation tasks, leading to improved performance and generalization capabilities, especially when dealing with data scarcity.
  • Pseudo-labeling, particularly by transferring knowledge from strong 2D foundation models to 3D lidar data, is a powerful tool for open-world 3D segmentation and tracking, enabling models to segment and classify novel objects without extensive manual 3D annotation.
  • Future research in multimodal learning should focus on developing methods for dynamic scene understanding in open-world 4D environments, exploring geometric and 3D motion cues, and creating models that can engage in dialogue with humans while navigating, ultimately leading to more robust, generalizable, and interactive AI agents.
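
The 2D-to-3D pseudo-labeling idea in the takeaways above can be sketched as a label transfer: project each lidar point into the camera image and copy the 2D segment id found at that pixel. This is a simplified illustration, not the pipeline presented in the talk. It assumes points are already expressed in the camera frame, a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and a precomputed 2D mask of integer segment ids (for example from a 2D segmentation model); all function names are made up for this example.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the camera frame, z pointing forward


def project_point(p: Point, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Pinhole projection to an integer pixel (col, row).
    Returns None for points at or behind the camera plane."""
    x, y, z = p
    if z <= 0.0:
        return None
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    return u, v


def pseudo_label_points(points: Sequence[Point],
                        mask: Sequence[Sequence[int]],
                        fx: float, fy: float, cx: float, cy: float,
                        unlabeled: int = -1) -> List[int]:
    """Assign each 3D point the 2D segment id at its projected pixel.
    Points outside the image or behind the camera stay unlabeled."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    labels = []
    for p in points:
        pixel = project_point(p, fx, fy, cx, cy)
        if pixel is None:
            labels.append(unlabeled)
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            labels.append(mask[v][u])  # the mask is indexed [row][col]
        else:
            labels.append(unlabeled)
    return labels


# Toy 4x4 mask (hypothetical data) with two segments, ids 7 and 3.
mask = [
    [7, 7, 0, 0],
    [7, 7, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 3, 3],
]
points = [(-1.0, -1.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
labels = pseudo_label_points(points, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

In this toy setup the first two points fall on segments 7 and 3, while the point behind the camera and the point projecting outside the image remain unlabeled. A real system would also need lidar-to-camera extrinsics and occlusion handling, which this sketch leaves out.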

Methods / Models / Datasets Mentioned

  • C3D
  • t-SNE
  • CLIP
  • BLIP-2
  • ZinD-Agent
  • GPT-4
  • SAM
  • DBSCAN
  • PointPillars
  • LLaVA 1.5
  • MiniGPT-4
  • GPT-2
  • BERT
  • InVERGe
  • CMQFL
  • Moment-DETR
  • QD-DETR
  • XCLIP
  • DUET
  • AIGen
  • AutoAD Trilogy
  • WhisperX
  • SAL
  • CLIP-R
  • DBSCAN++
  • SeMoLi

Topics

Multimodal Learning · Vision-Language Models · 3D Segmentation · Object Tracking · Audio Description Generation · Fairness in AI · Medical Imaging · Lidar Data · Pseudo-labeling · Open-world Learning
