2nd Workshop on Multimodal Content Moderation

Event: 2024 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) - Workshop on Multimodal Content Moderation (MMCM) · Duration: 408 min

Abstract

This segment covers the opening remarks and three talks from the Workshop on Multimodal Content Moderation (MMCM) at CVPR 2024. Mei Chen introduces the workshop, emphasizing the increasing relevance of multimodal content moderation due to the proliferation of AI-generated content and outlining the day’s schedule. Dr. Symeon Papadopoulos then presents his work on leveraging AI for content moderation, including methods to refine models using large datasets and a user study on AI filters to mitigate the emotional impact of disturbing imagery. Following this, Marlyn Thomas Savio discusses the human aspect of content moderation, highlighting the challenges faced by human moderators in an evolving web landscape and proposing a framework for their wellness and support. The segment concludes with the introduction of Peyman Najafirad, who is set to discuss content safeguarding in the Generative AI era.

This segment features two talks on content moderation. The first talk addresses the challenges of content moderation in the generative AI era, proposing a regulatory content moderation system that leverages conditional vision-language models and counterfactual subobject explanations for content obfuscation. The second talk delves into the experiences of volunteer content moderators on Reddit, specifically within the AskHistorians subreddit. It highlights the difficulties they face in managing harmful content, including Holocaust denial and racist stereotypes, and explores the potential for AI-assisted tools to support their efforts.

This segment features two talks on content moderation. The first talk delves into the complexities of moderating ephemeral voice chat, emphasizing that such content is often unreported, expensive to process, and requires adaptive strategies that prioritize recall over precision. It highlights the importance of understanding emotional harms and relationship dynamics in these contexts. The second talk focuses on enhancing visual content safety in multimodal AI, particularly text-to-image models. It discusses the inherent risks and biases present in large-scale training datasets, such as misogyny and pornography, and introduces methodologies like red teaming and the I2P benchmark for evaluating and mitigating the generation of unsafe images.

This segment features two speakers discussing challenges and advancements in AI. Susanna Ricco from Google addresses how AI models can perpetuate harmful stereotypes through generated images, focusing on Google’s AI Principles and methods to mitigate bias in text-to-image models. Lin Ai from Columbia Engineering presents a multimodal deception detection system that leverages acoustic, visual, and lexical features, highlighting its performance and future applications. Both speakers emphasize the importance of ethical considerations and robust methodologies in AI development.

This segment features a series of talks and discussions on critical aspects of AI safety and fairness, particularly in the context of image generation and visual question answering (VQA). Speakers delve into the importance of ethical considerations, model evaluation, and the development of robust techniques to mitigate bias and prevent the generation of harmful content. The segment also includes a presentation on an end-to-end vision transformer approach for image copy detection, highlighting its effectiveness and the challenges it addresses.

Speakers

Mei Chen — Microsoft
Dr. Symeon (Akis) Papadopoulos — Principal Researcher @ CERTH/ITI, Head of MeVer group, Information Technologies Institute, Greece
Marlyn Thomas Savio — TaskUs
Peyman Najafirad (Paul Rad) — Associate Professor, Sr. Member of the National Academy of Inventors (NAI), The University of Texas at San Antonio
Sarah Gilbert — Cornell University
Mike Pappas — Modulate
Manuel Brack — DFKI AI
Susanna Ricco — Google
Lin Ai — Columbia Engineering
Lee Jiahe Steven — National University of Singapore

Talks (24)

00:01:39 — Mei Chen: Content Moderation: Multimodal
- Introduction to the workshop on multimodal content moderation, highlighting the growing importance due to AI-generated content, and outlining the day’s agenda and speakers.
00:07:29 — Dr. Symeon (Akis) Papadopoulos: Leveraging AI for Mitigating Viewer Impact from Disturbing Imagery
- This talk presents two strategies (CM-Refinery) for refining content moderation models using large datasets and a user study on AI style transfer filters to mitigate the emotional impact of disturbing imagery.
01:10:00 — Marlyn Thomas Savio: Human Moderation in an Evolving Web Landscape: Wellness Gaps and Needs
- This talk emphasizes the critical role of human moderators in safeguarding online spaces, discusses the VUCA nature of user-generated content, and proposes a “3 C’s” framework (Clarity, Change, Coping) to support moderator wellness and productivity.
01:21:31 — Peyman Najafirad (Paul Rad): Content Safeguarding in the Generative AI Era
- This talk discusses challenges in content moderation and proposes a regulatory content moderation system using conditional vision-language models and counterfactual subobject explanations for content obfuscation.
02:21:01 — Sarah Gilbert: Do moderators dream of AI support?
- This talk explores the challenges faced by volunteer content moderators on Reddit, particularly in the AskHistorians subreddit, and investigates how AI-assisted tools could support their work in managing harmful content.
02:43:03 — Mike Pappas: Ephemeral Chat Moderation: Understanding Invisible Harms
- This talk explores the unique challenges of moderating ephemeral voice chat, contrasting it with semi-permanent text and streaming content, highlighting issues like underreporting, processing costs, and the need for adaptive strategies that prioritize recall and understand emotional and relational harms.
03:27:28 — Manuel Brack: Enhancing Visual Content Safety: Multimodal Approaches for Dataset Curation and Model Safeguarding
- This presentation addresses the critical issue of content safety in multimodal vision models, particularly text-to-image generation, by examining the risks associated with large-scale, web-scraped datasets, proposing methods like red teaming and benchmarks (I2P) for evaluation, and discussing strategies to mitigate the generation of unsafe and biased content.
04:04:34 — Susanna Ricco: (untitled)
- The speaker discusses how AI models can perpetuate harmful stereotypes through generated images, focusing on how people are portrayed and who is portrayed. She introduces Google’s AI Principles and the concept of harm amplification, particularly oversexualization, in text-to-image models. She also highlights the importance of measuring and mitigating these biases, proposing methods like better captions and less NSFW content in training data. The speaker then delves into the challenges of representing human diversity, introducing the Monk Skin Tone Scale and the PATHS model for holistic perception and preferences, emphasizing the need for flexible and scalable systems aligned with human experiences.
04:04:34 — Lin Ai: Multimodal Deception Detection using Automatically Extracted Acoustic, Visual, and Lexical Features
- The speaker presents a multimodal deception detection system that combines automatically extracted acoustic, visual, and lexical features. They discuss the challenges of processing noisy audio and varying visual data, and introduce methods like OpenSMILE for acoustic features and Fisher Vector Encoding for visual features. The system also utilizes Google Cloud ASR for lexical features, despite a high word error rate. The speaker highlights the model’s performance in detecting deception at both utterance and round levels, achieving 73% accuracy with combined features, and discusses future work including evaluating the model on real-world data and exploring trust analysis.
05:26:06 — Susanna Ricco: Remember the context
- The speaker emphasizes the importance of conditioning on text queries when evaluating diversity in image sets to avoid misaligning results and reinforcing harmful stereotypes.
05:36:56 — Susanna Ricco: When and how to answer
- The speaker introduces the concept of ungrounded VQA queries, where models should politely refuse to answer questions if there’s insufficient visual information, to avoid hallucination and bias.
05:54:46 — Susanna Ricco: Measuring desired behavior
- The speaker discusses how Gemini models are evaluated for desired behavior in VQA queries, noting improvements in refusal rates for ungrounded queries while maintaining performance on grounded ones.
06:02:16 — Susanna Ricco: Autorater quality matters
- The speaker highlights the importance of autorater quality in VQA, demonstrating how a model can hallucinate information for an ungrounded query, leading to an undesirable response.
06:12:46 — Susanna Ricco: Techniques to improve behavior
- The speaker outlines techniques like pre-training, SFT, and RLHF used to improve model behavior, focusing on data filtering, providing examples, and optimizing for human preferences to avoid harmful stereotypes.
06:22:06 — Susanna Ricco: Make AI helpful for everyone
- The speaker concludes by reiterating the mission to build AI responsibly and inclusively, emphasizing the need for models to be fair and optimized for diverse user needs, considering both well-intentioned and adversarial uses.
07:06:06 — Lee Jiahe Steven: An End-to-End Vision Transformer Approach for Image Copy Detection
- The speaker introduces CE Detector, an end-to-end transformer-based solution for image copy detection that leverages both global and local features to effectively identify edited copies of source images.
07:17:06 — Lee Jiahe Steven: Introduction
- The speaker highlights the importance of image copy detection in maintaining digital media integrity and combating misinformation, addressing challenges like identifying source images, detecting heavy edits, and handling overlays.
07:26:06 — Lee Jiahe Steven: Overview of Proposed Approach
- The speaker details the CE Detector’s five key components: image patch extraction, feature extraction using DINO Vision Transformer, feature aggregation, retrieval of top-K reference candidates, and copy-edit classification using a transformer encoder.
07:36:06 — Lee Jiahe Steven: Proposed Approach
- The speaker explains the feature aggregation process, which involves element-wise multiplication of attention scores and local descriptors, GeM pooling and whitening, and concatenation of global and regional features to form a robust descriptor.
08:01:06 — Lee Jiahe Steven: Results
- The speaker presents experimental results showing CE Detector’s superior performance over state-of-the-art methods on ISC and NDEC datasets, demonstrating its effectiveness in detecting copy-edited images, even in challenging scenarios.
08:16:06 — Lee Jiahe Steven: Case study
- The speaker presents case studies illustrating CE Detector’s ability to focus on partially occluded objects for accurate copy detection, while also highlighting challenges when reference images are severely cropped or embedded within the query image.
08:23:06 — Lee Jiahe Steven: Conclusion
- The speaker concludes that CE Detector, an end-to-end Transformer-based solution leveraging global and local features, outperforms state-of-the-art methods in image copy detection and excels at focusing on salient objects.
08:36:06 — Mike Pappas: (untitled)
- The speaker discusses the challenges of defining and enforcing content policies for AI-generated content, especially concerning subjective interpretations of harm and the need for adaptive policies in a rapidly evolving digital landscape.
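
The feature aggregation step described for CE Detector (attention-weighted local descriptors, GeM pooling, concatenation of global and regional features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `regions` grouping of patch indices, and the default exponent `p=3.0` are assumptions chosen for illustration, and the whitening step is omitted.

```python
import math

def gem_pool(descriptors, weights=None, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over local descriptors.

    descriptors: list of equal-length lists, one per image patch.
    weights: optional per-patch attention scores; each descriptor is
             scaled element-wise by its score before pooling.
    """
    if weights is None:
        weights = [1.0] * len(descriptors)
    dim = len(descriptors[0])
    pooled = []
    for c in range(dim):
        # Clamp to eps so the fractional power is well defined.
        total = sum(max(w * d[c], eps) ** p
                    for w, d in zip(weights, descriptors))
        pooled.append((total / len(descriptors)) ** (1.0 / p))
    return pooled

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def aggregate(descriptors, attention, regions, p=3.0):
    """Concatenate one global GeM vector with one GeM vector per region.

    regions: list of lists of patch indices (a hypothetical grouping;
             the talk summary does not say how regions are defined).
    """
    parts = [l2_normalize(gem_pool(descriptors, attention, p))]
    for idx in regions:
        local = gem_pool([descriptors[i] for i in idx],
                         [attention[i] for i in idx], p)
        parts.append(l2_normalize(local))
    return [v for part in parts for v in part]
```

With `p=1` GeM pooling reduces to average pooling, and as `p` grows it approaches max pooling, which is why it is a common middle ground for retrieval descriptors.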

Key Takeaways

The rapid growth of AI-generated content, particularly multimodal content, necessitates advanced and ethical content moderation strategies.
AI-powered tools can be developed to refine content moderation models and mitigate the emotional impact of disturbing imagery on human moderators.
Human moderators play an indispensable role in online safety, especially in handling ambiguous content and where algorithms fail, requiring support for their wellness and efficient workflows.
Emerging technologies like Generative AI and Virtual Reality introduce new and complex moderation challenges that demand innovative solutions and a focus on human well-being.
Content moderation faces significant challenges related to scale, human judgment under stress, subjective judgments, and mental well-being of moderators.
A proposed regulatory content moderation system aims to convert policies into actionable rules, remove only harmful portions of content, and provide clear reasoning for moderation decisions.
Counterfactual Subobject Explanations can identify minimal regions in an image to obfuscate, transforming unsafe images into safe ones while preserving overall integrity.
Conditional Vision-Language Models can be used to pre-filter or identify harmful content, reducing moderators’ exposure and providing detailed explanations for moderation actions without full model retraining.
Moderating ephemeral voice chat requires adaptive strategies that prioritize recall over precision due to the high cost of processing and the tendency for harms to go unreported.
Understanding emotional rapport and relationship dynamics is crucial for effective voice chat moderation, as harms in these contexts can be deeply personal and exploit established bonds.
Large-scale multimodal datasets, often scraped from the web, contain significant biases (e.g., misogyny, pornography, stereotypes) that generative AI models inevitably learn and reflect.
Red teaming and benchmarks like I2P are essential for proactively identifying and mitigating the generation of unsafe and biased content in text-to-image models, allowing for standardized evaluation of model safety and mitigation strategies.
AI models, particularly text-to-image generators, can inadvertently perpetuate harmful stereotypes and biases, affecting how people are portrayed and who is represented.
Google’s AI Principles and the concept of harm amplification (e.g., oversexualization) are crucial frameworks for addressing these biases, requiring robust measurement and mitigation strategies.
Multimodal approaches combining acoustic, visual, and lexical features can be effective in tasks like deception detection, even with challenges such as noisy audio and high word error rates in ASR transcripts.
The development of tools like the Monk Skin Tone Scale and the PATHS model aims to create more inclusive and representative AI systems by aligning computational analysis with human perception and preferences across diverse populations.
Evaluating AI models for diversity and fairness requires careful consideration of text queries and prompts to avoid misaligning results and reinforcing harmful stereotypes.
Models should be designed to politely refuse to answer ungrounded VQA queries (where visual information is insufficient) to prevent hallucination and maintain trustworthiness.
End-to-end vision transformer approaches, like CE Detector, show promise in image copy detection by effectively leveraging both global and local features to identify edited images, even with heavy manipulations.
The development of AI content policies needs to balance the need for consistency across diverse content with the subjective and evolving nature of what constitutes ‘harm,’ requiring continuous adaptation and robust evaluation methods.
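
The "recall over precision" takeaway for voice chat moderation can be made concrete with a small sketch of threshold selection. It assumes a hypothetical classifier that emits a harm score per clip and a labeled validation set; the function name and the target value are illustrative and do not come from any speaker's system.

```python
def threshold_for_recall(scores, labels, target_recall=0.95):
    """Return (threshold, recall, precision) for the highest score
    threshold whose recall on the labeled data meets the target.

    scores: harm scores, higher means more likely harmful.
    labels: 1 for harmful, 0 for benign.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one harmful example")
    # Walk thresholds from strict to lenient; stop at the first one
    # that catches enough of the harmful clips.
    for t in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        recall = sum(flagged) / positives
        if recall >= target_recall:
            precision = sum(flagged) / len(flagged)
            return t, recall, precision
    raise ValueError("target_recall must be at most 1.0")


# Usage: demanding full recall pushes the threshold down and
# accepts more false positives for later review.
t, recall, precision = threshold_for_recall(
    [0.9, 0.8, 0.6, 0.4, 0.2], [1, 0, 1, 1, 0], target_recall=1.0)
```

Fixing the recall target first and reading off the resulting precision mirrors the priority stated in the takeaway: missed harms are treated as costlier than extra items sent to review.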

Methods / Models / Datasets Mentioned

AblationCAM
Adversarial Nibbler
AltDiffusion
Bayesian Adaptive Superpixel Segmentation
CE Detector
CM-Refinery
Common Crawl
ConditionalBLIP
ConditionalVLM
Counterfactual Subobject Explanations
DALL-E
DINO Vision Transformer
Deepfloyd-IF
DiffPD
EfficientNet-b1
Fisher Vector Encoding
FullGrad
GLIDE
GeM Pooling
Gemini 1.0 Ultra
Gemini 1.5 Flash
Gemini 1.5 Pro
Google Cloud ASR
Grad-CAM
Grad-CAM++
I2P Benchmark
ISC dataset
Imagen
InstructBLIP
LAION-5B
LIWC
Lexica
MAAM (Media Asset Annotation and Management)
Midjourney
Monk Skin Tone Scale
Multi-Head Cross-Attention
Multiffusion
NDEC dataset
Negative Prompting
OFA-Large
Open AI DALL-E
OpenSMILE
PANAS scale
PATHS model
POS tagging
Paella
Point-E
Progressive Attentional Manifold Alignment (PAMA) algorithm
Prompting4Debugging
RLHF
Random Forest
SD 1.4
SD 1.5
SD 2.0
SD 2.1
SD Cutesexyrobots
SD Dreamlike Photoreal
SD Epic Diffusion
SEGA
SFT
Safe Latent Diffusion
Self-Attention
Stable Diffusion
Statistical Triage
Transformer Encoder
Unigram
Word2Vec
XGrad-CAM
mPLUG

Topics

AI Filters for Content Mitigation · AI Safety · AI assistance · AI bias · AI ethics · AI-Generated Content · Bias mitigation · Content Moderation · Content Safeguarding · Dataset Bias · Deception detection · Disinformation and Workflow Challenges · Emerging Technologies in Moderation · Ephemeral Chat · Fairness · Generative AI · Harm Detection · Harm amplification · Holocaust denial · Human-in-the-Loop Moderation · Image copy detection · Image generation · Moderator Wellness and Support · Multimodal AI · Multimodal Content Moderation · Red Teaming · Reddit moderation · Skin tone scale · Stereotypes · Text-to-Image Generation · Text-to-image models · Transformer models · Visual Content Safety · Visual Question Answering (VQA) · Voice Chat · content obfuscation · counterfactual explanations · racism · vision-language models · volunteer moderators
