Black-box Adversarial Attacks on Vision Foundation Models

Event: CVPR 2024 Workshop Challenge · Duration: 272 min

Abstract

This video presents the results and winning solutions of the CVPR 2024 Workshop Challenge on “Black-box Adversarial Attacks on Vision Foundation Models”. The challenge focused on exploring vulnerabilities of large vision-language models (LVLMs) in autonomous driving scenarios, involving tasks like color judgment, image classification, and object counting. Three top-performing teams shared their innovative attack methodologies, including typography-based, gradient-based, and precision-guided adversarial attacks, highlighting the importance of robustness in AI systems for safety-critical applications. The competition attracted over 100 teams and 160 competitors, demonstrating significant interest in advancing adversarial machine learning research.

This segment features three talks on the safety and robustness of AI models. The first talk introduces DecodingTrust, a platform for comprehensive evaluation of LLM trustworthiness across various safety perspectives, highlighting the challenges of adversarial attacks and the need for regulation-based safety. The second talk delves into adversarial attacks on aligned language models, demonstrating how to bypass safety mechanisms in both open-source and black-box LLMs and discussing the transferability of these attacks. The final talk addresses the broader societal concerns and regulatory landscape of generative AI, defining safety and robustness, and presenting methods for preventing harmful content generation and detecting AI-generated content through techniques like watermarking and prompt analysis.

This segment features several talks on adversarial attacks and the robustness of AI models, particularly in the context of foundation models and large language models (LLMs). Speakers discuss methods for detecting and attributing AI-generated content using watermarks, present novel attack frameworks combining typographic and gradient-based techniques, and explore the impact of common perturbations on LLM robustness. The segment also includes a retrospective on the past decade of adversarial examples, highlighting the shift in attack methodologies and the increasing importance of responsible disclosure in the era of powerful generative AI.

This segment emphasizes the critical role of data in achieving reliable generalization for AI models, especially when transitioning from research to real-world deployment. It introduces the concept of ‘effective robustness’ to quantify generalization performance under distribution shifts, using ImageNet and its derived datasets as a case study. The talk highlights that CLIP’s significant robustness gains are primarily attributable to its diverse training data distribution, rather than language supervision directly promoting robustness. The DataComp project is presented as a novel benchmark and infrastructure designed to foster collaborative research on multimodal dataset creation and filtering, aiming to improve data quality and diversity for better model generalization, including an extension to large language models with DataComp-LM.
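
As a rough illustration of the ‘effective robustness’ idea mentioned above: it is commonly computed as a model's accuracy under distribution shift minus the accuracy that a baseline trend, fitted to standard models, would predict for it, with the fit done on logit-transformed accuracies. The sketch below assumes that definition. The function names and all accuracy values are hypothetical placeholders, not code or results from the talk.

```python
import math

def logit(p):
    """Map an accuracy in (0, 1) to logit space."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_baseline(id_accs, ood_accs):
    """Least-squares line through (logit(id_acc), logit(ood_acc)) of baseline models."""
    xs = [logit(a) for a in id_accs]
    ys = [logit(a) for a in ood_accs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def effective_robustness(id_acc, ood_acc, slope, intercept):
    """Shifted-data accuracy minus the accuracy the baseline trend predicts."""
    predicted_ood = sigmoid(slope * logit(id_acc) + intercept)
    return ood_acc - predicted_ood

# Hypothetical numbers: three baseline models, then one candidate model.
slope, intercept = fit_baseline([0.60, 0.70, 0.80], [0.45, 0.55, 0.67])
print(effective_robustness(0.75, 0.70, slope, intercept))
```

A positive value means the model sits above the baseline trend, i.e. it is more accurate under shift than its in-distribution accuracy alone would suggest.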

Speakers

Hongda Chen — Xi’an Jiaotong University
Nhat Chung — A*STAR
Jiyuan Fu — Fudan University
Bo Li — University of Illinois Urbana-Champaign
Zico Kolter — Carnegie Mellon University
Neil Gong — Duke University
Florian Tramèr — ETH Zurich
Eric Wallace — OpenAI
Ludwig Schmidt — University of Washington, AI2, Stanford

Talks (10)

00:06:18 — Hongda Chen: A Cooperation of Typography and Gradient-Based Attacks Against LVLMs in Autonomous Driving
- This talk details a cooperative approach combining typography and gradient-based attacks to exploit vulnerabilities in large vision-language models for autonomous driving scenarios.
00:12:49 — Nhat Chung: Typographic Attacks in Target Counting and Recognition
- This talk presents typographic attack methods, including image-level, patch-level, and multi-task compositions, to influence foundation models in counting and recognition tasks within autonomous driving.
00:17:20 — Jiyuan Fu: PG-Attack: A Precision-Guided Adversarial Attack Framework
- This talk introduces PG-Attack, a three-phase framework combining modality expansion, precision mask perturbation, and deceptive text patch attacks to achieve precise and effective adversarial results against vision-language models.
01:07:59 — Bo Li: DecodingTrust: Comprehensive Safety and Trustworthiness Evaluation Platform for LLMs
- Introduces DecodingTrust, a platform for evaluating LLM trustworthiness across various safety perspectives, highlighting adversarial robustness, privacy, and regulation-based safety categories.
02:01:59 — Zico Kolter: Adversarial Attacks on Aligned Language Models
- Demonstrates how to bypass LLM safety filters using adversarial attacks, discusses the history of adversarial robustness, and shows attack transferability from open-source to black-box models.
02:15:58 — Neil Gong: Image Watermarks - An Example (HiDDeN)
- Discusses image watermarks, their components (watermark, encoder, decoder), and their application in detecting and attributing AI-generated images, along with prompt injection attacks and LLM robustness.
02:47:58 — Florian Tramèr: Can we learn anything from the past decade of adversarial examples?
- Reflects on the evolution of adversarial examples over the past decade, highlighting what has and hasn’t changed, and the new challenges posed by foundation models.
03:01:59 — Neil Gong: Safe and Robust Generative AI
- Addresses societal concerns and regulatory landscapes of generative AI, defining safety and robustness, and presenting methods for preventing harmful content and detecting AI-generated content.
03:22:58 — Eric Wallace: Making “GPT-Next” Trustworthy
- Focuses on strategies and challenges in making the next generation of large language models (LLMs) more trustworthy, robust, and private.
03:23:57 — Ludwig Schmidt: A data-centric view on reliable generalization
- This talk explores the importance of data-centric approaches for achieving reliable generalization in AI models, focusing on distribution shifts, robustness, and the role of training data quality and diversity, particularly in the context of CLIP and large language models.
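
The watermark components named in the image watermarking talk above (watermark, encoder, decoder) suggest a simple decision step, sketched here under stated assumptions. In HiDDeN-style schemes the encoder and decoder are learned networks; this sketch skips them and assumes the decoded bits are already available as a list. The function names, the threshold `tau`, and the example bitstrings are hypothetical, and only a one-sided threshold on bitwise accuracy is shown.

```python
def bitwise_accuracy(decoded, watermark):
    """Fraction of positions where the decoded bits match a reference watermark."""
    if len(decoded) != len(watermark):
        raise ValueError("bitstrings must have the same length")
    matches = sum(d == w for d, w in zip(decoded, watermark))
    return matches / len(watermark)

def detect(decoded, watermark, tau=0.9):
    """Flag an image as watermarked if bitwise accuracy reaches the threshold."""
    return bitwise_accuracy(decoded, watermark) >= tau

def attribute(decoded, user_watermarks, tau=0.9):
    """Return the user whose watermark matches best, or None if no match reaches tau."""
    best_user, best_acc = None, 0.0
    for user, watermark in user_watermarks.items():
        acc = bitwise_accuracy(decoded, watermark)
        if acc > best_acc:
            best_user, best_acc = user, acc
    return best_user if best_acc >= tau else None

# Hypothetical per-user watermarks and a decoded bitstring with one flipped bit.
users = {
    "alice": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
    "bob":   [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
}
decoded_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(attribute(decoded_bits, users))
```

The threshold trades false positives against robustness to image edits: a lower `tau` tolerates more flipped bits but makes it easier for an unwatermarked image to be flagged.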

Key Takeaways

Foundation models, despite their capabilities, are vulnerable to adversarial attacks, posing significant risks in safety-critical applications like autonomous driving.
Novel attack strategies, including typography-based and gradient-based methods, can effectively mislead LVLMs in tasks such as color judgment, image classification, and object counting.
The challenge fostered research into both white-box and black-box attack scenarios, emphasizing the need for transferable and robust attack methods.
Developing robust defense mechanisms and understanding the vulnerabilities of AI models is crucial for building more reliable and trustworthy AI systems.
LLMs exhibit varying levels of trustworthiness across different safety perspectives, and current models, even advanced ones, are not immune to vulnerabilities.
Adversarial attacks can effectively bypass safety alignments in LLMs, and these attacks demonstrate transferability across different model architectures and types (open-source to black-box).
The increasing integration of LLMs into larger systems necessitates robust safety measures, as adversarial interactions can pose significant security vulnerabilities.
Novel techniques like watermarking and gradient-based safety filters are being developed to detect AI-generated content and enhance the robustness of generative AI models against malicious prompts.
Image watermarks can be used for user-aware detection and attribution of AI-generated images, but white-box attacks can easily break them.
Prompt injection attacks are a pervasive threat to LLMs due to their instruction-following nature and the inseparability of instruction and data in prompts.
Adversarial attacks are evolving from simple image misclassification to more complex multimodal and prompt-based attacks against foundation models.
Responsible disclosure and robust evaluation methodologies are crucial for addressing the security and trustworthiness challenges posed by advanced AI systems.
Data-centric approaches are crucial for improving the reliability and generalization of AI models, especially when facing real-world distribution shifts.
The diversity and quality of training data distribution are key factors for achieving robustness, as demonstrated by CLIP’s performance on various out-of-distribution datasets.
The DataComp benchmark provides a collaborative framework and infrastructure for developing and evaluating multimodal datasets, aiming to systematically improve data quality and diversity.
Language supervision indirectly enhances robustness by facilitating the collection of more diverse and flexible training data, which is a significant advantage over rigid class structures.
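
To make the notion of a gradient-based attack from the takeaways above concrete, here is a minimal sketch of a single signed-gradient step (in the style of FGSM) against a toy logistic-regression classifier. This is a swapped-in illustration, not the method used by any challenge team. In a black-box setting the gradient would have to come from a surrogate model rather than the target. The weights, input, and `eps` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(g):
    return (g > 0) - (g < 0)

def signed_gradient_step(w, b, x, y, eps):
    """Move each input feature by eps in the direction that increases the loss for label y."""
    p = predict(w, b, x)
    # For cross-entropy loss on logistic regression, d(loss)/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    # Clip to [0, 1] so the result stays a valid normalized pixel vector.
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

# Hypothetical model and input with true label 1.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.2, 0.5], 1
x_adv = signed_gradient_step(w, b, x, y, eps=0.3)
print(predict(w, b, x), predict(w, b, x_adv))
```

The perturbation is bounded by `eps` per feature, which is what keeps such changes small while still moving the prediction across the decision boundary in this toy case.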

Methods / Models / Datasets Mentioned

ALBEF
Agent-DecodingTrust
AlexNet
AutoDAN
AutoPrompt
BERT
BLIP
Backdoor Attack
BackdoorAlign
CIFAR-10-C
CLIP
CMI-Attack
CNNs
Camelyon17-WILDS
ChatGPT
Claude-1
Claude-2
Concept Erasure
Conceptual Captions
DALL-E
DCLM-Baseline
DFN
DPO
DTP-Attack
DataComp
DataComp-1B
DataComp-LM
DecodingTrust
DenseNet
DiffPure
EfficientNet-B7
FMoW-WILDS
Flamingo
GPT-2
GPT-3
GPT-4
GPT-4-Turbo
GPT-4V
GradSafe
Greedy Coordinate Gradient (GCG)
HarmBench
HiDDeN
HotFlip
ImageNet
ImageNet-21k
ImageNet-Adversarial
ImageNet-R (ImageNet-Rendition)
ImageNet-Sketch
ImageNetV2
Imp-v1-3b
Inception
Instagram 1B
JFT-300M
LAION-2B
LAION-5B
Llama 3 8B
Llama Guard
Llama-2 (Chat 7B)
MMD-DecodingTrust
Mask Patch
Mistral-7B-v0.3
NASNet
OAI-WIT-400M
ObjectNet
OpenReview
PG-Attack
PaLM-2 (Bard)
Qwen-vl-7b
RLHF
RedCaps
RedCaps12m
ResNeXt
ResNet
ShutterStock15M
Shutterstock
SneakyPrompt
Stable Diffusion
Stable Signature
StegaStamp
TDC
Tree-ring
VGG
ViT
Vicuna-7B
Virtue AI
WISE-FT
WIT
WIT12m
YFCC
YFCC15M
YOLOv8
iWildCam-WILDS

Topics

AI Regulation · AI Safety · AI Trustworthiness · Adversarial Attacks · Autonomous Driving · Black-box Attacks · CLIP · Data-centric AI · DataComp · Distribution Shifts · Foundation Models · Generative AI Robustness · Gradient-Based Attacks · Harmful Content Detection · Image Watermarking · ImageNet · LLM Safety · Large Language Models · Multimodal Models · Prompt Injection · Reliable Generalization · Responsible Disclosure · Robustness · Safety Alignment · Typographic Attacks · Vision Foundation Models (VFMs) · Watermarking
