Black-box Adversarial Attacks on Vision Foundation Models

Event: CVPR 2024 Workshop Challenge · Duration: 272 min · ▶ Watch on YouTube

Abstract

This video presents the results and winning solutions of the CVPR 2024 Workshop Challenge on “Black-box Adversarial Attacks on Vision Foundation Models”. The challenge focused on exploring vulnerabilities of large vision-language models (LVLMs) in autonomous driving scenarios, involving tasks like color judgment, image classification, and object counting. Three top-performing teams shared their innovative attack methodologies, including typography-based, gradient-based, and precision-guided adversarial attacks, highlighting the importance of robustness in AI systems for safety-critical applications. The competition attracted over 100 teams and 160 competitors, demonstrating significant interest in advancing adversarial machine learning research. This segment features three talks on the safety and robustness of AI models. The first talk introduces DecodingTrust, a platform for comprehensive evaluation of LLM trustworthiness across various safety perspectives, highlighting the challenges of adversarial attacks and the need for regulation-based safety. The second talk delves into adversarial attacks on aligned language models, demonstrating how to bypass safety mechanisms in both open-source and black-box LLMs and discussing the transferability of these attacks. The final talk addresses the broader societal concerns and regulatory landscape of generative AI, defining safety and robustness, and presenting methods for preventing harmful content generation and detecting AI-generated content through techniques like watermarking and prompt analysis. This segment features several talks on adversarial attacks and the robustness of AI models, particularly in the context of foundation models and large language models (LLMs). Speakers discuss methods for detecting and attributing AI-generated content using watermarks, present novel attack frameworks combining typographic and gradient-based techniques, and explore the impact of common perturbations on LLM robustness. The segment also includes a retrospective on the past decade of adversarial examples, highlighting the shift in attack methodologies and the increasing importance of responsible disclosure in the era of powerful generative AI. This segment emphasizes the critical role of data in achieving reliable generalization for AI models, especially when transitioning from research to real-world deployment. It introduces the concept of ‘effective robustness’ to quantify generalization performance under distribution shifts, using ImageNet and its derived datasets as a case study. The talk highlights that CLIP’s significant robustness gains are primarily attributable to its diverse training data distribution, rather than language supervision directly promoting robustness. The DataComp project is presented as a novel benchmark and infrastructure designed to foster collaborative research on multimodal dataset creation and filtering, aiming to improve data quality and diversity for better model generalization, including an extension to large language models with DataComp-LM.

Speakers

  • Hongda Chen — Xi’an Jiaotong University
  • Nhat Chung — A*STAR
  • Jiyuan Fu — Fudan University
  • Bo Li — University of Illinois Urbana-Champaign
  • Zico Kolter — Carnegie Mellon University
  • Neil Gong — Duke University
  • Florian Tramèr — ETH Zurich
  • Eric Wallace — OpenAI
  • Ludwig Schmidt — University of Washington, AI2, Stanford

Talks (10)

  • 00:06:18Hongda Chen: A Cooperation of Typography and Gradient-Based Attacks Against LVLMs in Autonomous Driving
    • This talk details a cooperative approach combining typography and gradient-based attacks to exploit vulnerabilities in large vision-language models for autonomous driving scenarios.
  • 00:12:49Nhat Chung: Typographic Attacks in Target Counting and Recognition
    • This talk presents typographic attack methods, including image-level, patch-level, and multi-task compositions, to influence foundation models in counting and recognition tasks within autonomous driving.
  • 00:17:20Jiyuan Fu: PG-Attack: A Precision-Guided Adversarial Attack Framework
    • This talk introduces PG-Attack, a three-phase framework combining modality expansion, precision mask perturbation, and deceptive text patch attacks to achieve precise and effective adversarial results against vision-language models.
  • 01:07:59Bo Li: DecodingTrust: Comprehensive Safety and Trustworthiness Evaluation Platform for LLMs
    • Introduces DecodingTrust, a platform for evaluating LLM trustworthiness across various safety perspectives, highlighting adversarial robustness, privacy, and regulation-based safety categories.
  • 02:01:59Zico Kolter: Adversarial Attacks on Aligned Language Models
    • Demonstrates how to bypass LLM safety filters using adversarial attacks, discusses the history of adversarial robustness, and shows attack transferability from open-source to black-box models.
  • 02:15:58Neil Gong: Image Watermarks - An Example (HiDDeN)
    • Discusses image watermarks, their components (watermark, encoder, decoder), and their application in detecting and attributing AI-generated images, along with prompt injection attacks and LLM robustness.
  • 02:47:58Florian Tramèr: Can we learn anything from the past decade of adversarial examples?
    • Reflects on the evolution of adversarial examples over the past decade, highlighting what has and hasn’t changed, and the new challenges posed by foundation models.
  • 03:01:59Neil Gong: Safe and Robust Generative AI
    • Addresses societal concerns and regulatory landscapes of generative AI, defining safety and robustness, and presenting methods for preventing harmful content and detecting AI-generated content.
  • 03:22:58Eric Wallace: Making “GPT-Next” Trustworthy
    • Focuses on strategies and challenges in making the next generation of large language models (LLMs) more trustworthy, robust, and private.
  • 03:23:57Ludwig Schmidt: A data-centric view on reliable generalization
    • This talk explores the importance of data-centric approaches for achieving reliable generalization in AI models, focusing on distribution shifts, robustness, and the role of training data quality and diversity, particularly in the context of CLIP and large language models.

Key Takeaways

  • Foundation models, despite their capabilities, are vulnerable to adversarial attacks, posing significant risks in safety-critical applications like autonomous driving.
  • Novel attack strategies, including typography-based and gradient-based methods, can effectively mislead LVLMs in tasks such as color judgment, image classification, and object counting.
  • The challenge fostered research into both white-box and black-box attack scenarios, emphasizing the need for transferable and robust attack methods.
  • Developing robust defense mechanisms and understanding the vulnerabilities of AI models is crucial for building more reliable and trustworthy AI systems.
  • LLMs exhibit varying levels of trustworthiness across different safety perspectives, and current models, even advanced ones, are not immune to vulnerabilities.
  • Adversarial attacks can effectively bypass safety alignments in LLMs, and these attacks demonstrate transferability across different model architectures and types (open-source to black-box).
  • The increasing integration of LLMs into larger systems necessitates robust safety measures, as adversarial interactions can pose significant security vulnerabilities.
  • Novel techniques like watermarking and gradient-based safety filters are being developed to detect AI-generated content and enhance the robustness of generative AI models against malicious prompts.
  • Image watermarking can be used for user-aware detection and attribution of AI-generated images, but white-box attacks can easily break them.
  • Prompt injection attacks are a pervasive threat to LLMs due to their instruction-following nature and the inseparability of instruction and data in prompts.
  • Adversarial attacks are evolving from simple image misclassification to more complex multimodal and prompt-based attacks against foundation models.
  • Responsible disclosure and robust evaluation methodologies are crucial for addressing the security and trustworthiness challenges posed by advanced AI systems.
  • Data-centric approaches are crucial for improving the reliability and generalization of AI models, especially when facing real-world distribution shifts.
  • The diversity and quality of training data distribution are key factors for achieving robustness, as demonstrated by CLIP’s performance on various out-of-distribution datasets.
  • The DataComp benchmark provides a collaborative framework and infrastructure for developing and evaluating multimodal datasets, aiming to systematically improve data quality and diversity.
  • Language supervision indirectly enhances robustness by facilitating the collection of more diverse and flexible training data, which is a significant advantage over rigid class structures.

Methods / Models / Datasets Mentioned

  • ALBEF
  • Agent-DecodingTrust
  • AlexNet
  • AutoDAN
  • AutoPrompt
  • BERT
  • BLIP
  • Backdoor Attack
  • BackdoorAlign
  • CIFAR-10-C
  • CLIP
  • CMI-Attack
  • CNNs
  • Camelyon17-WILDS
  • ChatGPT
  • Claude-1
  • Claude-2
  • Concept Erasure
  • Conceptual Captions
  • DALL-E
  • DCLM-Baseline
  • DFN
  • DPO
  • DTP-Attack
  • DataComp
  • DataComp-1B
  • DataComp-LM
  • DecodingTrust
  • DenseNet
  • DiffPure
  • EfficientNet-B7
  • FMoW-WILDS
  • Flamingo
  • GCG
  • GPT-2
  • GPT-3
  • GPT-4
  • GPT-4-Turbo
  • GPT-4V
  • GradSafe
  • Greedy Coordinate Gradient (GCG)
  • HarmBench
  • HiDDeN
  • HotFlip
  • ImageNet
  • ImageNet-21k
  • ImageNet-Adversarial
  • ImageNet-R
  • ImageNet-Rendition
  • ImageNet-Sketch
  • ImageNetV2
  • Imp-v1-3b
  • Inception
  • Instagram 1B
  • JFT-300M
  • LAION-2B
  • LAION-5B
  • Llama 3 8B
  • Llama Guard
  • Llama-2 (Chat 7B)
  • MMD-DecodingTrust
  • Mask Patch
  • Mistral-7B-v0.3
  • NASNet
  • OAI-WIT-400M
  • ObjectNet
  • OpenReview
  • PG-Attack
  • PaLM-2 (Bard)
  • Qwen-vl-7b
  • RLHF
  • RedCaps
  • RedCaps12m
  • ResNeXt
  • ResNet
  • ShutterStock15M
  • Shutterstock
  • SneakyPrompt
  • Stable Diffusion
  • Stable Signature
  • StegaStamp
  • TDC
  • Tree-ring
  • VGG
  • ViT
  • ViTs
  • Vicuna-7B
  • Virtue AI
  • WISE-FT
  • WIT
  • WIT12m
  • YFCC
  • YFCC15M
  • YOLOv8
  • iWildCam-WILDS

Topics

AI Safety · AI regulation · AI trustworthiness · Adversarial Attacks · Autonomous Driving · Black-box Attacks · CLIP · Data-centric AI · DataComp · Distribution shifts · Foundation Models · Gradient-Based Attacks · Image Watermarking · ImageNet · LLM safety · Large Language Models · Prompt Injection · Reliable generalization · Responsible Disclosure · Robustness · Safety Alignment · Typographic Attacks · Vision Foundation Models (VFMs) · adversarial attacks · generative AI robustness · harmful content detection · multimodal models · prompt injection · watermarking


Notes

Open for commentary — connections to other work, critiques, follow-up reading.