The 3rd Explainable AI for Computer Vision (XAI4CV) Workshop @ CVPR 2024

Event: CVPR 2024 · Duration: 406 min

Abstract

The opening segment covers the opening remarks and two invited talks from the 3rd Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2024. The workshop organizers introduce themselves and outline the goals, schedule, and logistics for the day. In the first invited talk, “Inherent Interpretability for Deep Learning in Computer Vision,” Prof. Bernt Schiele introduces B-cos Networks for faithful and interpretable explanations in CNNs and Vision Transformers. In the second, “Reasonable Artificial Intelligence,” Prof. Kristian Kersting advocates a blend of explainable and interpretable AI, explores programmatic interpretability, and emphasizes the role of human interaction and safeguards in AI development.

The second segment is a series of short spotlight presentations. Topics include explainable AI in remote sensing, visual explanations for object detectors, explainability in video action recognition, and quantifying consistency in diffusion model image generation. Other presentations cover cross-modal learning, causal interpretation frameworks for CNNs, human-AI interaction for classification, attention-based pooling for interpretability, and explaining privacy decisions in images.

The third segment continues with methods for quantifying diffusion model consistency, enhancing cross-modal embedding interchangeability, visual interpretation frameworks for CNNs, and interactive human-AI collaboration. A key highlight is Tim Miller’s invited talk on human-centered counterfactual explanations, followed by Su-In Lee’s presentation on the application and future directions of XAI in medical image analysis, including auditing AI models for COVID-19 detection and skin cancer diagnosis.

The fourth segment is a further series of spotlight talks. Presentations cover novel metrics for quantifying explainability, interactive tools for interpreting large vision-language models, and approaches to assessing consistency in diffusion models. The segment also covers frameworks for understanding CNN decisions based on causality, evaluating human-AI team performance with interactive explanations, and explaining privacy-related model decisions. It closes with methods for debugging classifiers using diffusion models and a discussion of the broader context of understanding, controlling, and debiasing text-to-image models.

The final segment features two talks on explainable AI and generative models. Prof. Leonid Sigal addresses the challenges of prompt engineering and bias in text-to-image models, proposing a prompt inversion method to understand and control model outputs and a bias mitigation framework called TIBET. Prof. Vineeth N Balasubramanian focuses on the transition from post-hoc to ante-hoc explainability, introducing a causal regularization method (CREDO) that integrates domain priors into neural networks for more reliable and robust explanations, including direct, indirect, and total causal effects.

Speakers

Indu Panigrahi — Princeton
Sunnie S. Y. Kim — Princeton
Vikram Ramaswamy — Princeton
Sukrut Rao — MPI-INF
Lenka Tětková — DTU
Pushkar Shukla — TTIC
Katelyn Morrison — CMU
Stefan Kolek — LMU Munich
Jawad Tayyub — Endress+Hauser
Deepti Ghadiyaram — Runway
Prof. Bernt Schiele — Max Planck Institute for Informatics
Prof. Kristian Kersting — TU Darmstadt
Ivica Obadic — Technical University of Munich
Saeed Kuhi — Munich Center for Machine Learning, LMU Earth
Toshinori Yamauchi — Hitachi, Ltd. Research & Development Group
Shashank Gupta — The University of Texas at Austin
Brinnae Bent — Duke University
Mohammad Reza Taesiri — Princeton University
Felipe Torres — Centrale Marseille, Aix Marseille Univ, CNRS, LIS, France
Myriam Bontounou — Queen Mary University of London
Tai Nguyen
Xiwei Xuan — University of California, Davis and National Taiwan University
Tim Miller — The University of Queensland
Su-In Lee — Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle
Anthony D Rhodes — Intel Labs
Matthew Lyle Olson — Multimodal Cognitive AI, Intel Labs
Ziquan Deng — University of California, Davis
Maximilian Augustin — Tübingen AI Center - University of Tübingen
Prof. Leonid Sigal — University of British Columbia
Prof. Vineeth N Balasubramanian — Indian Institute of Technology, Hyderabad

Talks (24)

00:08:34 — Prof. Bernt Schiele: Inherent Interpretability for Deep Learning in Computer Vision
- This talk introduces B-cos Networks, a novel architecture designed for inherent interpretability in deep learning, emphasizing dynamic linearity for faithfulness and alignment pressure for interpretability, and demonstrates its application to CNNs and Vision Transformers.
00:39:24 — Prof. Kristian Kersting: Reasonable Artificial Intelligence
- This talk advocates for “Reasonable Artificial Intelligence” by blending explainable and interpretable AI, exploring programmatic interpretability through differentiable logic, and highlighting the importance of human-in-the-loop interaction and safeguards in AI development.
01:14:15 — Ivica Obadic: Recent Trends, Challenges, and Limitations of Explainable AI in Remote Sensing
- This talk provides an overview of recent trends, challenges, and limitations of Explainable AI (XAI) in remote sensing, highlighting its increasing usage and the need to address unique properties of remote sensing imagery.
01:21:08 — Saeed Kuhi: Recent Trends, Challenges, and Opportunities in Explainable AI for Remote Sensing
- Summarizes trends such as the increasing use of XAI for critical applications, and challenges such as building interpretable neural networks and handling the unique properties of remote sensing data.
01:36:08 — Toshinori Yamauchi: Spatial Sensitive Grad-CAM++: Improved Visual Explanation for Object Detectors via Weighted Combination of Gradient Map
- Introduces visual explanations for object detectors, highlighting the need for instance-specific heat maps and proposing SSGrad-CAM++ to improve heat map quality by incorporating a weighted combination of gradient maps.
02:10:08 — Shashank Gupta: Exploring Explainability in Video Action Recognition
- Discusses challenges of applying attribution methods to video tasks due to temporal aspects and proposes Video-TCAV as a more robust alternative, also introducing an automated method for concept generation.
02:42:17 — Brinnae Bent: Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation
- This talk introduces a semantic approach to quantifying the consistency of diffusion model image generation using multimodal embedding models and CLIP scores, highlighting variability across models and the impact of fine-tuning.
02:43:14 — Tai Nguyen: Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data (ICLR 2024)
- This talk presents C3, a method to enhance embedding interchangeability across modalities (image, audio, video) by collapsing modality gaps and addressing alignment noise, enabling cross-modal tasks from uni-modal data.
02:44:27 — Xiwei Xuan: SUNY: A Visual Interpretation Framework for Convolutional Neural Networks from a Necessary and Sufficient Perspective
- This talk introduces SUNY, a visual interpretation framework for CNNs that rationalizes model decisions from a causal sufficiency and necessity perspective, outperforming existing saliency map methods.
02:44:57 — Mohammad Reza Taesiri: Allowing humans to interactively guide machines where to look does not always improve human-AI team’s classification accuracy
- This talk investigates whether interactive AI explanations improve human-AI team classification accuracy, finding that while interactivity helps detect AI errors in certain cases, it does not consistently improve overall human decision-making accuracy.
02:45:17 — Felipe Torres: CA-Stream: Attention-based pooling for interpretable image recognition
- This talk presents CA-Stream, an attention-based pooling mechanism that improves interpretability measurements on existing CNN models by introducing a cross-attention stream that updates a class-agnostic representation.
02:45:53 — Tim Miller: Human-centred counterfactual explanations for image classification
- This talk discusses the importance of human-centered counterfactual explanations in AI, drawing insights from social sciences to develop methods that provide actionable and interpretable explanations, particularly for image classification.
02:46:17 — Su-In Lee: Explainable AI for medical image AI: where we are and how to move forward
- This talk provides an overview of explainable AI (XAI) applications in medical image analysis, focusing on methods like SHAP for feature attribution, counterfactual explanations, and concept-based explanations, and discusses their utility in improving clinical outcomes and auditing AI models.
04:03:50 — Anthony D Rhodes: Quantifying Explainability with Multi-Scale Gaussian Mixture Models
- This talk introduces XMGD, a novel explainability comparison metric that uses multi-scale Gaussian Mixture Models to quantify and compare saliency maps, offering robustness against pixel-level changes and dataset size variations.
04:06:20 — Matthew Lyle Olson: LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
- This presentation introduces LVLM-Interpret, an interactive web-based tool for visualizing and analyzing Large Vision-Language Models (LVLMs) through raw attention, relevancy maps, and causal interpretation, helping users understand model behavior and failure cases.
04:08:08 — Speaker not identified: Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
- Introduces C3 (Connect, Collapse, Corrupt), a method enhancing embedding interchangeability by addressing modality gap and alignment noise for cross-modal tasks from uni-modal data.
04:09:05 — Brinnae Bent, PhD: Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation
- This talk presents a Semantic Consistency Score (SCS) using CLIP embeddings to quantify the consistency of diffusion model image generation, enabling reconciliation of creativity and consistency for various applications.
04:12:30 — Ziquan Deng: SUNY: A Visual Interpretation Framework for Convolutional Neural Networks from a Necessary and Sufficient Perspective
- This presentation introduces SUNY, a visual interpretation framework for CNNs based on causal necessity and sufficiency, using N-S Shapley values to quantify feature importance and provide class-discriminative explanations.
04:25:14 — Myriam Bontounou: Explaining models relating objects and privacy
- This presentation introduces a framework to explain privacy model decisions by identifying relevant objects and features using integrated gradients, and proposes person-centric classification strategies to address bias towards people.
04:29:05 — Maximilian Augustin: DiG-IN: Diffusion Guidance for Investigating Networks
- This talk presents DiG-IN, a plug-and-play image generation framework for classifier debugging that uses latent diffusion models to find systematic classifier errors, generate visual counterfactual explanations, and visualize neuron semantics.
04:34:05 — Prof. Leonid Sigal: Understanding, Control and Debiasing of Text-to-Image Models
- This talk discusses understanding, controlling, and debiasing text-to-image models, highlighting the importance of addressing biases in generated content.
05:08:08 — Speaker not identified: A Visual Interpretation Framework for Convolutional Neural Networks from a Necessary and Sufficient Perspective
- Presents SUNY, a visual interpretation framework for CNNs that rationalizes CNNs from a causal sufficiency and necessity perspective, providing class-discriminative explanations.
06:09:18 — Prof. Vineeth N Balasubramanian: Moving beyond an Afterthought: Toward Learning via Explanations
- This talk advocates for moving beyond post-hoc explainability to ante-hoc methods, integrating causal domain priors into deep learning models for more reliable and robust explanations.
09:08:08 — Speaker not identified: DiG-IN: Diffusion Guidance for Investigating Networks
- Introduces DiG-IN, a plug-and-play image generation framework for classifier debugging, optimizing a classifier-derived objective to generate images and visualize neuron activations.
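
The "dynamic linearity" mentioned in the B-cos Networks talk above can be illustrated with a minimal sketch of a single B-cos unit. It assumes the commonly cited form of the transform, where the output is the input norm times the signed B-th power of the cosine between the input and a unit-norm weight. The function name `bcos_unit` and the toy vectors are hypothetical and not taken from the talk, and the sketch assumes `b >= 1`.

```python
import math


def bcos_unit(x, w, b=2.0):
    """Return (output, dynamic_weight) for one B-cos unit (illustrative only).

    Assumed form: out = ||x|| * |cos(x, w)|**b * sign(cos(x, w)),
    computed as |cos|**(b - 1) * (w_hat . x) with w_hat = w / ||w||.
    """
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    w_hat = [wi / w_norm for wi in w]
    if x_norm == 0.0:
        # A zero input has no direction; the unit outputs zero.
        return 0.0, [0.0] * len(x)
    linear = sum(wh * xi for wh, xi in zip(w_hat, x))  # w_hat . x
    cos = linear / x_norm
    scale = abs(cos) ** (b - 1.0)
    # Input-dependent weight: the output is exactly dynamic_weight . x,
    # so the weight itself summarizes how the unit used this input.
    dynamic_weight = [scale * wh for wh in w_hat]
    return scale * linear, dynamic_weight


out, dyn_w = bcos_unit([3.0, 4.0], [1.0, 0.0], b=2.0)
recomputed = sum(dw * xi for dw, xi in zip(dyn_w, [3.0, 4.0]))
```

With `b=1` the unit reduces to an ordinary linear map with a normalized weight. Larger `b` shrinks the output when the input is poorly aligned with the weight, which is one way to read the "alignment pressure" named in the talk summary.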

Key Takeaways

Inherent interpretability in deep learning can be achieved through architectural design, providing faithful and interpretable explanations without relying on post-hoc methods.
The future of AI lies in combining learning and reasoning, leading to composable, open, scalable, and understandable systems, with a strong emphasis on human interaction and ethical alignment.
Explainable AI is increasingly crucial in remote sensing for critical applications like flood detection and agricultural monitoring, but faces challenges due to the unique properties of remote sensing imagery.
Developing robust methods for correcting AI explanations and incorporating human expertise into the learning loop are vital for building trustworthy and aligned AI systems.
XAI approaches are being adapted to address the unique properties of remote sensing imagery, including multi/hyperspectral data and temporal resolution.
Temporal aspects pose significant challenges for attribution methods in video tasks, necessitating robust alternatives and instance-specific heat maps for object detection.
Quantifying the consistency of diffusion model image generation is a growing area, with methods like CLIP-based scores used to evaluate model reliability.
Novel frameworks like C3 and SUNY aim to enhance cross-modal learning and provide causal interpretations for CNNs, respectively, while human-AI collaboration in classification tasks requires careful evaluation.
Explainable AI is being applied to sensitive areas like privacy, with methods developed to explain model decisions based on detected objects.
Explainable AI is crucial for understanding and trusting complex models, especially in high-stakes domains like medicine.
Counterfactual explanations, which explain why an event occurred rather than an alternative, are a powerful tool for human-centered XAI.
Auditing AI models with XAI techniques can reveal hidden biases and “shortcuts” in model reasoning, preventing misapplication in clinical settings.
Foundation models, combined with medical literature, offer a promising avenue for developing transparent and interpretable AI systems in healthcare.
Robust and quantitative metrics like XMGD and SCS are crucial for comparing and evaluating different XAI methods, especially in complex domains like diffusion models and deepfake detection.
Interactive tools and frameworks, such as LVLM-Interpret and SUNY, enhance the understanding of complex AI models by providing insights into attention mechanisms, relevancy, causal relationships, and necessary/sufficient features.
Human-AI collaboration with interactive explanations does not always lead to improved decision-making accuracy, highlighting the need for careful design and evaluation of XAI systems in real-world applications.
Novel approaches are being developed to address specific challenges in XAI, including explaining privacy-related model decisions, debugging classifiers using generative models, and understanding/debiasing text-to-image models.
Prompt engineering is a significant challenge in text-to-image generation, requiring iterative refinement and deep understanding of model biases and vocabulary.
Bias in text-to-image models can stem from real-world biases, incidental correlations, and training data/procedures, leading to misrepresentation and requiring robust mitigation strategies.
Ante-hoc explainability, which integrates explanations directly into the model training process, offers advantages in reliability and robustness compared to traditional post-hoc methods.
Causal regularization (CREDO) can effectively incorporate domain priors and causal relationships into neural networks, allowing models to learn and maintain specific direct and indirect causal effects, leading to more interpretable and trustworthy AI systems.
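
The CLIP-based consistency scoring mentioned in the takeaways above can be sketched as follows. This is a minimal illustration under a stated assumption: consistency is taken to be the mean pairwise cosine similarity between embeddings of images generated from the same prompt. The exact definition used in the talk may differ. The function names and the toy 2-D vectors are hypothetical; in practice the embeddings would come from a multimodal encoder such as CLIP.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_consistency(embeddings):
    """Mean cosine similarity over all unordered pairs of embeddings.

    Assumption: each embedding represents one image generated from the
    same prompt, so a higher mean indicates more consistent generations.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins for image embeddings (real ones would be high-dimensional).
tight_cluster = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]
spread_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score_tight = pairwise_consistency(tight_cluster)
score_spread = pairwise_consistency(spread_out)
```

Averaging over all pairs avoids picking a single reference image, so the score does not depend on generation order. Comparing the score across models or fine-tuned variants on a fixed prompt set is one way such a metric could be used.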

Methods / Models / Datasets Mentioned

ACDE
Ablation-CAM
Attention Heatmaps
Attention networks
B-cos Networks
BERT Text Encoder
Backpropagation
BagNet
C3 (Connect, Collapse, Corrupt)
CA-Stream
CHM-Corr
CLIP
CLIP Text Encoder
CNN
COMPAS
CREDO
Causal Interpretation
CexCNN
Clustering
CoDA-Nets
Concept Bottleneck Memory Models (CBM)
Counterfactuals
DINO
Deep SHAP
DeiSAM
DiG-IN
Differentiable Forward Reasoner
Differentiable Logic
Differentiable loss
Diffusion Models
Embedding Space
Energy Objective
Example-based
Feature Selection
GNN based inference
GPT-3
Gaussian Mixture Models (GMM)
Grad-CAM
Grad-CAM++
Gradio UI
Graph Neural Networks
GraphNex
GroundedSAM
Group-CAM
ILLUME
ITI-GEN
ImageBind
Inception_v3
Integrated Gradients
Joint Training
L1 Objective
LBFGS
LIME
LLaVA
LVLM-Interpret
Large Language Model
Latent Diffusion Model (LDM)
Layer-CAM
LlavaGuard
LoRA
Local Approximation
Logical PPO policy
MEPS
MONET (Medical concept retriever)
Mechanistic Architecture Design (MAD) pipeline
Midjourney
Model Approximation
N-S Shapley Values
NMF (Non-negative Matrix Factorization)
NeSy
Neural PPO policy
Object extractor
Occlusion
PCA (Principal Component Analysis)
PH2P
PPO policy
Perturbation
Pix2Code
PixArt-α
Program Synthesis
Projected Gradient Descent
Proto2Proto
RISE
Relevancy Maps
ResNet18
ResNet50
SCM (Structural Causal Model)
SEEM
SGD
SHAP (SHapley Additive exPlanations)
SSGrad-CAM
SUNY
Safe Diffusion
Scene Graph Generator
Score-CAM
Segmentation Model
Self-Supervised Learning
Semantic Consistency Score (SCS)
Semantic Unifier
SmoothGrad
Sora
Stable Diffusion
Stable Diffusion XL (SDXL)
T5 Text Encoder
TCAV
TIBET
Transformer
Transformer Shapley
Tree SHAP
U-Net
VAE
VGG-11
VGG16
VQA on MiniGPT-V2
Video Swin Transformer (Tiny)
Video-TCAV
XMGD (Explainable Multi-Scale GMM Distance)
YOLO-MLP
YOLOv7

Topics

AI Auditing · Ante-hoc Explainability · Bias Mitigation · Causal Inference · Causal Regularization · Computer Vision · Concept-based Explanations · Convolutional Neural Networks (CNNs) · Counterfactual Explanations · Deep Learning · Diffusion Models · Explainable AI (XAI) · Generative Models · Human-AI Collaboration · Human-AI Interaction · Human-in-the-loop AI · Interpretability · Interpretable AI · Large Vision-Language Models (LVLMs) · Medical Image Analysis · Model Consistency · Multimodal Embeddings · Neural Networks · Neuron Visualization · Object Detection · Privacy · Privacy Models · Prompt Engineering · Remote Sensing · Saliency Maps · Text-to-Image Models · Video Action Recognition · Vision Transformers

Notes

Open for commentary — connections to other work, critiques, follow-up reading.