IEEE CVPR workshop on Fair, Data Efficient and Trusted Computer Vision

Event: CVPR 2024 Workshop · Duration: 421 min

Abstract

This recording collects the talks from the IEEE CVPR 2024 Workshop on Fair, Data Efficient and Trusted Computer Vision. It opens with an introduction to the workshop and its organizers, followed by a presentation on DIA, a diffusion-based inverse network attack on collaborative inference. The keynote speaker then gives a 15-year review of scale learning in image semantics, covering the evolution from ImageNet to CLIP. Subsequent talks introduce Fast-NTK for parameter-efficient machine unlearning and AR-CP, an uncertainty-aware perception system for assisted driving in adverse conditions.

Next, a talk introduces a regularization method that enforces conditional independence for fair representation learning and causal image generation. The core idea is to make a classifier’s predictions (for example, sex classification) invariant to confounding attributes such as skin type. The method uses the Jensen-Shannon divergence to measure, and then minimize, the discrepancy between probability distributions, decoupling the prediction from the sensitive attribute. Because it acts as a regularizer, it can be integrated into existing image encoders to improve fairness and reduce bias.

Four talks follow. The first applies conditionally independent causal representation learning to diffusion autoencoders for bias mitigation in image generation. The second explores how to enable opt-out and incentivize opt-in for text-to-image models, presenting ‘Custom Diffusion’ for concept addition and ‘Concept Ablation’ for concept removal, alongside methods for data attribution. The third examines the origins and mitigation of biases in diffusion models, proposing ‘Distribution-Guided Debiasing’, which achieves fair generation by guiding the denoising process with reference attribute distributions, and analyzes geographical inclusivity. The fourth presents ‘Guided Loss-Increasing (GLI)’, a data augmentation method for efficient machine unlearning that prevents a catastrophic drop in model utility while supporting compliance with privacy regulations.

A later talk introduces Human 3.6M-C (H3.6M-C), a benchmark for evaluating the robustness of 3D human pose estimation (HPE) models on corrupted videos. The benchmark is derived from the Human 3.6M dataset using six types of RGB video corruption operators. The talk proposes two methods: Temporal Additive Gaussian Noise (TAGN), which improves generalization when corruptions are unforeseen (Scenario 1), and Confidence-Aware Convolution (CAC), which improves learning efficiency when corruptions are foreseen (Scenario 2). Experimental results show that both TAGN and CAC reduce MPJPE on H3.6M-C.

The final session covers a range of topics. RLNet targets efficient private inference by reducing ReLU operations while improving robustness. A Dual-Carriageway Framework provides robust and explainable fine-grained visual classification, addressing data challenges in industrial settings. A data-free defense against adversarial attacks on black-box models combines wavelet noise removal with a regenerator network. Fractals are explored as pre-training datasets for anomaly detection, improving model performance. Finally, SkipPLUS improves the interpretability of Vision Transformers by skipping the first few layers during attribution, and a Brain Inspired Vision Transformer (CP-ViT) integrates brain-network organization principles for improved performance and interpretability.
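The Jensen-Shannon regularizer described above can be illustrated with a small pure-Python sketch. Its formulation is our assumption, not necessarily the speakers' exact method: conditional independence of the prediction from the sensitive attribute is checked separately for each true label, and within a label the model's mean predicted class distribution for each attribute group is compared using the equal-weight generalized Jensen-Shannon divergence. The function names (`entropy`, `jensen_shannon`, `ci_jsd_penalty`) and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def jensen_shannon(dists):
    """Equal-weight generalized JSD: H(mean of P_i) - mean of H(P_i), in nats."""
    k = len(dists)
    mixture = [sum(col) / k for col in zip(*dists)]
    return entropy(mixture) - sum(entropy(d) for d in dists) / k


def ci_jsd_penalty(probs, labels, attrs):
    """Hypothetical conditional-independence penalty (an assumed formulation).

    probs: predicted class-probability vectors; labels: true labels y;
    attrs: sensitive-attribute values a (e.g. a skin-type group).
    For each y, the mean prediction of every attribute group is compared with
    JSD; the result is averaged over labels and is 0 when all groups match.
    """
    # cells[y][a] collects the prediction vectors for one (label, attribute) cell
    cells = defaultdict(lambda: defaultdict(list))
    for p, y, a in zip(probs, labels, attrs):
        cells[y][a].append(p)
    if not cells:
        return 0.0
    total = 0.0
    for by_attr in cells.values():
        # mean predicted distribution of each attribute group within this label
        means = [[sum(col) / len(rows) for col in zip(*rows)] for rows in by_attr.values()]
        if len(means) > 1:  # a label observed with only one group contributes nothing
            total += jensen_shannon(means)
    return total / len(cells)


# Toy example: both groups share true label 0 but receive different predictions
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.5, 0.5]]
labels = [0, 0, 0, 0]
attrs = ["light", "light", "dark", "dark"]
print(ci_jsd_penalty(probs, labels, attrs))  # positive, since the groups disagree
```

The penalty is zero exactly when, within every label, all attribute groups receive the same mean prediction. In training, such a term would be computed on differentiable tensors and added to the task loss with a weighting coefficient; that part is left out of this sketch.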
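MPJPE, the error metric used for the H3.6M-C results, averages the Euclidean distance between predicted and ground-truth 3D joint positions over all joints and frames. The sketch below is a plain-Python version that assumes each pose is a list of (x, y, z) tuples in a common unit (Human3.6M results are usually reported in millimetres). Evaluation protocols often first align the root joint, or apply a Procrustes alignment for P-MPJPE; that step is omitted here, and the function name is hypothetical.

```python
import math


def mpjpe(pred, gt):
    """Mean Per Joint Position Error over a sequence of frames.

    pred, gt: lists of frames; each frame is a list of (x, y, z) joint positions.
    No root or Procrustes alignment is applied.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    total, count = 0.0, 0
    for frame_pred, frame_gt in zip(pred, gt):
        if len(frame_pred) != len(frame_gt):
            raise ValueError("frames must have the same number of joints")
        for p, g in zip(frame_pred, frame_gt):
            total += math.dist(p, g)  # Euclidean error of one joint
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


# One frame, two joints: the first is exact, the second is off by a 3-4-5 triangle
pred = [[(0, 0, 0), (3, 4, 0)]]
gt = [[(0, 0, 0), (0, 0, 0)]]
print(mpjpe(pred, gt))  # → 2.5
```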

Speakers

Nalini Ratha — University at Buffalo
Chenghao Li — University of Southern California
Liangliang Cao — Apple
Guihong Li — UT Austin
Achref Doula — Technical University of Darmstadt, Germany
Jensen Hwa — Stanford University
Aditya Lahiri — Stanford University
Qingyu Zhao — Stanford University
Adnan Masood — Stanford University
Babak Salimi — Stanford University
Ehsan Adeli — Stanford University
Richard Zhang — Adobe Research, SF
R. Venkatesh Babu — IISc Bangalore
Dasol Choi — Yonsei University
Trung-Hieu Hoang — University of Illinois at Urbana-Champaign
Phan Nhat Huy — VinUniversity, Ha Noi, Vietnam
Mona Zehni — University of Illinois at Urbana-Champaign
Duc Minh Vo — The University of Tokyo, Japan
Minh N. Do — University of Illinois at Urbana-Champaign
Sreetama Sarkar — University of Southern California, Los Angeles, USA
Souvik Kundu — Intel Labs, San Diego, USA
Peter A. Beerel — University of Southern California, Los Angeles, USA
Zheming Zuo — Newcastle University, UK
Joseph Smith — Newcastle University, UK
Jonathan Stonehouse — Procter and Gamble, UK
Boguslaw Obara — Newcastle University, UK
Gaurav Kumar Nayak — Indian Institute of Science, Bangalore, India
Inder Khatri — Indian Institute of Science, Bangalore, India
Ruchit Rawal — Indian Institute of Science, Bangalore, India
Anirban Chakraborty — Indian Institute of Science, Bangalore, India
Cynthia Ugwu — Free University of Bozen-Bolzano
Sofia Casarin — Free University of Bozen-Bolzano
Oswald Lanz — Free University of Bozen-Bolzano
Feraidoon Mehri — Sharif University of Technology
Lu Zhang — Indiana University Indianapolis

Talks (16)

00:05:09 — Chenghao Li: DIA: Diffusion based Inverse Network Attack on Collaborative Inference
- This talk introduces DIA, a diffusion-based inverse network attack designed to reconstruct private client data from intermediate features in collaborative inference systems, highlighting vulnerabilities in Vision Transformers and the need for improved privacy-preserving technologies.
00:22:35 — Liangliang Cao: Scale Learning in Image Semantics: A 15-Year Review
- This keynote provides a 15-year review of scale learning in image semantics, tracing the evolution from ImageNet to CLIP, discussing advancements in datasets, models, and loss functions, and exploring the challenges and opportunities in leveraging computation and data for more efficient and robust models.
00:53:15 — Guihong Li: Fast-NTK: Parameter-Efficient Unlearning for Large-Scale Models
- This talk introduces Fast-NTK, a novel algorithm for efficient machine unlearning in large-scale neural networks. It leverages parameter-efficient fine-tuning (PEFT) by focusing on key parameters such as Batch Normalization layers and visual prompts, offering better scalability than traditional retraining-based approaches while achieving comparable accuracy.
01:07:30 — Achref Doula: AR-CP: Uncertainty-Aware Perception in Adverse Conditions with Conformal Prediction and Augmented Reality For Assisted Driving
- This presentation introduces AR-CP, an approach to driver assistance in adverse weather conditions that combines uncertainty-aware classification using Conformal Prediction (CP) with Augmented Reality (AR) to reduce driver confusion and improve perception. The method reduces prediction set size using a class taxonomy, and user studies show significant improvements in accuracy and situation awareness along with reduced cognitive load.
01:24:12 — Jensen Hwa: Enforcing Conditional Independence for Fair Representation Learning and Causal Image Generation
- This talk introduces a regularization method to enforce conditional independence for fair representation learning and causal image generation, focusing on making sex classification invariant to skin type.
02:48:25 — Speaker not listed: From Label Space to Latent Space
- This talk introduces a method for conditionally independent causal representation learning, applying it to diffusion autoencoders for bias mitigation in image generation by controlling attributes like skin type and sex.
02:55:25 — Richard Zhang: Incentivizing Opt-in & Enabling Opt-out for Text-to-Image Models
- This talk explores methods for enabling opt-out and incentivizing opt-in for text-to-image models, introducing ‘Custom Diffusion’ for concept addition, ‘Concept Ablation’ for concept removal, and ‘Attribution by Customization/Unlearning’ for data attribution.
03:14:05 — R. Venkatesh Babu: Uncovering and Addressing Biases in Diffusion Models
- This talk addresses the origins and mitigation of biases in diffusion models, proposing ‘Distribution-Guided Debiasing’ to achieve fair generation by guiding the denoising process with reference attribute distributions and analyzing geographical inclusivity.
03:24:00 — Dasol Choi: Towards Efficient Machine Unlearning with Data Augmentation: Guided Loss-Increasing (GLI) to Prevent the Catastrophic Model Utility Drop
- This talk presents ‘Guided Loss-Increasing (GLI)’, a novel data augmentation method for efficient machine unlearning that prevents catastrophic model utility drop while ensuring compliance with privacy regulations.
04:12:38 — Trung-Hieu Hoang: Improving the Robustness of 3D Human Pose Estimation: A Benchmark and Learning from Noisy Input
- This talk introduces a new benchmark (H3.6M-C) for 3D human pose estimation on corrupted videos and proposes two methods, TAGN and CAC, to improve robustness and learning efficiency under unforeseen and foreseen corruption scenarios, respectively.
05:36:51 — Sreetama Sarkar: RLNet: Robust Linearized Networks for Efficient Private Inference
- This talk introduces RLNet, a class of models designed to improve latency and model performance in private inference by reducing high-latency ReLU operations, while also enhancing robustness against adversarial and naturally perturbed images.
05:42:41 — Zheming Zuo: Robust and Explainable Fine-Grained Visual Classification with Transfer Learning: A Dual-Carriageway Framework
- This presentation introduces a Dual-Carriageway Framework (DCF) for robust and explainable fine-grained visual classification, addressing challenges like varying illuminance, imbalanced data, noisy textures, and camera bias in industrial datasets.
05:50:46 — Gaurav Kumar Nayak: Data-free Defense of Black Box Models Against Adversarial Attacks
- This talk presents a data-free defense mechanism against adversarial attacks on black-box models, utilizing a wavelet noise remover and a regenerator network to reconstruct images and improve robustness without access to original training data.
05:58:46 — Cynthia Ugwu: Fractals as Pre-training Datasets for Anomaly Detection and Localization
- This presentation explores the use of fractals as a novel pre-training dataset for anomaly detection and localization, demonstrating their effectiveness in improving model performance across various benchmark datasets, particularly for memory-based methods.
06:04:46 — Feraidoon Mehri: SkipPLUS: Skip the First Few Layers to Better Explain Vision Transformers
- This presentation introduces SkipPLUS, a novel method for improving the interpretability of Vision Transformers by skipping the first few layers during attribution, demonstrating enhanced class discriminativity and reduced noise in relevance maps compared to existing methods.
06:10:01 — Lu Zhang: Brain Inspired Vision Transformer (ViT)
- This talk introduces Brain Inspired Vision Transformer (CP-ViT), a novel architecture that integrates core-periphery organization principles from brain networks into Vision Transformers, demonstrating improved classification performance and interpretability across various datasets.
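AR-CP builds on conformal prediction, which turns a classifier's scores into a prediction set that contains the true class with probability at least 1 - alpha, assuming the calibration and test data are exchangeable. The sketch below shows standard split conformal prediction with the simple nonconformity score 1 - p(true class). It does not reproduce AR-CP's taxonomy-based reduction of set size, the score used in the talk may differ (APS and RAPS, listed among the methods, are adaptive alternatives), and the function names and toy data are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration with nonconformity score s = 1 - p(true class).

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score, which
    yields marginal coverage of at least 1 - alpha for exchangeable data.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points: every class must be included
    return scores[k - 1]


def prediction_set(probs, qhat):
    """Indices of all classes whose score 1 - p(class) is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


# Toy calibration data: 9 examples whose true class is 0, with decreasing confidence
cal_p = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55]
cal_probs = [[p, (1 - p) / 2, (1 - p) / 2] for p in cal_p]
cal_labels = [0] * len(cal_p)

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
print(prediction_set([0.7, 0.25, 0.05], qhat))  # → [0]
```

With alpha = 0.2, sets built this way contain the true class at least 80% of the time on average over calibration and test draws. Less confident inputs yield larger sets, which is the kind of uncertainty signal an AR display can present to a driver.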

Key Takeaways

Machine unlearning is crucial for addressing data erasure requirements (e.g., GDPR) in large-scale models, and Fast-NTK offers an efficient, scalable solution by leveraging parameter-efficient fine-tuning.
Uncertainty-aware perception systems like AR-CP can significantly improve driver assistance in adverse conditions by providing statistical coverage guarantees for their predictions and reducing cognitive load through augmented reality.
The evolution of image semantics over the past 15 years has been driven by increasingly large datasets and sophisticated models, with a shift towards leveraging computation and data, as highlighted by Rich Sutton’s “bitter lesson.”
Despite the trend towards larger models and datasets, strategic data filtering can lead to more efficient models with less data, as demonstrated by improved CLIP embeddings after removing text regions from training images.
The conditional-independence regularizer enforces conditional independence by minimizing the Jensen-Shannon divergence between the relevant probability distributions, making predictions invariant to confounding attributes.
This regularization technique can be added to any existing image encoder, providing a flexible way to improve fairness in representation learning.
The approach aims to make classifiers robust against biases introduced by sensitive attributes, so that predictions rely on task-relevant features rather than on the sensitive attribute.
By enforcing conditional independence at the latent representation level, the method seeks to achieve a more fundamental and robust form of fairness within the model.
Novel methods like Custom Diffusion and Concept Ablation offer fine-grained control over generative AI models, allowing for efficient concept addition and targeted removal to address creator rights and ethical concerns.
Bias in diffusion models, stemming from real-world data, can be mitigated through techniques like Distribution-Guided Debiasing, which guides the denoising process to achieve fairer and more representative image generation across diverse attributes and geographies.
Machine unlearning is a critical area for privacy compliance, and methods like Guided Loss-Increasing (GLI) aim to make this process efficient while preventing the loss of overall model utility.
Data attribution techniques, such as Attribution by Customization and Unlearning, are being developed to understand the influence of specific training data on generated outputs, providing transparency and accountability in AI models.
A new benchmark (H3.6M-C) is proposed to evaluate 3D HPE models under various video corruptions, simulating real-world ‘in-the-wild’ conditions.
Temporal Additive Gaussian Noise (TAGN) is an effective data augmentation strategy for improving model generalization when corruption types are unforeseen during training.
Confidence-Aware Convolution (CAC) blocks can be integrated into 3D pose lifters to improve learning efficiency when corruption types are known and can be included in the training data.
The proposed methods significantly enhance the robustness of 2D-to-3D pose lifters against different types of video corruptions.
RLNet improves latency and model performance in private inference by reducing high-latency ReLU operations and enhancing robustness against adversarial and naturally perturbed images.
The Dual-Carriageway Framework (DCF) for fine-grained visual classification addresses challenges like varying illuminance, imbalanced data, noisy textures, and camera bias in industrial datasets, and improves performance by automatically seeking the best-suited training pathway.
A data-free defense mechanism for black-box models against adversarial attacks utilizes a wavelet noise remover to filter corrupted high-frequency coefficients and a regenerator network to reconstruct missing information, achieving significant improvements in adversarial accuracy.
Fractals can serve as effective pre-training datasets for anomaly detection and localization, offering a scalable and privacy-preserving alternative to real-world datasets, and demonstrating competitive performance across various benchmark datasets and anomaly detection methods.
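The wavelet noise remover mentioned above works by filtering corrupted high-frequency coefficients. The following is only a generic illustration of that idea, not the authors' Wavelet Noise Remover or Wavelet Coefficient Selection Module: a one-level orthonormal Haar transform of a 1D signal with hard thresholding of the detail (high-frequency) coefficients. Image defenses use 2D transforms, typically over several levels, and the regenerator network that restores lost detail is not modelled. The function names and the threshold value are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length signal."""
    if len(x) % 2:
        raise ValueError("signal length must be even")
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]  # low-pass
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]  # high-pass
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.extend(((a + d) / SQRT2, (a - d) / SQRT2))
    return out


def wavelet_denoise(x, threshold):
    """Zero detail coefficients with magnitude below threshold; keep the approximation."""
    approx, detail = haar_forward(x)
    detail = [d if abs(d) >= threshold else 0.0 for d in detail]
    return haar_inverse(approx, detail)


x = [1.0, 1.1, 5.0, 4.9, 2.0, 2.0]
# Small within-pair wiggles are removed; the large jump from ~1 to ~5 survives
print(wavelet_denoise(x, threshold=0.5))
```

A threshold of 0 returns the input unchanged (up to floating-point rounding), so the threshold directly trades how much high-frequency content is discarded against how much genuine detail is lost.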

Methods / Models / Datasets Mentioned

APS
AR-CP
AUPRO
AUROC
AlexNet
Amazon Rekognition
AnyDoor
AttCAM
AttnGV
AttnGrad
Attribution by Customization (AbC)
Attribution by Unlearning (AbU)
Augmented Reality (AR)
AutoAttack
Autoformer
Autoregressive
BIM
BLIP-Diffusion
Batch Normalization (BN) layers
CFLOW
CLIP
CNN
CP-ViT
CP-ViT-S
Caption Loss
Class Taxonomy
Collaborative Inference
Concept Ablation
Conditional Independence Formulation
Confidence-Aware Convolution (CAC)
Conformal Prediction (CP)
Contrastive Loss
Cross-entropy loss
Custom Diffusion
CutPaste
DALL-E 2
DecompX
Diffusion Autoencoder
Diffusion models
Distribution-Guided Debiasing (DGD)
DreamBooth
Dual-Carriageway Framework (DCF)
Dynamic Sampling Technique
E4T-Diffusion
EWC
Fast-NTK
FastComposer
FastFlow
Fractals
FullGrad
GANs
GenAtt
Gender Shades
GigaGAN
GlobEnc
GradCAM++
Guided Loss-Increasing (GLI)
HOG
HiResCAM
Human 3.6M (H3.6M)
Human 3.6M-C (H3.6M-C)
HyperDreamBooth
IP-Adapter
ITI-GEN
ImageGPT
ImageNet
Imagen
Inception-v3
Instagram3.5B
Inverse Network Attack
JFT300M
Jensen-Shannon Divergence
LAION
LAION-400M
LAION-5B
MPJPE (Mean Per Joint Position Error)
MS-COCO
MURA VIT
MUSE
MVTec AD
MaskGIT
Masked Transformer
Mixer-S
MobileNet
MobileNetV2
NegGrad
Neural Tangent Kernel (NTK)
PANDA
PGD
PaDiM
Parameter-Efficient Fine-Tuning (PEFT)
Parti
PatchCore
Prompt-based fine-tuning
RAPS
RD
RLNet
ROAD dataset
ReLU operations
Regenerator Network (Rn)
ResNet
ResNet-110
ResNet-18
ResNet-50
ResNet-34
SIFT
SISA
STFPM
SUTI
SVDiff
Safe Latent Diffusion
SkipPLUS
Softmax
StyleGAN
T2T-ViT-12
Temporal Additive Gaussian Noise (TAGN)
Textual Inversion
UNSIR
Ultra Fine-Grained Embedding
ViT-B
ViT-S
VisA
Vision Transformers (ViT)
Visual Tokenizer
Wavelet Coefficient Selection Module (WCSM)
Wavelet Noise Remover (WNR)
YFCC100M
ZipLoRA
iG

Topics

3D Human Pose Estimation (3D HPE) · Adversarial Robustness · Anomaly Detection · Assisted Driving · Augmented Reality · Bias Mitigation · Bias Reduction · Brain-Inspired AI · Causal Image Generation · Causal Representation Learning · Computer Vision Workshops · Conditional Independence · Data Attribution · Data Augmentation · Data-Free Defense · Deep Learning · Diffusion Models · Fair AI · Fair Representation Learning · Fine-Grained Visual Classification · Fractals · Generalization · Generative AI Ethics · Image Semantics · Invariance · Inverse Network Attacks · Jensen-Shannon Divergence · Learning Efficiency · Machine Unlearning · Model Customization · Pose Lifter · Privacy in AI · Private Inference · Regularization · Robustness · Scale Learning · Text-to-Image Generation · Uncertainty-Aware Perception · Video Corruption · Vision Transformers · Wavelet Noise Removal
