Welcome to the Workshop on Responsible Data!

Event: CVPR 2024 Workshop on Responsible Data · Duration: 499 min

Abstract

This recording of the CVPR 2024 Workshop on Responsible Data opens by introducing the workshop’s focus on ethical data practices in computer vision. The opening talks cover the challenges and opportunities in creating and using responsible datasets and benchmarks, including the inherent biases in data, the impact of distribution shifts on model performance, and the ethical considerations surrounding large-scale pre-training and image anonymization. The speakers advocate for more dynamic, participatory, and context-aware approaches to data management and model evaluation to ensure fairness, privacy, and real-world impact.
The first round-table segment, titled “Round Table #1”, focuses on the complexities of creating responsible AI datasets. Participants discuss current challenges, including managing bias in data, obtaining proper consent for data usage, and the practicalities of large-scale data collection. The discussion extends to future considerations, such as the ethical implications of scaling laws in AI development, the use of synthetic data, and the need for human-centric approaches in model evaluation, particularly in sensitive domains like medical data.
A second round-table segment, also titled “Round Table #1”, addresses risks, ethics, and data in AI development. Participants explore the difficulty of weighing potential risks against value, the impact of regulations like GDPR on data collection, and the complexities of ensuring diverse and representative datasets. The conversation touches on the controversy surrounding datasets like LAION, the responsibility for harmful content, and the potential of synthetic data as a solution, while acknowledging that synthetic data carries its own inherent biases. A key focus is the human element in data annotation, consent, and the broader societal implications of AI, with participants questioning the distinction between ‘responsible AI’ and ‘normal AI’.
William Agnew from Carnegie Mellon University then presents on the real-world applications of computer vision research. He highlights how a significant portion of computer vision research, particularly from CVPR papers, is cited in patents related to surveillance and militarization. The presentation covers the ethical responsibilities of researchers and the need for socio-technical foresight to align research with values, and it questions the pervasive emphasis on generality in computer vision.
Four rapid talks follow on various aspects of responsible data. The first discusses data sharing policies in ecological applications, highlighting sovereignty concerns and the impact of data scale on ML progress. The second introduces a new deepfake detection dataset, DETER, which aims to address new risks from advanced generative AI. The third presents AI-EDI-SPACE, a co-designed dataset for evaluating public spaces, focusing on citizen engagement and ethical AI principles. The final talk proposes ‘Dataset Clinics’ as a community-based approach to data creation and management in Africa, emphasizing sovereignty, equity, and leveraging shared heritage for sustainable development.
The closing segment emphasizes that data is the fundamental essence of AI, akin to the flowers in perfume. The speaker notes that while much research focuses on models, the critical role of data is often overlooked. He stresses the dynamic nature of data, describing it as a ‘living asset’ that requires continuous effort in its design, collection, processing, and adaptation to evolving norms for responsible AI. The segment also announces the intention to compile the workshop’s discussions into a white paper.

Speakers

Candice Schumann
Dr. Sara Beery — MIT Faculty of Artificial Intelligence and Decision-Making
Ryosuke Yamada — University of Tsukuba, Japan
Luca Piano — Politecnico di Torino, Italy
William K.
Moderator
Participant 1
Participant 2
Participant 3
Participant 4
William Agnew — Carnegie Mellon University
Neha Hulkund — MIT CSAIL
Ye Zhu — Princeton University
Hugo Berard — Université de Montréal
Camille Minns — The Environmental Justice in Tech Project at Earth Hacks
Sanjana Paul — The Environmental Justice in Tech Project at Earth Hacks
Wilhelmina Ndapwea Onyothi Nekoto — Masakhane NLP
Organizer

Talks (12)

00:00:00 — Candice Schumann: Introduction to the Workshop on Responsible Data
- Introduction to the workshop, highlighting the importance of responsible data practices in computer vision beyond just model architectures.
00:03:35 — Dr. Sara Beery: Benchmarking Models in a Changing World
- Discussion on the critical role of data and benchmarks in computer vision, emphasizing biases, distribution shifts, and the need for dynamic, decision-making-focused, and participatory benchmarks for real-world applications like biodiversity monitoring.
00:46:15 — Ryosuke Yamada: Is ImageNet Pre-training Fair in Image Recognition?
- Investigation into the fairness of ImageNet pre-training, comparing MSL and SSL, finding SSL generally produces fairer outcomes and proposing synthetic pre-training as a solution to data concerns.
00:54:20 — Luca Piano: The role of image anonymization in balancing fairness and privacy in data collection: ethical and technical challenges
- Exploration of generative anonymization techniques for privacy protection and data utility, highlighting potential issues like artifacts, bias amplification, beautification, and stereotyped images in anonymized datasets.
01:04:05 — William K.: AI DATA ACCESS CONTROL (AI-DAC) in RDBMD: A Comprehensive Review
- Presentation of the AI-DAC framework, a triangular model for responsible AI development, emphasizing continuous data monitoring, context appraisal, and actions across diagnostic, development, and corrective loops.
01:23:13 — Moderator: Round Table #1: Challenges in Designing Responsible AI Datasets
- This round table discussion explores current and future challenges in designing responsible AI datasets, covering topics such as bias, consent, data collection, scaling laws, and the ethical implications of using large language models like GPT-4.
02:46:27 — Multiple participants: Round Table #1: Risks, Ethics, and Data in AI
- A round table discussion on the challenges of risk assessment, ethical data collection, policy implications, and the use of synthetic data in AI development, highlighting the human element and the evolving landscape of responsible AI.
04:09:41 — William Agnew: What is Computer Vision Used for?
- William Agnew discusses how computer vision research is applied in real-world scenarios, particularly its increasing entanglement with surveillance and militarization, and the ethical implications for researchers.
05:33:13 — Neha Hulkund: Data Sharing Policies in Ecological Applications
- This talk explores the promises and pitfalls of data sharing in ecological applications, focusing on ethical concerns, environmental impact, data representativeness, and the critical issue of data sovereignty, particularly in the context of formerly colonized nations.
05:36:47 — Ye Zhu: DETER: Detecting Edited Regions for Deterring Generative Manipulations
- This presentation introduces DETER, a large-scale dataset and benchmark for deepfake detection that addresses new challenges posed by advanced generative AI techniques, particularly focusing on mitigating spurious correlations and providing a unified evaluation across different granularities of manipulation.
05:40:41 — Hugo Berard: AI-EDI-SPACE: A Co-designed Dataset for Evaluating the Quality of Public Spaces
- This talk presents AI-EDI-SPACE, a co-designed dataset and methodology for evaluating the quality of public spaces, emphasizing citizen engagement, interdisciplinarity, and ethical AI principles to address issues like accessibility and discrimination, while ensuring local ownership and diverse representation in data collection.
06:33:57 — Camille Minns: Machines are Learning, African Communities are Training
- This presentation advocates for a transformative approach to data creation and management in Africa, emphasizing community-based, consent-driven ‘Dataset Clinics’ to leverage shared heritage for socio-economic benefit, challenging traditional data acquisition methods, and repositioning communities as architects of a responsible and equitable digital future.

Key Takeaways

Data quality and responsible practices are paramount in computer vision, often outweighing architectural innovations, especially for real-world applications.
All data is inherently biased, and these biases can be amplified or create unintended consequences if not carefully considered during dataset creation, model training, and evaluation.
Future benchmarks should be dynamic, decision-making-focused, participatory, and resource-aware to effectively address distribution shifts and ensure real-world impact.
Generative AI and synthetic data offer potential solutions for privacy and fairness concerns but introduce new challenges like artifacts, bias amplification, and the need to define what information is truly ‘non-sensitive’ or culturally significant.
Designing responsible AI datasets presents significant challenges, encompassing both technical aspects like bias mitigation and ethical considerations such as consent.
The round-table discussion highlights a tension between “scaling laws” (emphasizing massive data collection) and “human-centric” approaches, suggesting a need for balance in AI development.
Future challenges involve navigating the ethical implications of increasingly powerful AI models and ensuring that data collection methods respect privacy and avoid perpetuating biases.
The round-table participants emphasize the importance of understanding the context of data and the potential impact of AI systems on human lives, moving beyond purely quantitative metrics like accuracy.
The definition and assessment of AI risks are evolving, requiring a balance between potential harm and value, and often becoming clear only upon deployment.
Regulatory frameworks like GDPR significantly impact data collection practices, creating challenges for AI development, especially concerning sensitive data and international variations.
The quality and representativeness of datasets are crucial for ethical AI, with concerns raised about the presence of harmful content (e.g., in the LAION dataset) and the difficulty of ensuring diversity while respecting privacy and consent.
Synthetic data offers a potential avenue to address data scarcity and privacy, but its generation must carefully consider inherent biases and ethical implications to avoid perpetuating existing problems.
The human element in data annotation, consent, and the broader societal impact of AI systems is often undervalued, leading to questions about fair compensation, the ethics of data usage, and the responsibility of developers and platforms.
A substantial and growing portion of computer vision research, particularly from CVPR, is being applied in surveillance and militarization technologies.
Patents offer an imperfect but valuable window into the real-world applications and downstream impacts of academic research.
Researchers should consider the ethical implications of their work and use socio-technical foresight to align their research with their values, rather than solely pursuing generality.
There is a need for greater transparency and accountability regarding the specific applications of computer vision research, especially concerning human data and its use in sensitive contexts like border surveillance and predictive policing.
Data sharing in ecological applications requires careful consideration of data sovereignty, especially for formerly colonized nations, and local data is often more valuable than distant data for improving model performance.
New generative AI techniques introduce novel risks in deepfake creation, necessitating new detection datasets and benchmarks that mitigate spurious correlations and provide unified evaluation across different granularities of manipulation.
Developing AI tools for public spaces requires active citizen participation and co-design methodologies, incorporating diverse perspectives and ensuring local ownership to create more equitable and representative algorithms.
A transformative approach to data creation in Africa, exemplified by ‘Dataset Clinics,’ can empower communities as knowledge producers and developers, leveraging shared heritage to drive socio-economic aspirations and ensure responsible, sustainable data practices.
The core of AI’s functionality and behavior is intrinsically linked to the data used in its models.
Effective management of data, from design and collection to processing and storage, is crucial for building responsible AI systems.
Data is a dynamic entity, not static, and must be continuously adapted and refreshed to align with changing norms and requirements for responsible use.
The insights and discussions from this workshop will be documented and shared in a forthcoming white paper.

Methods / Models / Datasets Mentioned

AI-DAC
AI-EDI-SPACE
Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities
BLIP
Bloomberg
CAMOUFLAGE
CARE principles
CIFAR
CLIP
COCO
CUB-200
CVPR
Can Generalist Foundation Models Outcompete Special-Purpose Tuning?
CelebA-HQ
ControlNet
DETER
DINO
Data Trusts
DeepForest
DeepPrivacy2
DiffIR
DiffSwap
Documentation Frameworks
FAIR principles
FALCO
FGCVx Fungi
FPIC
FairFace
Faster R-CNN
Formula-Driven Supervised Learning (FDSL)
GAN-based MAT
GAN-based E4S
GDPR
GPT-4
GeoDE
IMDB-WIKI
IRIS
ImageNet
JFT-300M
Jigsaw
LAION
LAION-5B
Latent Diffusion Models (LDMs)
MAE
MNIST
MaskFormer
Mask R-CNN
Microsoft Academic Graph
MoCo
Montreal Declaration for Responsible AI
NEONCROWNS Dataset
Reliance on Science dataset
Sankofa
SimCLR
Stable Diffusion
TACO
TRAK: Attributing Model Behavior at Scale
The CARE Principles for Indigenous Data Governance
VAE Decoder
WILDS dataset
YOLO
iNaturalist

Topics

AI Ethics · AI regulation · African Communities · Citizen Engagement · Computer Vision Applications · Computer Vision Benchmarks · Consent · Data Bias · Data Sharing · Data Sovereignty · Data collection challenges · Data consent · Data ethics · Data governance · Data lifecycle · Data management · Data privacy · Data quality · Dataset Design · Dataset bias · Deepfake Detection · Distribution Shift · Ecological Applications · Ethical AI · Fairness in AI · Generality in AI · Generative AI · Image Anonymization · Large language models · Militarization · Patent Analysis · Privacy Protection · Public Spaces · Research Responsibility · Responsible AI · Responsible AI datasets · Responsible Data · Risk assessment · Scaling laws · Socio-technical Foresight · Surveillance Technology · Synthetic Data · Workshop conclusion
