Dataset Distillation: A Comprehensive Review

Event: 1st CVPR Dataset Distillation Workshop June 2024 · Duration: 462 min

Abstract

This segment provides a comprehensive overview of Dataset Distillation (DD), a technique for synthesizing compact datasets to train models efficiently. It explores various applications of DD, including neural architecture search, continual learning, federated learning, and privacy-preserving data sharing. The talk details different DD methodologies such as performance matching, parameter matching, and distribution matching, comparing them with related fields like instance selection, knowledge distillation, and generative models. Experimental evaluation protocols and visualizations of distilled images are presented, along with discussions on computational costs, architecture dependency, and future research directions beyond image classification and single modality.

This segment introduces Data-Centric AI as a crucial approach to address challenges in modern AI development, particularly concerning data efficiency, generalization, and trust. The speaker highlights that current large language models are reaching data saturation, making efficient data utilization paramount. Two main strategies are discussed: learning to compress (data condensation) and learning to generate (data augmentation). The talk delves into the limitations of standard data condensation methods, specifically their poor generalization across different hyperparameters, and proposes a solution called Hyperparameter Calibrated Data Condensation (HCDC) which introduces a condensed validation set to ensure comparable model rankings. Finally, the speaker emphasizes the importance of aligning AI with human values and boosting trust, noting that a significant portion of the public does not yet trust AI technologies.

This video segment delves into data-centric AI, focusing on three key pillars: compression, generation, and trust. It introduces novel methods for efficient hyperparameter search using comparable validation losses (HCDC) and for fixing problematic data augmentations (SAFLEX) by learning sample weights and soft labels. The segment also highlights challenges in evaluating multimodal large language models (MLLMs) with benchmarks like Mementos and Easy2Hard-Bench, and explores the detectability of AI-generated content and the robustness of image watermarks (WAVES). Finally, it addresses LLM safety through data-centric red-teaming, presenting AutoDAN for autonomous adversarial attacks and Shadowcast for data poisoning.

This segment features a comprehensive overview of recent advancements in dataset distillation, presented by Xindi Wu, followed by a talk by Longzhen Li. Xindi Wu’s presentation covers four distinct research papers, delving into memory addressing formulations for efficient distillation, novel methods for distilling vision-language datasets using bi-trajectory matching and LoRA, and an in-depth analysis of the nature and properties of distilled data. The segment concludes with Longzhen Li’s presentation on generative dataset distillation, emphasizing the balance between global structure and local details.

This segment features four presentations on dataset distillation, pruning, and video summarization. The first talk introduces a coreset selection method for object detection, emphasizing consistency in global and local details. The second presents DEEPDISTAL, a deepfake dataset distillation framework utilizing active learning to reduce dataset size while maintaining performance. The third discusses large-scale dataset pruning with dynamic uncertainty, focusing on identifying “learnable” data for lossless compression. Finally, the fourth talk introduces CheckMATE, an efficient video summarization technique that converts videos into images for action classification, demonstrating strong results even at ultra-low resolutions.

This segment features two presentations from the CVPR 2024 Workshop. The first talk introduces Intrinsic-LoRA (I-LoRA), a novel and efficient method for extracting various scene intrinsics from diverse generative models, highlighting the correlation between the quality of extracted intrinsics and the generative model’s visual quality. The second presentation explores the impact of dataset bias on dataset distillation, demonstrating how existing methods fail with biased datasets and proposing a new formulation to address this challenge by extracting unbiased attributes.

Speakers

Hakan Bilen
Furong Huang — University of Maryland
Xindi Wu — Princeton University
Zhiwei Deng — Google DeepMind
Olga Russakovsky — Princeton University
Byron Zhang — Princeton University
William Yang — Princeton University
Ye Zhu — Princeton University
Tian Qin — Harvard University
David Alvarez-Melis — Harvard University
Longzhen Li — Hokkaido University
Guang Li — Hokkaido University
Ren Togo — Hokkaido University
Keisuke Maeda — Hokkaido University
Takahiro Ogawa — Hokkaido University
Miki Haseyama — Hokkaido University
Suyoung Kim — Seoul National University
Md Shohel Rana — Florida Gulf Coast University
Muyang He — Peking University
Masud An-Nur Islam Fahim — University of Vaasa, Finland
Xiaodan Du — Toyota Technological Institute at Chicago (TTIC)
Nicholas Kolkin — Adobe
Greg Shakhnarovich — Toyota Technological Institute at Chicago (TTIC)
Anand Bhattad — Toyota Technological Institute at Chicago (TTIC)
Yao Lu — Institute of Cyberspace Security, Zhejiang University of Technology; Zhejiang University
Jianyang Gu — Zhejiang University
Xuguang Chen — Zhejiang University
Saeed Vahidian — Duke University
Qi Xuan — Zhejiang University

Talks (19)

00:00:00 — Hakan Bilen: Dataset Distillation: A Comprehensive Review
- This talk provides a comprehensive overview of Dataset Distillation (DD), exploring its applications, methodologies, related work, evaluation protocols, and future research directions in the context of efficient model training in the big data era.
01:16:57 — Furong Huang: Advancing AI with Data-Centric Strategies - Boosting Efficiency, Generalization, and Trust
- This talk introduces Data-Centric AI, focusing on learning to compress and generate data to improve efficiency, generalization, and trust in AI models, while addressing challenges like data saturation and hyperparameter generalization.
02:33:55 — Furong Huang: HCDC: Comparable Validation Losses
- This talk introduces HCDC, a method that learns a synthetic validation set by matching hyperparameter gradients of validation losses, ensuring comparable validation-performance rankings between synthetic and original datasets to facilitate efficient hyperparameter search.
03:09:16 — Furong Huang: The Classical Data “Generation”: Augmentation and Fixing “Bad” Augmentations: SAFLEX
- This section discusses the limitations of traditional data augmentation for non-natural images (like medical images) due to “impossible data” and “wrong labels,” and introduces SAFLEX, an automated pipeline that learns sample weights and soft labels to fix problematic augmentations based on validation performance.
04:25:47 — Furong Huang: Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences and Easy2Hard-Bench
- This part presents “Mementos,” a benchmark revealing current MLLMs’ struggles with reasoning over image sequences in daily-life and comic domains, and “Easy2Hard-Bench,” which provides standardized difficulty labels for profiling LLM performance and generalization across various tasks like code, math, and chess.
05:08:05 — Suyoung Kim: Coreset Selection for Object Detection
- Presents a novel coreset selection method for object detection that considers both global structure and local details, showing improved distillation performance.
05:12:25 — Md Shohel Rana: DEEPDISTAL: Deepfake Dataset Distillation using Active Learning
- Introduces DEEPDISTAL, a deepfake dataset distillation framework that leverages active learning for efficient model training, demonstrating comparable performance with significantly reduced dataset sizes.
05:17:35 — Muyang He: Large-scale Dataset Pruning with Dynamic Uncertainty
- Proposes a method for large-scale dataset pruning based on dynamic uncertainty, aiming for lossless compression by identifying more “learnable” data points.
05:22:06 — Furong Huang: On the Possibilities of AI-Generated Text Detection and Benchmarking the Robustness of Image Watermarks
- This segment explores the detectability of AI-generated content, asserting that detection is always possible with sufficient observations, and introduces “WAVES,” a benchmark for evaluating the robustness of image watermarks against 26 types of attacks, including distortion, regeneration, and adversarial methods.
05:34:49 — Furong Huang: What does safety mean in LLMs? Data-Centric Red-Teaming?
- This final section advocates for data-centric red-teaming to address LLM safety, introducing AutoDAN for autonomously generating jailbreaks, prompt leaking, and denial-of-service attacks, and Shadowcast for stealthy data poisoning against vision-language models.
06:16:10 — Masud An-Nur Islam Fahim: CheckMATE: Efficient Video Summarization by Checking Mutually Averaged Temporal Encapsulation
- Introduces CheckMATE, a method for efficient video summarization that converts videos into summary images for action classification, achieving competitive performance with reduced computational cost.
06:25:08 — Xiaodan Du: Intrinsic LoRA: A Generalist Approach for Discovering Knowledge in Generative Models
- Presents Intrinsic-LoRA (I-LoRA), a novel and efficient method for extracting various scene intrinsics from diverse generative models, highlighting the correlation between intrinsic quality and model quality.
06:27:08 — Yao Lu: Exploring the Impact of Dataset Bias on Dataset Distillation
- Investigates the influence of dataset bias on dataset distillation methods, proposing a new formulation to extract and retain unbiased attributes while minimizing the impact of biased ones.
08:05:49 — Xindi Wu: Scaling Down before Scaling Up: Recent Progress on Dataset Distillation
- An overview of dataset distillation, its challenges, and recent progress, covering four specific research works.
08:12:24 — Xindi Wu: Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks
- Introduces a memory addressing formulation for dataset distillation, enabling information sharing across classes and higher compression rates, with critical additions of momentum and long unrolls in BPTT.
08:17:26 — Xindi Wu: Vision Language Dataset Distillation
- Discusses distilling vision-language datasets, which lack discrete classes, proposing a bi-trajectory matching method with contrastive loss and leveraging low-rank adaptation (LoRA) for efficient training with complex models like ViTs.
08:32:04 — Xindi Wu: What is Dataset Distillation Learning?
- Explores the nature of distilled data, its substitutability for real data, the type of information captured, and whether individual data points carry meaningful information, finding that distilled data captures similar information to real data but is more sensitive during training.
08:38:51 — Xindi Wu: A Label is Worth A Thousand Images In Dataset Distillation
- Investigates the driving force behind successful dataset distillation, identifying soft labels as a key factor and establishing an empirical data-knowledge scaling law, showing expert knowledge is equivalent to 6x data size reduction.
08:48:16 — Longzhen Li: Generative Dataset Distillation: Balancing Global Structure and Local Details
- Introduces a generative dataset distillation method (DIM) that distills the original dataset's information into a generative model, combining global structure matching with local feature matching to improve distillation performance and generalization.
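
Several of the talks above refer to soft labels (SAFLEX, and the soft-label finding in "A Label is Worth A Thousand Images"). As a minimal sketch of what training against a soft label means, assuming a plain softmax classifier, the cross-entropy below accepts a full target distribution instead of a single class index. The function names, logits, and label values are illustrative assumptions and are not taken from any of the talks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits).

    A one-hot target reduces this to the usual hard-label loss.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


# Hypothetical model output for a 3-class problem.
logits = [2.0, 0.5, -1.0]

hard_label = [1.0, 0.0, 0.0]   # ordinary one-hot label
soft_label = [0.7, 0.2, 0.1]   # e.g. probabilities produced by a teacher model

hard_loss = soft_label_cross_entropy(logits, hard_label)
soft_loss = soft_label_cross_entropy(logits, soft_label)
print(f"hard-label loss: {hard_loss:.4f}")
print(f"soft-label loss: {soft_loss:.4f}")
```

The soft target penalizes the model for how it distributes probability over every class, so each example carries more supervisory signal than a single class index does.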

Key Takeaways

Dataset Distillation aims to create small, highly informative synthetic datasets that can train models as effectively as large original datasets, offering significant benefits in various machine learning scenarios.
Different DD methodologies exist, focusing on matching model performance, parameter space solutions, or data distributions, each with its own advantages and computational trade-offs.
DD is a rapidly evolving field with ongoing research to improve scalability, reduce computational costs, enhance architecture agnosticism, and extend its application beyond image classification and single-modality data.
Distilled images, while not always realistic, are highly informative for model training, and the field is exploring ways to generate more complex and multi-modal synthetic data.
Data-Centric AI focuses on improving data efficiency and quality for better model performance and trust.
Current large language models are facing data saturation, making efficient data utilization critical.
Standard Data Condensation (SDC) methods struggle with generalization across different hyperparameters, leading to a need for new approaches.
Hyperparameter Calibrated Data Condensation (HCDC) is proposed to address this by ensuring comparable model rankings on condensed validation sets.
Building trust in AI is crucial, and Data-Centric AI strategies can contribute to this goal by improving model reliability and transparency.
Data-centric approaches, such as HCDC and SAFLEX, can significantly improve model training and hyperparameter optimization by focusing on the quality and relevance of data.
Current MLLMs still face significant challenges in complex reasoning tasks, especially with image sequences, highlighting the need for more robust evaluation benchmarks like Mementos and Easy2Hard-Bench.
Detecting AI-generated content is theoretically always possible with sufficient observations. Image watermarks are important for trust in AI-generated media, but current schemes remain vulnerable to distortion, regeneration, and adversarial attacks.
LLM safety can be enhanced through data-centric red-teaming: AutoDAN autonomously generates attacks such as jailbreaks and prompt leaking, while Shadowcast demonstrates stealthy data poisoning against vision-language models.
Dataset distillation is a powerful technique to reduce dataset size while preserving model performance, crucial for resource-intensive large models.
Memory addressing formulations and bi-trajectory matching with LoRA are effective strategies for distilling complex datasets, including vision-language pairs.
Distilled data captures essential information from real data, but its properties differ, requiring careful consideration for training and generalization.
Soft labels play a critical role in the success of dataset distillation, and an empirical data-knowledge scaling law quantifies the efficiency gains.
Coreset selection can effectively reduce dataset size for object detection while preserving performance by focusing on global structure and local details.
Active learning can be a powerful tool for dataset distillation in complex tasks like deepfake detection, enabling significant computational savings.
Dynamic uncertainty can guide large-scale dataset pruning to achieve lossless compression by identifying data points that are “learnable” but not too easy or too hard.
Video summarization into images, combined with image classifiers, offers an efficient approach to video action classification, even for ultra-low resolution videos.
Intrinsic-LoRA (I-LoRA) offers an efficient way to extract physical scene intrinsics from various generative models, requiring minimal new parameters and samples.
The quality of scene intrinsics extracted using I-LoRA is directly correlated with the visual quality of the underlying generative model.
Dataset bias significantly impacts the performance of existing dataset distillation methods, necessitating new approaches that explicitly address and mitigate bias.
A proposed formulation for biased dataset distillation aims to extract and retain unbiased attributes while minimizing the influence of biased ones, showing improved performance on biased datasets.
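
As a rough illustration of the distribution-matching idea mentioned in the takeaways, the sketch below scores a candidate synthetic set by the squared distance between its mean embedding and that of the real data. The random one-layer ReLU embedding, the toy Gaussian data, and all names are assumptions chosen for illustration; this is not the implementation of DM or of any other method from the workshop.

```python
import random


def embed(x, weights):
    """Random one-layer ReLU embedding: max(0, W x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def mean_embedding(batch, weights):
    """Average embedding of a batch of input vectors."""
    feats = [embed(x, weights) for x in batch]
    return [sum(col) / len(feats) for col in zip(*feats)]


def distribution_matching_loss(real, synthetic, weights):
    """Squared distance between the mean embeddings of the two sets."""
    mu_real = mean_embedding(real, weights)
    mu_syn = mean_embedding(synthetic, weights)
    return sum((a - b) ** 2 for a, b in zip(mu_real, mu_syn))


rng = random.Random(0)
dim, feat_dim = 4, 8

# Randomly initialized embedding weights (never trained in this sketch).
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(feat_dim)]

# Toy "real" data, plus two candidate 10-point synthetic sets.
real = [[rng.gauss(1.0, 0.5) for _ in range(dim)] for _ in range(200)]
close_syn = [[rng.gauss(1.0, 0.5) for _ in range(dim)] for _ in range(10)]
far_syn = [[rng.gauss(-3.0, 0.5) for _ in range(dim)] for _ in range(10)]

loss_same = distribution_matching_loss(real, real, weights)
loss_close = distribution_matching_loss(real, close_syn, weights)
loss_far = distribution_matching_loss(real, far_syn, weights)
print(loss_same, loss_close, loss_far)
```

In a full method the synthetic points would be updated by gradient descent on this loss, typically per class and averaged over many random embeddings; here the loss is only evaluated to show that it separates a well-matched set from a poorly matched one.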

Methods / Models / Datasets Mentioned

AVD
AlexNet
AutoDAN
BERT
BPTT
Batch Normalization (BN)
CIFAR10-DD
CLIP
CMNIST-DD
CSOD
ChatUniVI
CheckMATE
Chinchilla Scaling Laws
Classification-by-description
Claude3-Opus
Colored MNIST
ConvNeXt
ConvNet3
Corrupted CIFAR10
DALL-E3
DC (Gradient-matching based DD)
DEEPDISTAL
DIM
DM (Distribution-matching based DD)
DSA (Gradient-matching based DD)
DVSR
Dataset Cartography
DenseNet
Diffusion Models
Easy2Hard-Bench
FRePo-JAX
FineWeb
FitNets
GLISTER
GPT-4
GPT-4 Turbo
GPT-4V
Gemini1.5-Pro
Generative Adversarial Networks (GANs)
Graph Sampling (RALF CVPR 2012)
Hyperparameter Calibrated Data Condensation (HCDC)
I3D
IFS-3D
ImageNet-1K
ImageNet-21K
InstructBLIP
Intrinsic-LoRA (I-LoRA)
K-NN graph
KL-VAE(f8)
Kernel Herding
Knowledge Distillation
LGD-3D
LLM-generated descriptors
Llama 2
Llama 3
Low-Rank Adaptation (LoRA)
Mementos
MiniGPT4
Mixtral-8x22B
Mixup
NFNet
Prog. DVSR
Qwen1.5-110B
R(2+1)D
RandAugment
RegNet
ResNet
ResNet-18
ResNet50
Ridge Regression
S3D
SAFLEX
SGD
STM
Shadowcast
SoSR
Squeeze, Recover and Relabel (SRe2L)
Stable Diffusion
Stable Diffusion UNet
Stable Signature
Standard Data Condensation (SDC)
StegaStamp
StyleGAN v2
StyleGAN-XL
Swin Transformer
TSN
Tree-Ring
Two-stream
VGG11
VGG16
VQGAN
Variational Autoencoders (VAEs)
ViT
WAVES
Waffle-CLIP
Wasserstein distance
Xception
Zero-shot knowledge distillation
k-means clustering
k-median clustering
mPLUG-Owl-v2

Topics

AI Trust · AI-Generated Content Detection · Action Classification · Active Learning · Architecture Agnosticism · Bi-Trajectory Matching · Computational Efficiency · Continual Learning · Coreset Selection · Data Augmentation · Data Condensation · Data Distillation · Data Efficiency · Data Poisoning · Data Saturation · Data-Centric AI · Data-Knowledge Scaling Law · Dataset Bias · Dataset Condensation · Dataset Distillation · Dataset Pruning · Deepfake Detection · Distribution Matching · Efficient Model Training · Federated Learning · Generative Dataset Distillation · Generative Models · Hyperparameter Optimization · Image Classification · Image Watermarking · Instance Selection · Knowledge Distillation · LLM Safety and Red-Teaming · Low-Rank Adaptation (LoRA) · Memory Addressing Formulation · Model Efficiency · Model Evaluation Benchmarks · Model Generalization · Multi-modal Learning · Multimodal Large Language Models (MLLMs) · Neural Architecture Search · Object Detection · Parameter Matching · Performance Matching · Privacy-preserving AI · Scaling Laws · Scene Intrinsics · Semantic Segmentation · Soft Labels · Video Summarization · Vision-Language Datasets

Notes

Open for commentary — connections to other work, critiques, follow-up reading.