Peng Sun, Bei Shi, Daiwei Yu, Tao Lin
This is an official PyTorch implementation of the paper On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm (Preprint 2023). In this work:
- We delineate three key objectives for effective dataset distillation on large-scale high-resolution datasets: realism, diversity, and efficiency.
- We introduce the compression rate of information and a realism score backed by $\mathcal{V}$-information theory, together with an optimization-free efficient paradigm, to condense diverse and realistic data.
- Extensive experiments substantiate the effectiveness of our method: it can distill the full ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours).
Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenge of high computational demands. Dataset distillation, a recently emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale and high-resolution datasets, hindering its practicality and feasibility. To this end, we re-examine the existing dataset distillation methods and identify three properties required for large-scale real-world applications, namely, realism, diversity, and efficiency. As a remedy, we propose RDED, a novel computationally efficient yet effective data distillation paradigm, to enable both diversity and realism of the distilled data.
- Give the format for dataset structuring.
- Separate the processes of validation and relabeling.
torchvision==0.16.0
torch==2.1.0
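A quick way to confirm that an environment matches the pinned versions above is sketched below. This is only an illustrative helper, not part of the repository: the function name `find_version_mismatches` is hypothetical, and the check uses the standard-library `importlib.metadata` so that it runs even when the packages are not installed.

```python
from importlib import metadata

# Versions pinned above.
PINNED = {"torch": "2.1.0", "torchvision": "0.16.0"}


def find_version_mismatches(pinned):
    """Return {package: installed_version_or_None} for every package that
    is missing or whose installed version differs from the pinned one."""
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore a local build tag such as "+cu121" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[package] = installed
    return mismatches


if __name__ == "__main__":
    for package, installed in find_version_mismatches(PINNED).items():
        print(f"{package}: expected {PINNED[package]}, found {installed}")
```

An empty result means both packages are installed at the pinned versions.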
The main entry point of a single experiment is main.py. To facilitate running experiments, we provide scripts for the bulk experiments in the paper. For example, to condense ImageNet-1K into a small dataset with RDED, run:
bash ./scripts/imagenet-1k_10ipc_resnet-18_to_resnet-18_cr5.sh
Following SRe$^2$L, we adapt the official Torchvision code to train the observer models from scratch. All our pre-trained observer models listed below are available at link.
Dataset | Backbone | Top-1 accuracy (%) | Input size (px) |
---|---|---|---|
CIFAR10 | ResNet18 (modified) | 93.86 | 32 |
CIFAR10 | Conv3 | 82.24 | 32 |
CIFAR100 | ResNet18 (modified) | 72.27 | 32 |
CIFAR100 | Conv3 | 61.27 | 32 |
Tiny-ImageNet | ResNet18 (modified) | 61.98 | 64 |
Tiny-ImageNet | Conv4 | 49.73 | 64 |
ImageNet-Nette | ResNet18 | 90.00 | 224 |
ImageNet-Nette | Conv5 | 89.60 | 128 |
ImageNet-Woof | ResNet18 | 75.00 | 224 |
ImageNet-Woof | Conv5 | 67.40 | 128 |
ImageNet-10 | ResNet18 | 87.40 | 224 |
ImageNet-10 | Conv5 | 85.4 | 128 |
ImageNet-100 | ResNet18 | 83.40 | 224 |
ImageNet-100 | Conv6 | 72.82 | 128 |
ImageNet-1k | Conv4 | 43.6 | 64 |
All our raw datasets, including those like ImageNet-1K and CIFAR10, store their training and validation components in the following format to facilitate uniform reading using a standard dataset class method:
/path/to/dataset/
├── 00000/
│ ├── image1.jpg
│ ├── image2.jpg
│ ├── image3.jpg
│ ├── image4.jpg
│ └── image5.jpg
├── 00001/
│ ├── image1.jpg
│ ├── image2.jpg
│ ├── image3.jpg
│ ├── image4.jpg
│ └── image5.jpg
├── 00002/
│ ├── image1.jpg
│ ├── image2.jpg
│ ├── image3.jpg
│ ├── image4.jpg
│ └── image5.jpg
This organizational structure ensures compatibility with the unified dataset class, streamlining the process of data handling and accessibility.
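The layout above, one sub-directory per class with the images inside, can be indexed with a few lines of Python. The sketch below is an assumption about how such a tree might be read, not the repository's actual dataset class: the function name `index_dataset` and the accepted file extensions are hypothetical. It assigns integer labels by sorting the class directory names, the same convention `torchvision.datasets.ImageFolder` uses for this kind of tree.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the files actually present.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def index_dataset(root):
    """Index a root/<class_dir>/<image> tree.

    Returns (samples, class_to_idx), where samples is a list of
    (image_path, label) pairs and class_to_idx maps each class
    directory name to its integer label.
    """
    root = Path(root)
    # Sort class directories so that labels are deterministic.
    class_dirs = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(class_dirs)}
    samples = []
    for name in class_dirs:
        for image_path in sorted((root / name).iterdir()):
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                samples.append((image_path, class_to_idx[name]))
    return samples, class_to_idx
```

For the tree shown above, `00000`, `00001`, and `00002` would map to labels 0, 1, and 2, with five samples per class.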
If you find this repository helpful for your project, please consider citing our work:
@InProceedings{sun2024diversity,
title={On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm},
author={Sun, Peng and Shi, Bei and Yu, Daiwei and Lin, Tao},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}
Our code has referred to previous work: