While mislabeled or ambiguously-labeled samples in the training set can hurt the performance of deep models, diagnosing the dataset and identifying mislabeled samples helps to improve generalization. Training dynamics, i.e., the traces left by iterations of optimization algorithms, have recently proven effective for localizing mislabeled samples with hand-crafted features. In this paper, going beyond manually designed features, we introduce a novel learning-based solution: a noise detector, instantiated as an LSTM network, which learns to predict whether a sample was mislabeled from its raw training dynamics. The proposed method trains the noise detector in a supervised manner on a dataset with synthesized label noise and can adapt to various datasets (with either natural or synthesized label noise) without retraining. We conduct extensive experiments to evaluate the proposed method: we train the noise detector on label-noised CIFAR and test it on Tiny ImageNet, CUB-200, Caltech-256, WebVision, and Clothing1M. Results show that the proposed method precisely detects mislabeled samples on various datasets without further adaptation and outperforms state-of-the-art methods. Further experiments demonstrate that mislabel identification can guide label correction, namely data debugging, providing improvements orthogonal to algorithm-centric state-of-the-art techniques, from the data side.
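The core idea, an LSTM that reads a sample's raw training dynamics and predicts whether its label is noisy, can be pictured with the following minimal sketch. It is illustrative only: the hidden size, number of layers, and input layout are assumptions, not the repository's exact architecture.

```python
import torch
import torch.nn as nn

class LSTMNoiseDetector(nn.Module):
    """Minimal sketch of a 2-layer LSTM noise detector.

    Input: the training dynamics of one sample, i.e., a sequence of
    length num_epochs where each step holds N per-class probability
    features (GT probability + top-(N-1) probabilities).
    Output: the probability that the sample's label is noisy.
    """

    def __init__(self, feat_dim: int = 3, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, td: torch.Tensor) -> torch.Tensor:
        # td: (batch, num_epochs, feat_dim); use the last hidden state
        _, (h_n, _) = self.lstm(td)
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)

# toy usage: 8 samples, 120 epochs of dynamics, 3 features per epoch
detector = LSTMNoiseDetector()
scores = detector(torch.randn(8, 120, 3))  # noise scores in (0, 1)
```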
- pytorch >= 1.7.1
- torchvision >= 0.4
- scikit-learn
- numpy
- pandas
We run experiments on 5 small datasets...
- cifar10
- cifar100
- tiny_imagenet
- cub_200_2011
- caltech256
... and 2 large datasets
- webvision50 (subset of WebVision)
- clothing100k (subset of Clothing1M)
We use the same subsets as AUM for these two datasets. Click Here to download and untar the file to access CUB-200-2011 and Caltech-256.
- Acquisition of metadata and training dynamics (td for short) for manually corrupted or real-world datasets.
- Training an LSTM model as a detector.
- Retraining a new model on clean data, which includes two parts:
  - Metrics of label noise detection on synthesized datasets (CIFAR-10/100, Tiny ImageNet) and retraining a new model on clean data.
  - Less overfitting to noisy labels on real-world datasets (WebVision50 and Clothing100K).
STEP1: Acquisition of metadata and training dynamics (td for short) for manually corrupted or real-world datasets.
generate_td.sh <datadir> <dataset> <seed> <noise_ratio> <noise_type> <net_type> <depth>
# run to get td for small datasets [no manual corruption]
CUDA_VISIBLE_DEVICES=0 ./generate_td.sh "/root/codespace/datasets" "cifar10" 1 0. "uniform" "resnet" 32
# run to get td for small datasets [uniform 0.2 noisy]
CUDA_VISIBLE_DEVICES=0 ./generate_td.sh "/root/codespace/datasets" "tiny_imagenet" 1 0.2 "uniform" "resnet" 32
# run to get td for large datasets [noise_ratio and noise_type are ignored]
CUDA_VISIBLE_DEVICES=0 ./generate_td.sh "/root/codespace/datasets" "webvision50" 1 0. "uniform" "resnet" 50
The arguments:
<datadir>
- path to the datasets folder, laid out like:
  |-- datasets
      |-- cifar10
      |   |-- cifar-10-batches-py
      |   |   |-- data_batch_1
      |   |   |-- ...
      |-- cifar100
      |   |-- cifar-100-python
      |   |   |-- meta
      |   |   |-- ...
<dataset>
- default = cifar10; indicates which dataset to use
<seed>
- default = 0; indicates the random seed
<noise_type>
- default = uniform; indicates the type of noise: uniform means symmetric and flip means asymmetric (see the sketch after this list)
<noise_ratio>
- default = 0.2; indicates how many labels are corrupted
<net_type>
- default = resnet; indicates which model to apply; can be modified in /models
<depth>
- default = 32; indicates the depth of the model. For example, the depth of resnet32 is 32.
<result_save_path>
- default = 'replication'; indicates where to save experiments
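To make the uniform/flip distinction concrete, here is a minimal sketch of how such corruption could be synthesized. It is illustrative only: the repository's actual corruption lives in its dataset class, and the flip mapping (c + 1) % num_classes is an assumption.

```python
import numpy as np

def corrupt_labels(labels, noise_ratio, noise_type, num_classes, seed=0):
    """Illustrative label corruption (not the repo's exact code).

    uniform: a corrupted label is resampled uniformly from the other
             classes (symmetric noise).
    flip:    a corrupted label is flipped to one fixed class
             (asymmetric noise); (c + 1) % num_classes is an assumed
             mapping for illustration.
    """
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels).copy()
    n = len(labels)
    idx = rng.choice(n, size=int(noise_ratio * n), replace=False)
    for i in idx:
        if noise_type == "uniform":
            choices = [c for c in range(num_classes) if c != labels[i]]
            labels[i] = rng.choice(choices)
        else:  # "flip"
            labels[i] = (labels[i] + 1) % num_classes
    flipped = np.zeros(n, dtype=bool)  # corruption mask, cf. label_flipped
    flipped[idx] = True
    return labels, flipped
```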
The script generate_td.sh calls the class Dataset to corrupt the chosen dataset with the given seed, noise_type, and noise_ratio, and then calls the function Runner.train_for_td_computation, which saves the metadata (corruption information) and trains a classification model to acquire the training dynamics. Both can be found in runner.py.
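Conceptually, acquiring training dynamics means recording, after every epoch, a few per-class probability features for each training sample. A minimal sketch of one epoch's recording, assuming a loader that yields (index, image, label) and keeping N = 3 features (the probability of the assigned label plus the top-2 class probabilities); this is a simplification, not the repository's exact bookkeeping:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def record_epoch_td(model, loader, device="cuda"):
    """One epoch's worth of training dynamics (illustrative sketch)."""
    model.eval()
    feats = {}
    for index, images, labels in loader:  # loader layout is an assumption
        probs = F.softmax(model(images.to(device)), dim=1).cpu()
        # probability assigned to the (possibly noisy) training label
        label_prob = probs.gather(1, labels.view(-1, 1)).squeeze(1)
        # top-2 class probabilities among all classes
        top2, _ = probs.topk(2, dim=1)
        for i, idx in enumerate(index.tolist()):
            feats[idx] = torch.stack([label_prob[i], top2[i, 0], top2[i, 1]])
    return feats  # stack over epochs to get [num_samples, N, num_epochs]
```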
After running this, the code will save all of the following in one folder named computation4td_seed{seed}:
- model.pth --> best model
- model.pth.last --> last model
- train_log.csv --> record of the training process
| epoch | train_error | train_loss | valid_error | valid_top5_error | valid_loss |
- results_valid.csv --> sample-wise validation results
| index | Loss | Prediction | Confidence | Label |
- metadata.pth --> corruption information
OrderedDict([
    ('train_indices', tensor([45845, ..., 8475])),
    ('valid_indices', tensor([], dtype=torch.int64)),
    ('true_targets', tensor([56, ..., 67])),
    ('label_flipped', tensor([False, ..., True]))])
- training_dynamics.npz --> saved td file
{
    'td': array of shape
        [number of samples in training,
         N (GT + top-(N-1) average probabilities among all classes of all epochs),
         training length],
    'labels': array of shape
        [number of samples in training,
         N (labels of GT + top-(N-1) probabilities)]
}
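Both artifacts can be inspected directly. A small sketch, assuming the Step 1 output folder from the commands above and that the npz keys match the names documented here:

```python
import numpy as np
import torch

# corruption information saved by Step 1
meta = torch.load("computation4td_seed1/metadata.pth")
print(int(meta["label_flipped"].sum()), "corrupted training samples")

# training dynamics saved by Step 1
saved = np.load("computation4td_seed1/training_dynamics.npz", allow_pickle=True)
td, labels = saved["td"], saved["labels"]
print(td.shape)      # [num_samples_in_training, N, training_length]
print(labels.shape)  # [num_samples_in_training, N]
```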
STEP2: Training an LSTM model as a detector
# train a 2-layer lstm with noisy 0.2 cifar10
CUDA_VISIBLE_DEVICES=0 python train_detector.py --r 0.2 --dataset cifar10 --files_path "./replication/cifar10_resnet32_percmislabeled0.2_uniform/computation4td_seed1"
# fine-tune a 2-layer lstm on noisy 0.2 cub, resuming from cifar100_0.3_lstm_detector.pth.tar
CUDA_VISIBLE_DEVICES=0 python train_detector.py --r 0.2 --dataset cub_200_2011 --files_path "./replication/cifar10_resnet34_percmislabeled0.2_uniform/computation4td_seed1" --resume "cifar100_0.3_lstm_detector.pth.tar"
Two pretrained LSTM detectors are provided by default. Either works for both the CIFAR-10 and CIFAR-100 tasks, but each performs better on its corresponding task:
- cifar10_0.2_lstm_detector.pth.tar: better for CIFAR-10
- cifar100_0.3_lstm_detector.pth.tar: better for CIFAR-100; also used for noise detection on Clothing100K and WebVision50
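Step 2 amounts to ordinary supervised sequence classification: the detector reads each sample's td sequence and is fit against the label_flipped flags from metadata.pth with a binary cross-entropy loss. A minimal, self-contained sketch; the architecture, batch size, and optimizer settings are assumptions rather than the repository's exact configuration:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class Detector(nn.Module):
    # same shape contract as the sketch near the top of this README
    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)

def train_detector(td, label_flipped, epochs=20, lr=1e-3, device="cpu"):
    """Fit the detector on synthesized noise (illustrative sketch).

    td:            float tensor [num_samples, num_epochs, feat_dim]
                   (transpose the saved [samples, N, length] array first)
    label_flipped: bool tensor [num_samples] from metadata.pth
    """
    detector = Detector(feat_dim=td.shape[-1]).to(device)
    opt = torch.optim.Adam(detector.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    loader = DataLoader(TensorDataset(td, label_flipped.float()),
                        batch_size=256, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(detector(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    return detector
```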
STEP3: Metrics of label noise detection on synthesized datasets (CIFAR-10/100, Tiny ImageNet) and retraining a new model on clean data
The script small_dataset_sym_denoise.sh calls the function runner.Runner.train. This function begins with runner.Runner.subset, which detects the mislabeled samples and divides the original training set into clean and noisy sets. Meanwhile, the ROC and mAP metrics for identifying mislabeled samples are reported by calling get_order from detector_models.predict.
After detection, the function runner.Runner.train uses the clean set to train a new model. (The size of this set depends on remove_ratio.) Note that runner.Runner.train requires metadata.pth, training_dynamics.npz, and <detector_files>, where the first two come from Step 1 and the third comes from Step 2.
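The detect-divide logic and the reported metrics can be pictured in a few lines. A hedged sketch (the repository's real versions live in runner.py and detector_models/predict.py; this helper is illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def detect_and_divide(scores, remove_ratio, label_flipped=None):
    """Rank samples by detector noise score and split the training set.

    scores:        [num_samples] noise scores from the LSTM detector
    remove_ratio:  fraction of highest-scoring samples treated as noisy
    label_flipped: optional ground-truth flags, for reporting metrics
    """
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1]          # most suspicious first
    n_remove = int(remove_ratio * len(scores))
    noisy_idx, clean_idx = order[:n_remove], order[n_remove:]
    if label_flipped is not None:             # only on synthesized datasets
        print("ROC-AUC:", roc_auc_score(label_flipped, scores))
        print("mAP:    ", average_precision_score(label_flipped, scores))
    return clean_idx, noisy_idx
```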
small_dataset_sym_denoise.sh <datadir> <dataset> <seed> <noise_ratio> <noise_type> <result_save_path> <detector_file> <remove_ratio>
# run to detect-divide target dataset and retrain the model
detector_files='cifar10_0.2_lstm_detector.pth.tar'
# run to denoise sym cifar10
for remove_ratio in 0.15 0.2 0.25
do
CUDA_VISIBLE_DEVICES=0 ./small_dataset_sym_denoise.sh "/root/codespace/datasets" "cifar10" 1 0.2 "uniform" ${detector_files} ${remove_ratio}
done
# run to denoise asym cifar100
for remove_ratio in 0.35 0.4 0.45
do
CUDA_VISIBLE_DEVICES=0 ./small_dataset_sym_denoise.sh "/root/codespace/datasets" "cifar100" 1 0.4 "asym" ${detector_files} ${remove_ratio}
done
Less overfitting to noisy labels on real-world datasets (WebVision50 and Clothing100K)
After ranking all training samples, the function runner.Runner.train selects a cleaner subset to train a new model. (The size of this subset depends on remove_ratio.) Note that it requires training_dynamics.npz and <detector_files>, which come from Step 1 and Step 2, respectively. After running this, the code will save all of the following in another folder named {net_name}_prune4retrain_seed{seed}.
large_dataset_denoise.sh <datadir> <dataset> <seed> <result_save_path> <detector_file> <remove_ratio>
# run to detect-divide target dataset and retrain the model
detector_files='cifar100_0.3_lstm_detector.pth.tar'
remove_ratio=0.2
# run to denoise WebVision50
CUDA_VISIBLE_DEVICES=0 ./large_dataset_denoise.sh "/root/codespace/datasets" "webvision50" 1 ${detector_files} ${remove_ratio}
# run to denoise Clothing100K
CUDA_VISIBLE_DEVICES=0 ./large_dataset_denoise.sh "/root/codespace/datasets" "clothing100k" 1 ${detector_files} ${remove_ratio}
Arguments:
<detector_files>
- the noise detector, a 2-layer LSTM trained in Step 2
<remove_ratio>
- the ratio of samples removed, which are believed to be noisy/mislabeled
Output:
- model.pth --> best model trained by clean part
- model.pth.last --> last model trained by clean part
- train_log.csv --> record of the training process
| epoch | train_error | train_loss | valid_error | valid_top5_error | valid_loss |
- results_valid.csv --> sample-wise validation results
| index | Loss | Prediction | Confidence | Label |
In Section 4.3, we apply a data debugging strategy to further boost SOTA performance. Using a detector trained on noisy CIFAR-100, we first select the samples most suspected of being label noise and train a new model on the clean part of the dataset. The labels of the suspected samples are then replaced by error-free ones (ground-truth labels for CUB and the new model's predictions for WebVision), namely data debugging; the corrected labels are recorded in cub_200_2011/ and mini_webvision/. For replication, the only difference we introduce is in the datasets' labels. Based on the source code and instructions of DivideMix and AugDesc, we mainly modify the label-reading part of the datasets. We provide the modified dataloader.py and trainer for experiments on symmetric CUB-200-2011 and mini WebVision to boost DivideMix.
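The label replacement itself is a small operation over the detector's ranking. A sketch of the idea (illustrative only; the corrected labels shipped in cub_200_2011/ and mini_webvision/ are what the experiments actually use):

```python
import numpy as np

def debug_labels(labels, scores, k, replacement):
    """Replace the k most suspicious labels with error-free ones.

    replacement: ground-truth labels (CUB-200-2011) or the retrained
    model's predictions (WebVision50), per Section 4.3.
    """
    labels = np.asarray(labels).copy()
    suspicious = np.argsort(scores)[::-1][:k]  # top-k noise scores
    labels[suspicious] = np.asarray(replacement)[suspicious]
    return labels
```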
If you make use of our work, please cite our paper:
@article{jia2022learning,
title={Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features},
author={Jia, Qingrui and Li, Xuhong and Yu, Lei and Bian, Jiang and Zhao, Penghao and Li, Shupeng and Xiong, Haoyi and Dou, Dejing},
journal={arXiv preprint arXiv:2212.09321},
year={2022}
}
The implementation is based on the AUM code. Parts of the experiments are based on DivideMix and AugDesc. Thanks for their brilliant work!