DeeperCluster: Leveraging Large-Scale Uncurated Data for Unsupervised Pre-training of Visual Features
This code implements the unsupervised pre-training of convolutional neural networks, or convnets, as described in Leveraging Large-Scale Uncurated Data for Unsupervised Pre-training of Visual Features. We denote our method by DeeperCluster.
We provide for download the following models:
- DeeperCluster model trained on the full YFCC100M dataset;
- DeepCluster [2] model trained on 1.3M images subset of the YFCC100M dataset;
- RotNet [3] model trained on the full YFCC100M dataset;
- RotNet [3] model trained on ImageNet dataset without labels.
All these models follow a standard VGG-16 architecture with batch-normalization layers.
Note that in Deep/DeeperCluster models, sobel filters are computed within the models as two convolutional layers (greyscale + sobel filters).
The models expect RGB inputs that range in [0, 1]. You should preprocess your data before passing them to the released models by normalizing them: mean_rgb = [0.485, 0.456, 0.406]
; std_rgb = [0.229, 0.224, 0.225]
.
Method / Dataset | YFCC100M | ImageNet |
---|---|---|
DeeperCluster | ours | - |
DeepCluster | deepcluster_yfcc100M trained on 1.3M images | deepcluster_imagenet (found here) |
RotNet | rotnet_yfcc100M | rotnet_imagenet |
To automatically download all models you can run:
$ ./download_models.sh
- Python 3.6
- PyTorch install 1.0.0
- Apex with CUDA extension
- Faiss GPU install
- Download YFCC100M dataset. The ids of the 95.920.149 images we managed to download can be found here.
wget -c -P ./src/data/ "https://dl.fbaipublicfiles.com/deepcluster/flickr_unique_ids.npy"
The script main.sh
will run our method. Here is a screenshot:
python main.py
## handling experiment parameters
--dump_path ./exp/ # Where to store the experiment
## network params
--pretrained PRETRAINED # Use this instead of random weights
## data params
--data_path DATA_PATH # Where to find YFCC100M dataset
--size_dataset 100000000 # How many images to use for training
--workers 10 # Number of data loading workers
--sobel true # Apply Sobel filter
## optim params
--lr 0.1 # Learning rate
--wd 0.00001 # Weight decay
--nepochs 100 # Number of epochs to run
--batch_size 48 # Batch size per process
## model params
--reassignment 3 # Reassign clusters every this epoch
--dim_pca 4096 # Dimension of the pca on the descriptors
--super_classes 16 # Total number of super-classes
--rotnet true # Network needs to classify large rotations
## k-means params
--k 320000 # Total number of clusters
--warm_restart false # Use previous centroids as init
--use_faiss true # Use faiss for E step in k-means
--niter 10 # Number of k-means iterations
## distributed training params
--world-size 64 # Number of distributed processes
--dist-url DIST_URL # Url used to set up distributed training
You can look the training full documentation up with python main.py --help
.
This implementation has been specifically designed for multi-GPU and multi-node training and tested up to 128 GPUs distributed accross 16 nodes of 8 GPUs each.
The total number of GPUs used for an experiment (world_size
) must be divisible by the total number of super-classes (super_classes
).
Hence, exactly a total of super_classes
training communication groups of world_size / super_classes
GPUs each are created.
The parameters of a sub-class classifier specific to a super-class are shared within the corresponding training group.
Each training group deals only with the subset of images and the rotation angle associated with its corresponding super-class.
For this reason, computing batch statistics in the batch normalization layers for the entire batch (distributed accross the different training groups) is crucial.
We do so thanks to apex.
For the first stage of hierarchical clustering into nmb_super_clusters
clusters, the entire pool of GPUs is used.
Then for the second stage, we create nmb_super_clusters
clustering communication groups of world_size / nmb_super_clusters
GPUs each.
Each of these clustering groups independantly performs the second stage of hierarchical clustering on its corresponding subset of data (data belonging to the associated super-cluster).
For example, as illustrated below, let's assume we want to run a training with 8 super-classes and we have access to a pool of 16 GPUs. As many training distributed communication groups as the number of super-classes are created. This corresponds to creating 8 training groups (in red) of 2 GPUs. Moreover, the first level of the hierarchical k-means corresponds to the clustering of the data into 8/4=2 super-clusters. Hence, 2 clustering groups (in blue) are created.
You can have a look here for more details about how we define the different communication groups. The multi-node is automatically handled by SLURM.
To run an experiment with multiple GPUs on a single machine, simply replace python train.py in the command above with:
export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py
You won't need to specify the world-size in this scenario.
Our implementation is generic enough to encompass both DeepCluster and RotNet trainings.
- DeepCluster: set
super_classes
to1
androtnet
tofalse
. - RotNet: set
super_classes
to4
,k
to1
androtnet
totrue
.
To reproduce our results on PASCAL VOC 2007 classification task run:
- FC6-8
python eval_voc_classif.py --data_path $PASCAL_DATASET --fc6_8 true --pretrained downloaded_models/deepercluster/ours.pth --sobel true --lr 0.003 --wd 0.00001 --nit 150000 --stepsize 20000 --split trainval
- ALL
python eval_voc_classif.py --data_path $PASCAL_DATASET --fc6_8 false --pretrained downloaded_models/deepercluster/ours.pth --sobel true --lr 0.003 --wd 0.0001 --nit 150000 --stepsize 10000 --split trainval
Running the experiment with 5 seeds.
There are different sources of randomness in the code: classifier initialization, ramdon crops for the evaluation and training with CUDA.
For more reliable results, we recommend to run the experiment several times with different seeds (--seed 36
for example).
Hyper-parameters selection.
We select the value of the different hyper-parameters (weight-decay wd
, learning rate lr
, and step-size stepsize
) by training on the train split and validating on the validation set.
To do so, simply use --split train
.
We train linear classifiers with a logistic loss on top of frozen convolutional layers at different depths. To reduce the influence of feature dimension in the comparison, we average-pool the features until their dimension is below 10k.
To reproduce our results from Table-3 run: ./conv13.sh
.
To reproduce our results from Figure-2 run: ./linear_classif_layers.sh
Learning rates. We use the learning rate decay recommended for linear models with L2 regularization by Leon Bottou in Stochastic Gradient Descent Tricks.
Hyper-parameters selection.
For experiments on Pascal, we select the value of the initial learning rate by training on the train split and validating on the validation set.
To do so, simply use --split train
.
For experiments on ImageNet and Places, this code implements k-fold cross-validation.
Simply set --kfold 3
for 3-fold cross-validation.
Then set --cross_valid 0
for training on splits 1 and 2 and validating on split 0 for example.
Checkpointing and distributed training. This code implements automatic checkpointing and is adapted to distributed training on multi-gpus and/or multi-nodes.
To reproduce our results on the pre-training for ImageNet experiment (Table-2) run:
mkdir -p ./exp/pretraining_imagenet/
export NGPU=1; python -m torch.distributed.launch --nproc_per_node=$NGPU eval_pretrain.py --pretrained ./downloaded_models/deepercluster/ours.pth --sobel true --sobel2RGB true --nepochs 100 --batch_size 256 --lr 0.1 --wd 0.0001 --dump_path ./exp/pretraining_imagenet/ --data_path $DATAPATH_IMAGENET
Checkpointing and distributed training. This code implements automatic checkpointing and is specifically intended for distributed training on multi-gpus and/or multi-nodes. The results in the paper for this experiment are obtained with training on 4 GPUs (the batch size per GPU is 64 in this case).
[1] M. Caron, P. Bojanowski, J. Mairal, A. Joulin Leveraging Large-Scale Uncurated Data for Unsupervised Pre-training of Visual Features
@article{caron2019leveraging,
title={Leveraging Large-Scale Uncurated Data for Unsupervised Pre-training of Visual Features},
author={Caron, Mathilde and Bojanowski, Piotr and Mairal, Julien and Joulin, Armand},
journal={arXiv preprint arXiv:1905.01278},
year={2019}
}
[2] M. Caron, P. Bojanowski, A. Joulin, M. Douze Deep clustering for unsupervised learning of visual features
@inproceedings{caron2018deep,
title={Deep clustering for unsupervised learning of visual features},
author={Caron, Mathilde and Bojanowski, Piotr and Joulin, Armand and Douze, Matthijs},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2018}
}
[3] S. Gidaris, P. Singh, N. Komodakis Unsupervised representation learning by predicting image rotations
@inproceedings{
gidaris2018unsupervised,
title={Unsupervised Representation Learning by Predicting Image Rotations},
author={Spyros Gidaris and Praveer Singh and Nikos Komodakis},
booktitle={International Conference on Learning Representations},
year={2018},
url={https://openreview.net/forum?id=S1v4N2l0-},
}
See the LICENSE file for more details.