The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
We provide three popular benchmarks on ImageNet-1k based on various network architectures, as well as results on Tiny-ImageNet for fast experiments. For ResNet variants, we report the median top-1 accuracy over the last 5 (100-epoch training) or 10 (300-epoch training) epochs; for Transformer architectures, we report the best top-1 accuracy.
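To make the reporting protocol concrete, here is a minimal sketch of the reported metric, assuming `per_epoch_top1` is a list of validation top-1 accuracies collected once per epoch (the helper name is illustrative, not part of this repo):

```python
import statistics

def reported_top1(per_epoch_top1, total_epochs):
    """Median top-1 accuracy over the last 5 (100-epoch) or 10 (300-epoch) epochs."""
    window = 5 if total_epochs == 100 else 10
    return statistics.median(per_epoch_top1[-window:])

# e.g. reported_top1(accs, 300) -> median of accs[-10:]
```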
- You can start distributed training with a config file. Here is an example with 4 GPUs training on a single node. You can use `--load_checkpoint ${PATH}` (loading the full checkpoint from `PATH`), `--auto_resume` (resuming from the latest checkpoint), and `--resume_from ${PATH}` (resuming from `PATH`) as optional arguments.

```bash
CUDA_VISIBLE_DEVICES=1,2,3,4 PORT=29001 bash tools/dist_train.sh ${CONFIG_FILE} 4 [optional arguments]
```
- If you have trained or downloaded a model, you can evaluate its classification performance. Here is an example evaluating with 4 GPUs on a single node:

```bash
CUDA_VISIBLE_DEVICES=1,2,3,4 bash tools/dist_test.sh ${CONFIG_FILE} 4 ${PATH_TO_MODEL}
```
These benchmarks follow PyTorch-style settings, training ResNet variants from scratch for 100 and 300 epochs on ImageNet-1k.
Supported mixup algorithms
- Mixup [ICLR'2018]
- CutMix [ICCV'2019]
- ManifoldMix [ICML'2019]
- FMix [ArXiv'2020]
- AttentiveMix [ICASSP'2020]
- SmoothMix [CVPRW'2020]
- SaliencyMix [ICLR'2021]
- PuzzleMix [ICML'2020]
- GridMix [Pattern Recognition'2021]
- ResizeMix [ArXiv'2020]
- AlignMix [CVPR'2022]
- TransMix [CVPR'2022]
- AutoMix [ECCV'2022]
- SAMix [ArXiv'2021]
- DecoupleMix [ArXiv'2022]
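For orientation, the two foundational policies in this list, Mixup (interpolating whole images) and CutMix (pasting a random box from another image), can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed one-hot targets, not the implementation used in this repo:

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    # Sample lam ~ Beta(alpha, alpha) and blend the batch with a shuffled copy.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mix, y_mix

def cutmix(x, y_onehot, alpha=1.0):
    # Paste a random box from a shuffled batch; lam is the kept-area ratio.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    H, W = x.shape[-2:]
    rh, rw = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, H)
    x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, W)
    x_cut = x.clone()
    x_cut[..., y1:y2, x1:x2] = x[idx][..., y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)  # correct lam by the actual box area
    return x_cut, lam * y_onehot + (1 - lam) * y_onehot[idx]
```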
Setup
- Please refer to config files for experiment details: various mixups, AutoMix, SAMix. As for config files of various mixups, please modify `max_epochs` and `mix_mode` in `auto_train_mixups.py` to generate configs and bash scripts (see the generator sketch after this list).
- Since ResNet-18 might be under-fitted on ImageNet-1k, we adopt $\alpha=0.2$ for some cutting-based mixups (CutMix, SaliencyMix, FMix, ResizeMix) based on ResNet-18.
- Notice that 📖 denotes original results reproduced by official implementations.
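As a rough sketch of what such a generator does, the loop below writes one config per mixup mode; the file names, template fields, and base config are hypothetical and do not reflect the actual contents of `auto_train_mixups.py`:

```python
# Hypothetical sketch of a config generator in the spirit of auto_train_mixups.py:
# loop over mixup modes and emit one mmcv-style config file each.
max_epochs = 100                      # 100 or 300
mix_modes = ["mixup", "cutmix", "manifoldmix", "fmix", "resizemix"]

template = """_base_ = '{base}'
model = dict(mix_mode="{mode}")
runner = dict(max_epochs={epochs})
"""

for mode in mix_modes:
    cfg = template.format(base="r50_mixups_CE_none.py", mode=mode, epochs=max_epochs)
    with open(f"r50_{mode}_{max_epochs}e.py", "w") as f:
        f.write(cfg)
```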
Backbones | $\alpha$ | ResNet-18 | ResNet-34 | ResNet-50 | ResNet-101 | ResNeXt-101 |
---|---|---|---|---|---|---|
Epochs | - | 100 epochs | 100 epochs | 100 epochs | 100 epochs | 100 epochs |
Vanilla | - | 70.04 | 73.85 | 76.83 | 78.18 | 78.71 |
MixUp | 0.2 | 69.98 | 73.97 | 77.12 | 78.97 | 79.98 |
CutMix | 1 | 68.95 | 73.58 | 77.17 | 78.96 | 80.42 |
ManifoldMix | 0.2 | 69.98 | 73.98 | 77.01 | 79.02 | 79.93 |
SaliencyMix | 1 | 69.16 | 73.56 | 77.14 | 79.32 | 80.27 |
AttentiveMix+ | 2 | 68.57 | - | 77.28 | - | - |
FMix* | 1 | 69.96 | 74.08 | 77.19 | 79.09 | 80.06 |
PuzzleMix | 1 | 70.12 | 74.26 | 77.54 | 79.43 | 80.53 |
Co-Mixup📖 | 2 | - | - | 77.60 | - | - |
SuperMix📖 | 2 | - | - | 77.63 | - | - |
ResizeMix* | 1 | 69.50 | 73.88 | 77.42 | 79.27 | 80.55 |
AlignMix📖 | 2 | - | - | 78.00 | - | - |
Grafting📖 | 1 | - | - | 77.74 | - | - |
AutoMix | 2 | 70.50 | 74.52 | 77.91 | 79.87 | 80.89 |
SAMix* | 2 | 70.83 | 74.95 | 78.06 | 80.05 | 80.98 |
Backbones | $\alpha$ | ResNet-18 | ResNet-34 | ResNet-50 | ResNet-101 |
---|---|---|---|---|---|
Epochs | - | 300 epochs | 300 epochs | 300 epochs | 300 epochs |
Vanilla | - | 71.83 | 75.29 | 77.35 | 78.91 |
MixUp | 0.2 | 71.72 | 75.73 | 78.44 | 80.60 |
CutMix | 1 | 71.01 | 75.16 | 78.69 | 80.59 |
ManifoldMix | 0.2 | 71.73 | 75.44 | 78.21 | 80.64 |
SaliencyMix | 1 | 70.21 | 75.01 | 78.46 | 80.45 |
FMix* | 1 | 70.30 | 75.12 | 78.51 | 80.20 |
PuzzleMix | 1 | 71.64 | 75.84 | 78.86 | 80.67 |
ResizeMix* | 1 | 71.32 | 75.64 | 78.91 | 80.52 |
AlignMix📖 | 2 | - | - | 79.32 | - |
AutoMix | 2 | 72.05 | 76.10 | 79.25 | 80.98 |
SAMix* | 2 | 72.27 | 76.28 | 79.39 | 81.10 |
These benchmarks follow the timm RSB A2/A3 settings based on ResNet-50, EfficientNet-B0, and MobileNetV2, training 300 (A2) or 100 (A3) epochs with the BCE loss on ImageNet-1k. RSB A3 is a fast training setting, while RSB A2 exploits the full representation ability of ConvNets.
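RSB replaces the usual cross-entropy with binary cross-entropy over the (possibly mixed) targets. A minimal PyTorch sketch of that loss, assuming soft target vectors as produced by Mixup/CutMix (the helper name is illustrative):

```python
import torch
import torch.nn.functional as F

def rsb_bce_loss(logits, soft_targets):
    # BCE treats each class as an independent binary problem, so it accepts
    # the fractional targets produced by Mixup/CutMix directly.
    return F.binary_cross_entropy_with_logits(logits, soft_targets)

logits = torch.randn(8, 1000)          # batch of 8, 1000 classes
targets = torch.zeros(8, 1000)
targets[:, 3] = 0.6                    # e.g. a mixup-blended label pair
targets[:, 7] = 0.4
loss = rsb_bce_loss(logits, targets)
```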
Setup
- Please refer to config files for experiment details: RSB A3 and RSB A2. You can modify `max_epochs` and `mix_mode` in `auto_train_mixups.py` to generate configs and bash scripts for various mixups.
- Notice that the RSB settings employ Mixup with $\alpha=0.1$ and CutMix with $\alpha=1.0$ (see the config sketch after this list). We report the median of top-1 accuracy in the last 5/10 training epochs for 100/300 epochs.
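A sketch of how these interpolation strengths might appear in an OpenMixup-style Python config; the exact field layout is illustrative, assuming `alpha` and `mix_mode` accept paired lists:

```python
# Illustrative config fragment: pair each mix_mode with its alpha.
model = dict(
    alpha=[0.1, 1.0],              # Mixup alpha=0.1, CutMix alpha=1.0
    mix_mode=["mixup", "cutmix"],  # sampled per iteration
)
```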
Backbones | $\alpha$ | ResNet-50 | ResNet-50 | Eff-B0 | Eff-B0 | Mob.V2 1x | Mob.V2 1x |
---|---|---|---|---|---|---|---|
Settings | - | A3 | A2 | A3 | A2 | A3 | A2 |
RSB | 0.1, 1 | 78.08 | 79.80 | 74.02 | 77.26 | 69.86 | 72.87 |
MixUp | 0.2 | 77.66 | - | 73.87 | 77.19 | 70.17 | 72.78 |
CutMix | 0.2 | 77.62 | 79.38 | 73.46 | 77.24 | 69.62 | 72.23 |
ManifoldMix | 0.2 | 77.78 | 79.47 | 73.83 | 77.22 | 70.05 | 72.34 |
AttentiveMix+ | 2 | 77.46 | 79.34 | 72.16 | 75.95 | 67.32 | 70.30 |
SaliencyMix | 0.2 | 77.93 | 79.42 | 73.42 | 77.67 | 69.69 | 72.07 |
FMix* | 0.2 | 77.76 | 79.05 | 73.71 | 77.33 | 70.10 | 72.79 |
PuzzleMix | 1 | 78.02 | 79.78 | 74.10 | 77.35 | 70.04 | 72.85 |
ResizeMix* | 1 | 77.85 | 79.74 | 73.67 | 77.27 | 69.94 | 72.50 |
AutoMix | 2 | 78.44 | - | 74.61 | 77.58 | 71.16 | 73.19 |
SAMix | 2 | 78.64 | - | 75.28 | 77.69 | 71.24 | 73.42 |
Since recently proposed transformer-based architectures adopt mixup as an essential part of their augmentation pipelines, these benchmarks follow the DeiT setting based on DeiT-Small, Swin-Tiny, and ConvNeXt-Tiny on ImageNet-1k.
Setup
- Please refer to config files of various mixups for experiment details: DeiT, PVT, Swin, ConvNeXt, MogaNet. You can modify `max_epochs` and `mix_mode` in `auto_train_mixups.py` to generate configs and bash scripts for various mixups.
- Notice that the DeiT setting employs Mixup with $\alpha=0.8$ and CutMix with $\alpha=1.0$ (see the sketch after this list).
- Notice that the performances of transformer-based architectures are more difficult to reproduce than those of ResNet variants; as in the original papers, we report the mean of the best performance over 3 trials. 📖 denotes original results reproduced by official implementations.
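In practice this recipe matches `timm`'s `Mixup` helper, which switches between Mixup and CutMix per batch. A sketch assuming `timm` is installed; the hyperparameters mirror the standard DeiT defaults, which may differ from this repo's exact configs:

```python
import torch
from timm.data import Mixup

# DeiT-style collate: Mixup(alpha=0.8) or CutMix(alpha=1.0), chosen per batch.
mixup_fn = Mixup(
    mixup_alpha=0.8, cutmix_alpha=1.0,
    prob=1.0, switch_prob=0.5,         # 50/50 switch between the two policies
    label_smoothing=0.1, num_classes=1000,
)

images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))
images, soft_targets = mixup_fn(images, labels)  # targets become soft vectors
```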
Methods | $\alpha$ | DeiT-T | DeiT-S | PVT-T | Swin-T | ConvNeXt-T | MogaNet-T |
---|---|---|---|---|---|---|---|
Vanilla | - | - | 75.66 | - | 80.21 | 79.22 | 79.25 |
DeiT | 0.8, 1 | 74.50 | 79.80 | 75.10 | 81.20 | 82.10 | 79.02 |
MixUp | 0.2 | 74.69 | 77.72 | 75.24 | 81.01 | 80.88 | 79.29 |
CutMix | 0.2 | 73.82 | 80.13 | 75.53 | 81.23 | 81.57 | 78.37 |
ManifoldMix | 0.2 | - | - | - | - | 80.57 | 79.07 |
AttentiveMix+ | 2 | 74.07 | 80.32 | 74.98 | 81.29 | 81.14 | 77.53 |
SaliencyMix | 0.2 | - | 79.88 | 75.71 | 81.37 | 81.33 | 78.74 |
PuzzleMix | 1 | 73.85 | 80.45 | 75.48 | 81.47 | 81.48 | 78.12 |
FMix* | 0.2 | 74.41 | 77.37 | 75.28 | 79.60 | 81.04 | 79.05 |
ResizeMix* | 1 | 74.79 | 78.61 | 76.05 | 81.36 | 81.64 | 78.77 |
TransMix📖 | 0.8, 1 | 72.92 | 80.70 | 75.50 | 81.80 | - | - |
TokenMix📖 | 0.8, 1 | 75.31 | 80.80 | 75.60 | 81.60 | - | - |
AutoMix | 2 | 75.52 | 80.78 | 76.38 | 81.80 | 82.28 | 79.43 |
SAMix* | 2 | - | 80.94 | 76.60 | 81.87 | 82.35 | - |
We summarize mixup benchmarks in Model Zoo.
Please refer to the original papers of ImageNet and AutoMix for details.
@article{IJCV2015ImageNet,
title={ImageNet Large Scale Visual Recognition Challenge},
author={Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael S. Bernstein and Alexander C. Berg and Li Fei-Fei},
journal={International Journal of Computer Vision},
year={2015},
volume={115},
pages={211-252}
}