# DeiT

Training data-efficient image transformers & distillation through attention

## Abstract

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
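
The distillation token mentioned above is trained with a hard-label distillation objective. The snippet below is a minimal PyTorch sketch of that objective, assuming the student produces separate logits from its class token and its distillation token; the function and argument names are illustrative and are not part of this repository (which, as noted in the results section, does not implement the distillation loss).

```python
import torch.nn.functional as F


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Illustrative sketch of DeiT's hard-label distillation objective.

    cls_logits:     student logits from the class token
    dist_logits:    student logits from the distillation token
    teacher_logits: logits from the (typically convnet) teacher
    labels:         ground-truth class indices
    """
    # The class-token head is supervised by the ground-truth labels.
    loss_cls = F.cross_entropy(cls_logits, labels)
    # The distillation-token head is supervised by the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=1)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    # Both terms are weighted equally in the paper's hard-distillation setting.
    return 0.5 * loss_cls + 0.5 * loss_dist
```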

## Results and models

This page is based on the documentation in MMClassification.

### ImageNet-1k

| Model              | Pretrain     | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download     |
| :----------------: | :----------: | :-------: | :------: | :-------: | :-------: | :----: | :----------: |
| DeiT-T             | From scratch | 5.72      | 1.08     | 73.56     | 91.16     | config | model \| log |
| DeiT-T\*           | From scratch | 5.72      | 1.08     | 72.20     | 91.10     | config | model        |
| DeiT-S             | From scratch | 22.05     | 4.24     | 79.93     | 95.14     | config | model \| log |
| DeiT-S\*           | From scratch | 22.05     | 4.24     | 79.90     | 95.10     | config | model        |
| DeiT-B             | From scratch | 86.57     | 16.86    | 81.82     | 95.57     | config | model \| log |
| DeiT-B\*           | From scratch | 86.57     | 16.86    | 81.80     | 95.60     | config | model        |
| DeiT-B distilled\* | From scratch | 86.57     | 16.86    | 83.33     | 96.49     | config | model        |

We follow the original training setting provided by the official repo and reproduce the performance of 300-epoch training from scratch without distillation. Note that this repo does not support the distillation loss in DeiT. Models with * are provided by the official repo.
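
To run one of the checkpoints above with MMClassification's Python API, a minimal sketch looks like the following; the config and checkpoint paths are placeholders for the files linked in the Config and Download columns, and the image path is only an example.

```python
from mmcls.apis import inference_model, init_model

# Placeholders: substitute the config file and checkpoint linked in the table above.
config_file = 'path/to/deit_config.py'
checkpoint_file = 'path/to/deit_checkpoint.pth'

# Build the model and load the pretrained weights.
model = init_model(config_file, checkpoint_file, device='cuda:0')

# Classify a single image; the result contains the predicted class and score.
result = inference_model(model, 'path/to/image.jpg')
print(result)
```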

## Citation

@InProceedings{icml2021deit,
  title =     {Training data-efficient image transformers \& distillation through attention},
  author =    {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
  booktitle = {International Conference on Machine Learning},
  pages =     {10347--10357},
  year =      {2021},
  volume =    {139},
  month =     {July}
}