The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches that compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 85.9 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32x2x1 | 224x224 | 8 | Swin-T | ImageNet-1k | 78.90 | 93.77 | 78.84 [VideoSwin] | 93.76 [VideoSwin] | 4 clips x 3 crop | 88G | 28.2M | config | ckpt | log |
32x2x1 | 224x224 | 8 | Swin-S | ImageNet-1k | 80.54 | 94.46 | 80.58 [VideoSwin] | 94.45 [VideoSwin] | 4 clips x 3 crop | 166G | 49.8M | config | ckpt | log |
32x2x1 | 224x224 | 8 | Swin-B | ImageNet-1k | 80.57 | 94.49 | 80.55 [VideoSwin] | 94.66 [VideoSwin] | 4 clips x 3 crop | 282G | 88.0M | config | ckpt | log |
32x2x1 | 224x224 | 8 | Swin-L | ImageNet-22k | 83.46 | 95.91 | 83.1* | 95.9* | 4 clips x 3 crop | 604G | 197M | config | ckpt | log |

frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|
32x2x1 | 224x224 | 16 | Swin-L | ImageNet-22k | 75.92 | 92.72 | 4 clips x 3 crop | 604G | 197M | config | ckpt | log |

frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|
32x2x1 | 224x224 | 32 | Swin-S | ImageNet-1k | 76.90 | 92.96 | 4 clips x 3 crop | 166G | 49.8M | config | ckpt | log |
- The **gpus** column indicates the number of GPUs we used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set `--auto-scale-lr` when calling `tools/train.py`; this parameter will auto-scale the learning rate according to the ratio of the actual batch size to the original batch size.
- The values in the columns named after "reference" are the results obtained by testing on our dataset, using the checkpoints provided by the authors with the same model settings. `*` means that the numbers are copied from the paper.
- The validation set of Kinetics-400 we used consists of 19,796 videos. These videos are available at Kinetics400-Validation. The corresponding data list (each line is of the format `video_id, num_frames, label_index`) and the label map are also available.
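As a convenience, the data list described above can be read with a few lines of Python. This is a minimal sketch, not part of the repository; the field separator (whitespace or comma-separated, both handled below) is an assumption, so check it against the actual file.

```python
# Sketch: parse a data list where each line holds
# video_id, num_frames, label_index.
# Assumes whitespace- or comma-separated fields -- adjust if your
# annotation file uses a different delimiter.
from collections import Counter


def parse_data_list(path):
    """Return a list of (video_id, num_frames, label_index) tuples."""
    samples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Normalize commas to spaces so both styles parse the same way.
            video_id, num_frames, label = line.replace(',', ' ').split()
            samples.append((video_id, int(num_frames), int(label)))
    return samples


def label_histogram(samples):
    """Count videos per label, e.g. to sanity-check class balance."""
    return Counter(label for _, _, label in samples)
```

This can be handy for verifying that all 19,796 validation videos are present before launching an evaluation run.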
- Pre-trained image models can be downloaded from Swin Transformer for ImageNet Classification.
For more details on data preparation, you can refer to Kinetics.
You can use the following command to train a model.
```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```
Example: train the VideoSwin model on the Kinetics-400 dataset deterministically, with periodic validation.
```shell
python tools/train.py configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
    --seed=0 --deterministic
```
For more details, you can refer to the Training part in the Training and Test Tutorial.
You can use the following command to test a model.
```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```
Example: test the VideoSwin model on the Kinetics-400 dataset and dump the result to a pkl file.
```shell
python tools/test.py configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
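The dumped pkl file can then be post-processed offline, e.g. to recompute top-1 accuracy. The sketch below assumes each entry is a dict with `pred_score` (a per-class score list) and `gt_label` (an int) keys; this schema is an assumption, so inspect one entry of your own dump to confirm the actual layout before relying on it.

```python
# Sketch: load a dumped result file and recompute top-1 accuracy.
# The per-entry keys 'pred_score' and 'gt_label' are assumed --
# verify them against an actual dump before use.
import pickle


def load_results(path):
    """Load the list of per-sample result dicts from a pkl dump."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def top1_accuracy(results):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = 0
    for entry in results:
        scores = entry['pred_score']
        pred = max(range(len(scores)), key=lambda i: scores[i])
        correct += int(pred == entry['gt_label'])
    return correct / len(results)
```

For example, `top1_accuracy(load_results('result.pkl'))` would return a float in [0, 1].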
For more details, you can refer to the Test part in the Training and Test Tutorial.
```BibTeX
@inproceedings{liu2022video,
  title={Video swin transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3202--3211},
  year={2022}
}
```