# VideoSwin

Video Swin Transformer

## Abstract

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 85.9 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).

## Results and Models

### Kinetics-400

| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32x2x1 | 224x224 | 8 | Swin-T | ImageNet-1k | 78.90 | 93.77 | 78.84 [VideoSwin] | 93.76 [VideoSwin] | 4 clips x 3 crop | 88G | 28.2M | config | ckpt | log |
| 32x2x1 | 224x224 | 8 | Swin-S | ImageNet-1k | 80.54 | 94.46 | 80.58 [VideoSwin] | 94.45 [VideoSwin] | 4 clips x 3 crop | 166G | 49.8M | config | ckpt | log |
| 32x2x1 | 224x224 | 8 | Swin-B | ImageNet-1k | 80.57 | 94.49 | 80.55 [VideoSwin] | 94.66 [VideoSwin] | 4 clips x 3 crop | 282G | 88.0M | config | ckpt | log |
| 32x2x1 | 224x224 | 8 | Swin-L | ImageNet-22k | 83.46 | 95.91 | 83.1\* | 95.9\* | 4 clips x 3 crop | 604G | 197M | config | ckpt | log |

### Kinetics-700

| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32x2x1 | 224x224 | 16 | Swin-L | ImageNet-22k | 75.92 | 92.72 | 4 clips x 3 crop | 604G | 197M | config | ckpt | log |

### Kinetics-710

| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32x2x1 | 224x224 | 32 | Swin-S | ImageNet-1k | 76.90 | 92.96 | 4 clips x 3 crop | 604G | 197M | config | ckpt | log |
1. The **gpus** column indicates the number of GPUs we used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best practice is to pass `--auto-scale-lr` when calling `tools/train.py`; this flag automatically scales the learning rate according to the ratio between the actual batch size and the original batch size (see the sketch after this list).
2. The values in the columns named "reference" are the results obtained by testing on our dataset with the checkpoints provided by the authors, using the same model settings. Values marked with \* are copied directly from the paper.
3. The validation set of Kinetics-400 we used consists of 19,796 videos, available at Kinetics400-Validation. The corresponding data list (each line is of the format `video_id, num_frames, label_index`) and the label map are also available.
4. Pre-trained image models can be downloaded from Swin Transformer for ImageNet Classification.
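
A minimal sketch of note 1, assuming the config exposes the per-GPU batch size under the key `train_dataloader.batch_size` (an assumption; check your config, only the config path below is taken from the Train section):

```shell
# Sketch: train with 4 videos per GPU instead of the default, and let the
# learning rate be rescaled automatically to match the changed total batch size.
# `train_dataloader.batch_size` is assumed to be the key used by this config.
python tools/train.py configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
    --auto-scale-lr \
    --cfg-options train_dataloader.batch_size=4
```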

For more details on data preparation, you can refer to Kinetics.

## Train

You can use the following command to train a model.

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

Example: train the VideoSwin model on the Kinetics-400 dataset deterministically, with periodic validation.

```shell
python tools/train.py configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
    --seed=0 --deterministic
```
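
The checkpoints above were trained on 8 GPUs. If the standard OpenMMLab distributed launcher `tools/dist_train.sh` is present in your checkout (an assumption, since it is not shown here), the same job can be launched on multiple GPUs roughly as follows:

```shell
# Sketch: launch training on 8 GPUs via the distributed launcher.
# Assumes tools/dist_train.sh takes CONFIG and NUM_GPUS, followed by the usual
# train.py arguments.
bash tools/dist_train.sh configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py 8 \
    --seed=0 --deterministic
```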

For more details, you can refer to the Training part in the Training and Test Tutorial.

## Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the VideoSwin model on the Kinetics-400 dataset and dump the results to a pkl file.

```shell
python tools/test.py configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
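
The testing protocol above (4 clips x 3 crop) can be slow on a single GPU. If the matching distributed launcher `tools/dist_test.sh` is available (again an assumption; the argument order below is sketched, not confirmed by this document), the evaluation can be spread over several GPUs:

```shell
# Sketch: run the same evaluation on 8 GPUs.
# Assumes tools/dist_test.sh takes CONFIG, CHECKPOINT and NUM_GPUS, followed by
# the usual test.py arguments.
bash tools/dist_test.sh configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth 8 --dump result.pkl
```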

For more details, you can refer to the Test part in the Training and Test Tutorial.

## Citation

```BibTeX
@inproceedings{liu2022video,
  title={Video swin transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3202--3211},
  year={2022}
}
```