Skip to content

danczs/Visformer

Repository files navigation

Visformer

pytorch

Introduction

This is a pytorch implementation for the Visformer models. This project is based on the training code in DeiT and the tools in timm.

Usage

Clone the repository:

git clone https://github.com/danczs/Visformer.git

Install pytorch, timm and einops:

pip install -r requirements.txt

Data Preparation

The layout of Imagenet data:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img1.jpeg
    class2/
      img2.jpeg

Network Training

Visformer_small

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save

Visformer_tiny

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model visformer_tiny --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save

Viformer V2 models

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model swin_visformer_small_v2 --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model swin_visformer_tiny_v2 --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5

The model performance:

model top-1 (%) FLOPs (G) paramters (M)
Visformer_tiny 78.6 1.3 10.3
Visformer_tiny_V2 79.6 1.3 9.4
Visformer_small 82.2 4.9 40.2
Visformer_small_V2 83.0 4.3 23.6
Visformer_medium_V2 83.6 8.5 44.5

pre-trained models:

model model log top-1 (%)
Visformer_small (original) github github 82.21
Visformer_small (+ Swin for downstream tasks) github github 82.34
Visformer_small_v2 (+ Swin for downstream tasks) github github 83.00
Visformer_medium_v2 (+ Swin for downstream tasks) github github 83.62

(In some logs, the model is only tested for the last 50 epochs to save the training time.)

More information about Visformer V2.

Object Detection on COCO

The standard self-attention is not efficient for high-reolution inputs, so we simply replace the standard self-attention with Swin-attention for object detection. Therefore, Swin Transformer is our directly baseline.

Mask R-CNN

Backbone sched box mAP mask mAP params FLOPs FPS
Swin-T 1x 42.6 39.3 48 267 14.8
Visformer-S 1x 43.0 39.6 60 275 13.1
VisformerV2-S 1x 44.8 40.7 43 262 15.2
Swin-T 3x + MS 46.0 41.6 48 367 14.8
VisformerV2-S 3x + MS 47.8 42.5 43 262 15.2

Cascade Mask R-CNN

Backbone sched box mAP mask mAP params FLOPs FPS
Swin-T 1x + MS 48.1 41.7 86 745 9.5
VisformerV2-S 1x + MS 49.3 42.3 81 740 9.6
Swin-T 3x + MS 50.5 43.7 86 745 9.5
VisformerV2-S 3x + MS 51.6 44.1 81 740 9.6

This repo only contains the key files for object detection ('./ObjectDetction'). Swin-Visformer-Object-Detection is the full detection project.

Pre-trained Model

Beacause of the policy of our institution, we cannot send the pre-trained models out directly. Thankfully, @hzhang57 and @developer0hye provides Visformer_small and Visformer_tiny models trained by themselves.

Automatic Mixed Precision (amp)

In the original version of Visformer, amp can cause NaN values. We find that the overflow comes from the attention mask:

scale = head_dim ** -0.5
attn = ( q  @ k.transpose(-2,-1) ) * scale

To avoid overflow, we pre-normalize q & k, and, thus, overall normalize 'attn' with 'head_dim' instead of 'head_dim ** 0.5':

scale = head_dim ** -0.5
attn =  (q * scale) @ (k.transpose(-2,-1) * scale) 

Amp training:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model visformer_tiny --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5

This change won't degrade the training performance.

Using amp for the original pre-trained models:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --eval --resume /path/to/weights --amp

Citing

@inproceedings{chen2021visformer,
  title={Visformer: The vision-friendly transformer},
  author={Chen, Zhengsu and Xie, Lingxi and Niu, Jianwei and Liu, Xuefeng and Wei, Longhui and Tian, Qi},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={589--598},
  year={2021}
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages