Applying ViT-Adapter to Object Detection

Our detection code is developed on top of MMDetection v2.22.0.

For details, see Vision Transformer Adapter for Dense Predictions.

If you use this code in a paper, please cite:

@article{chen2022vitadapter,
  title={Vision Transformer Adapter for Dense Predictions},
  author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.08534},
  year={2022}
}

Usage

Install MMDetection v2.22.0.

# recommended environment: torch1.9 + cuda11.1
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install timm==0.4.12
pip install mmdet==2.22.0
pip install instaboostfast # for htc++
cd ops && sh make.sh # compile deformable attention (&& runs make.sh only after cd succeeds)
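
After installing, it can help to confirm that the environment matches the pins in the commands above. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `check_versions` is hypothetical and not part of this repository.

```python
from importlib import metadata

# Pins copied from the install commands above; local build tags
# such as "+cu111" are dropped before comparison.
EXPECTED = {
    "torch": "1.9.0",
    "torchvision": "0.10.0",
    "mmcv-full": "1.4.2",
    "timm": "0.4.12",
    "mmdet": "2.22.0",
}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_or_None, expected, matches)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        # Compare only the public version, ignoring a suffix like "+cu111".
        ok = have is not None and have.split("+")[0] == want
        report[name] = (have, want, ok)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:12s} installed={have} expected={want} [{status}]")
```

A mismatch here does not necessarily mean the code will fail, but the versions above are the ones the commands in this section install.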

Data Preparation

Prepare COCO according to the guidelines in MMDetection v2.22.0.
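
As a quick sanity check of the dataset layout, the sketch below looks for the annotation files and image folders that MMDetection's COCO instance configs normally read. It assumes the data lives under `data/coco` relative to the working directory; that path and the helper name `missing_coco_entries` are assumptions, so adjust them if your configs use a different `data_root`.

```python
from pathlib import Path

# Entries used for COCO train/val with instance annotations.
REQUIRED = [
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "train2017",
    "val2017",
]


def missing_coco_entries(data_root="data/coco"):
    """Return the required entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in REQUIRED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_coco_entries()
    if missing:
        print("Missing under data/coco:", ", ".join(missing))
    else:
        print("COCO layout looks complete.")
```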

Pre-training Sources

| Name | Type | Year | Data | Repo | Paper |
|---|---|---|---|---|---|
| DeiT | Supervised | 2021 | ImageNet-1K | repo | paper |
| AugReg | Supervised | 2021 | ImageNet-22K | repo | paper |
| BEiT | MIM | 2021 | ImageNet-22K | repo | paper |
| MAE | MIM | 2021 | ImageNet-1K | repo | paper |
| Uni-Perceiver | Supervised | 2022 | Multi-Modal | - | paper |

Results and Models

HTC++

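| Backbone | Pre-train | Lr schd | mini-val box AP | mini-val mask AP | test-dev box AP | test-dev mask AP | #Param | Config | Download |
|---|---|---|---|---|---|---|---|---|---|
| ViT-Adapter-L | BEiT-L | 3x+MS | 58.4 | 50.8 | 58.9 | 51.3 | 401M | config | model |
| ViT-Adapter-L (TTA) | BEiT-L | 3x+MS | 60.2 | 52.2 | 60.4 | 52.5 | 401M | - | - |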

Mask R-CNN

| Method | Backbone | Pre-train | Lr schd | box AP | mask AP | #Param | Config | Download |
|---|---|---|---|---|---|---|---|---|
| Mask R-CNN | ViT-Adapter-T | DeiT-T | 3x+MS | 46.0 | 41.0 | 28M | config | model |
| Mask R-CNN | ViT-Adapter-S | DeiT-S | 3x+MS | 48.2 | 42.8 | 48M | config | model |
| Mask R-CNN | ViT-Adapter-B | DeiT-B | 3x+MS | 49.6 | 43.6 | 120M | config | model |
| Mask R-CNN | ViT-Adapter-B | Uni-Perceiver | 3x+MS | 50.7 | 44.9 | 120M | config | model |
| Mask R-CNN | ViT-Adapter-L | AugReg-L | 3x+MS | 50.9 | 44.8 | 348M | config | model |

Advanced Detectors

| Method | Framework | Pre-train | Lr schd | box AP | mask AP | #Param | Config | Download |
|---|---|---|---|---|---|---|---|---|
| ViT-Adapter-S | Cascade Mask R-CNN | DeiT-S | 3x+MS | 51.5 | 44.3 | 86M | config | model |
| ViT-Adapter-S | ATSS | DeiT-S | 3x+MS | 49.6 | - | 36M | config | model |
| ViT-Adapter-S | GFL | DeiT-S | 3x+MS | 50.0 | - | 36M | config | model |
| ViT-Adapter-S | Sparse R-CNN | DeiT-S | 3x+MS | 48.1 | - | 110M | config | model |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE-B | 25ep+LSJ | 50.3 | 44.7 | 122M | config | model |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE-B | 50ep+LSJ | 50.8 | 45.1 | 122M | config | model |

Evaluation

To evaluate ViT-Adapter-L + HTC++ on COCO val2017 on a single node with 8 GPUs, run:

sh dist_test.sh configs/htc++/htc++_beit_adapter_large_fpn_3x_coco.py /path/to/checkpoint_file 8 --eval bbox segm

This should give:

Evaluate annotation type *bbox*
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.584
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.771
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.642
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.441
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.622
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.725
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.742
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.742
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.742
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.615
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.775
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.864

Evaluate annotation type *segm*
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.508
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.750
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.556
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.331
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.542
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.687
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.645
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.645
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.645
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.503
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.681
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.780
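
To compare a run against the numbers above without reading the log by eye, the summary lines can be parsed into a dictionary. This is a minimal sketch that assumes the log uses exactly the line format shown above; the function name `parse_coco_summary` is hypothetical and not part of this repository or of MMDetection.

```python
import re

# Matches e.g. "Evaluate annotation type *bbox*"
TYPE_RE = re.compile(r"Evaluate annotation type \*(\w+)\*")

# Matches e.g.
# " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.584"
LINE_RE = re.compile(
    r"Average (?:Precision|Recall)\s+\((AP|AR)\) @\[ IoU=(\S+)\s*\| "
    r"area=\s*(\w+) \| maxDets=\s*(\d+) \] = (-?\d+\.\d+)"
)


def parse_coco_summary(text):
    """Return {annotation_type: {(metric, iou, area, max_dets): value}}."""
    results = {}
    current = None
    for line in text.splitlines():
        m = TYPE_RE.search(line)
        if m:
            current = m.group(1)
            results[current] = {}
            continue
        m = LINE_RE.search(line)
        if m and current is not None:
            metric, iou, area, max_dets, value = m.groups()
            results[current][(metric, iou, area, int(max_dets))] = float(value)
    return results


def headline_ap(results, annotation_type):
    """AP averaged over IoU=0.50:0.95, all areas (first line of each block)."""
    return results[annotation_type][("AP", "0.50:0.95", "all", 100)]
```

Applied to the output above, `headline_ap` would return 0.584 for `bbox` and 0.508 for `segm`, which correspond to the 58.4 box AP and 50.8 mask AP listed for mini-val in the HTC++ table.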

Training

To train ViT-Adapter-T + Mask R-CNN on COCO train2017 on a single node with 8 GPUs for 36 epochs, run:

sh dist_train.sh configs/mask_rcnn/mask_rcnn_deit_adapter_tiny_fpn_3x_coco.py 8
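
For a rough sense of how long the 36-epoch schedule is, the number of optimizer iterations can be estimated from the dataset size and the global batch size. The sketch below assumes 2 images per GPU, which is not stated in this README, so check `samples_per_gpu` in the config you actually use. The 118,287 figure is the full COCO train2017 image count; the dataloader may use slightly fewer if it filters out images without annotations.

```python
import math


def total_iterations(num_images, gpus, samples_per_gpu, epochs):
    """Return (iterations per epoch, total iterations) for a fixed schedule."""
    global_batch = gpus * samples_per_gpu
    iters_per_epoch = math.ceil(num_images / global_batch)
    return iters_per_epoch, iters_per_epoch * epochs


if __name__ == "__main__":
    # Assumed: 2 images per GPU (hypothetical value, see the config).
    per_epoch, total = total_iterations(
        num_images=118287, gpus=8, samples_per_gpu=2, epochs=36
    )
    print(f"{per_epoch} iterations per epoch, {total} in total")
```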

Image Demo & Video Demo

Please see issue #23.