Our detection code is developed on top of MMDetection v2.22.0.
For details, see Vision Transformer Adapter for Dense Predictions.
If you use this code in a paper, please cite:
@article{chen2022vitadapter,
title={Vision Transformer Adapter for Dense Predictions},
author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
journal={arXiv preprint arXiv:2205.08534},
year={2022}
}
Install MMDetection v2.22.0.
# recommended environment: torch1.9 + cuda11.1
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install timm==0.4.12
pip install mmdet==2.22.0
pip install instaboostfast # for htc++
cd ops && sh make.sh # compile deformable attention
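After installing, it can be useful to confirm that the pinned versions are the ones actually present. The following is a minimal sketch, not part of this repository: `EXPECTED`, `base_version`, and `check_pins` are hypothetical names, and the pins are copied from the commands above. It relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Pins copied from the install commands above. Local build tags such as
# "+cu111" are ignored when comparing.
EXPECTED = {
    "torch": "1.9.0",
    "torchvision": "0.10.0",
    "mmcv-full": "1.4.2",
    "timm": "0.4.12",
    "mmdet": "2.22.0",
}


def base_version(version):
    """Drop a local build tag, e.g. '1.9.0+cu111' -> '1.9.0'."""
    return version.split("+", 1)[0]


def check_pins(expected=EXPECTED, lookup=metadata.version):
    """Return {package: (wanted, found)} for missing or mismatched packages.

    `found` is None when the package is not installed.
    """
    problems = {}
    for name, wanted in expected.items():
        try:
            found = base_version(lookup(name))
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_pins().items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result from `check_pins()` means every listed package matches its pin. It does not verify that the deformable attention ops compiled successfully.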
Prepare COCO according to the guidelines in MMDetection v2.22.0.
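As a quick sanity check on the prepared dataset, the sketch below lists which expected entries are missing. It assumes the usual MMDetection layout under `data/coco` (annotation JSON files plus `train2017` and `val2017` image folders); the `REQUIRED` tuple and the `missing_coco_paths` helper are hypothetical and should be adjusted if your configs point elsewhere.

```python
from pathlib import Path

# Assumed layout, following the usual MMDetection convention (data/coco/...).
REQUIRED = (
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "train2017",
    "val2017",
)


def missing_coco_paths(root="data/coco"):
    """Return the required entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_coco_paths()
    if missing:
        print("Missing under data/coco:", ", ".join(missing))
    else:
        print("COCO layout looks complete.")
```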
| Name | Type | Year | Data | Repo | Paper |
|---|---|---|---|---|---|
| DeiT | Supervised | 2021 | ImageNet-1K | repo | paper |
| AugReg | Supervised | 2021 | ImageNet-22K | repo | paper |
| BEiT | MIM | 2021 | ImageNet-22K | repo | paper |
| MAE | MIM | 2021 | ImageNet-1K | repo | paper |
| Uni-Perceiver | Supervised | 2022 | Multi-Modal | - | paper |
HTC++
| Backbone | Pre-train | Lr schd | mini-val box AP | mini-val mask AP | test-dev box AP | test-dev mask AP | #Param | Config | Download |
|---|---|---|---|---|---|---|---|---|---|
| ViT-Adapter-L | BEiT-L | 3x+MS | 58.4 | 50.8 | 58.9 | 51.3 | 401M | config | model |
| ViT-Adapter-L (TTA) | BEiT-L | 3x+MS | 60.2 | 52.2 | 60.4 | 52.5 | 401M | - | - |
Mask R-CNN
| Method | Backbone | Pre-train | Lr schd | box AP | mask AP | #Param | Config | Download |
|---|---|---|---|---|---|---|---|---|
| Mask R-CNN | ViT-Adapter-T | DeiT-T | 3x+MS | 46.0 | 41.0 | 28M | config | model |
| Mask R-CNN | ViT-Adapter-S | DeiT-S | 3x+MS | 48.2 | 42.8 | 48M | config | model |
| Mask R-CNN | ViT-Adapter-B | DeiT-B | 3x+MS | 49.6 | 43.6 | 120M | config | model |
| Mask R-CNN | ViT-Adapter-B | Uni-Perceiver | 3x+MS | 50.7 | 44.9 | 120M | config | model |
| Mask R-CNN | ViT-Adapter-L | AugReg-L | 3x+MS | 50.9 | 44.8 | 348M | config | model |
Advanced Detectors
| Method | Framework | Pre-train | Lr schd | box AP | mask AP | #Param | Config | Download |
|---|---|---|---|---|---|---|---|---|
| ViT-Adapter-S | Cascade Mask R-CNN | DeiT-S | 3x+MS | 51.5 | 44.3 | 86M | config | model |
| ViT-Adapter-S | ATSS | DeiT-S | 3x+MS | 49.6 | - | 36M | config | model |
| ViT-Adapter-S | GFL | DeiT-S | 3x+MS | 50.0 | - | 36M | config | model |
| ViT-Adapter-S | Sparse R-CNN | DeiT-S | 3x+MS | 48.1 | - | 110M | config | model |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE-B | 25ep+LSJ | 50.3 | 44.7 | 122M | config | model |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE-B | 50ep+LSJ | 50.8 | 45.1 | 122M | config | model |
To evaluate ViT-Adapter-L + HTC++ on COCO val2017 on a single node with 8 GPUs, run:
sh dist_test.sh configs/htc++/htc++_beit_adapter_large_fpn_3x_coco.py /path/to/checkpoint_file 8 --eval bbox segm
This should give:
Evaluate annotation type *bbox*
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.584
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.771
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.642
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.441
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.622
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.725
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.742
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.742
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.742
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.615
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.775
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.864
Evaluate annotation type *segm*
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.508
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.750
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.556
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.331
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.542
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.687
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.645
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.645
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.645
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.503
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.681
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.780
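To compare a run against the tables without reading the log by eye, the summary lines can be parsed into records. This is a minimal sketch that assumes the log lines follow exactly the format shown above. `LINE`, `parse_summary`, and `headline_ap` are hypothetical helpers, not part of this repository or of MMDetection.

```python
import re

# Matches one line of the COCO-style summary shown above.
LINE = re.compile(
    r"Average (?:Precision|Recall)\s+\((AP|AR)\)\s+@\[\s*IoU=([\d.:]+)\s*\|"
    r"\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]\s*=\s*(-?[\d.]+)"
)


def parse_summary(text):
    """Turn every summary line found in `text` into a dict, in order."""
    rows = []
    for match in LINE.finditer(text):
        metric, iou, area, max_dets, value = match.groups()
        rows.append(
            {
                "metric": metric,
                "iou": iou,
                "area": area,
                "max_dets": int(max_dets),
                "value": float(value),
            }
        )
    return rows


def headline_ap(rows):
    """Return the first AP at IoU=0.50:0.95 over all areas, or None."""
    for row in rows:
        if row["metric"] == "AP" and row["iou"] == "0.50:0.95" and row["area"] == "all":
            return row["value"]
    return None
```

Applied to the *bbox* block above, `headline_ap(parse_summary(log_text))` returns 0.584, which is the 58.4 mini-val box AP reported in the HTC++ table.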
To train ViT-Adapter-T + Mask R-CNN on COCO train2017 on a single node with 8 GPUs for 36 epochs, run:
sh dist_train.sh configs/mask_rcnn/mask_rcnn_deit_adapter_tiny_fpn_3x_coco.py 8
Please see issue#23.