Applying ViT-Adapter to Object Detection

Our detection code is developed on top of MMDetection v2.22.0.

For details, see the paper "Vision Transformer Adapter for Dense Predictions".

If you use this code in a paper, please cite:

@article{chen2022vitadapter,
  title={Vision Transformer Adapter for Dense Predictions},
  author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.08534},
  year={2022}
}

Usage

Install MMDetection v2.22.0.

# recommended environment: torch1.9 + cuda11.1
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install timm==0.4.12
pip install mmdet==2.22.0
pip install instaboostfast # for htc++
cd ops && sh make.sh # compile deformable attention
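
After installing, it can help to confirm that the pinned versions are the ones actually present. The following is a minimal sketch, not part of this repository: the helper names `installed_version` and `check_pins` are hypothetical, and the pins are simply copied from the commands above. It uses only the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib import metadata

# Pins copied from the install commands above. Local build tags such as
# "+cu111" are ignored when comparing.
PINNED = {
    "torch": "1.9.0",
    "torchvision": "0.10.0",
    "torchaudio": "0.9.0",
    "mmcv-full": "1.4.2",
    "timm": "0.4.12",
    "mmdet": "2.22.0",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins=PINNED, lookup=installed_version):
    """Map each mismatching package to an (expected, found) pair."""
    problems = {}
    for package, expected in pins.items():
        found = lookup(package)
        # Strip a local version label like "+cu111" before comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            problems[package] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result from `check_pins()` means every pinned package matches. The `lookup` argument exists only so the check can be exercised without the packages installed.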

Data Preparation

Prepare COCO according to the guidelines in MMDetection v2.22.0.

Pre-training Sources

Name	Type	Year	Data	Repo	Paper
DeiT	Supervised	2021	ImageNet-1K	repo	paper
AugReg	Supervised	2021	ImageNet-22K	repo	paper
BEiT	MIM	2021	ImageNet-22K	repo	paper
MAE	MIM	2021	ImageNet-1K	repo	paper
Uni-Perceiver	Supervised	2022	Multi-Modal	-	paper

Results and Models

HTC++

Backbone	Pre-train	Lr schd	box AP (mini-val)	mask AP (mini-val)	box AP (test-dev)	mask AP (test-dev)	#Param	Config	Download
ViT-Adapter-L	BEiT-L	3x+MS	58.4	50.8	58.9	51.3	401M	config	model
ViT-Adapter-L (TTA)	BEiT-L	3x+MS	60.2	52.2	60.4	52.5	401M	-	-

Mask R-CNN

Method	Backbone	Pre-train	Lr schd	box AP	mask AP	#Param	Config	Download
Mask R-CNN	ViT-Adapter-T	DeiT-T	3x+MS	46.0	41.0	28M	config	model
Mask R-CNN	ViT-Adapter-S	DeiT-S	3x+MS	48.2	42.8	48M	config	model
Mask R-CNN	ViT-Adapter-B	DeiT-B	3x+MS	49.6	43.6	120M	config	model
Mask R-CNN	ViT-Adapter-B	Uni-Perceiver	3x+MS	50.7	44.9	120M	config	model
Mask R-CNN	ViT-Adapter-L	AugReg-L	3x+MS	50.9	44.8	348M	config	model

Advanced Detectors

Backbone	Framework	Pre-train	Lr schd	box AP	mask AP	#Param	Config	Download
ViT-Adapter-S	Cascade Mask R-CNN	DeiT-S	3x+MS	51.5	44.3	86M	config	model
ViT-Adapter-S	ATSS	DeiT-S	3x+MS	49.6	-	36M	config	model
ViT-Adapter-S	GFL	DeiT-S	3x+MS	50.0	-	36M	config	model
ViT-Adapter-S	Sparse R-CNN	DeiT-S	3x+MS	48.1	-	110M	config	model
ViT-Adapter-B	Upgraded Mask R-CNN	MAE-B	25ep+LSJ	50.3	44.7	122M	config	model
ViT-Adapter-B	Upgraded Mask R-CNN	MAE-B	50ep+LSJ	50.8	45.1	122M	config	model

Evaluation

To evaluate ViT-Adapter-L + HTC++ on COCO val2017 on a single node with 8 GPUs, run:

sh dist_test.sh configs/htc++/htc++_beit_adapter_large_fpn_3x_coco.py /path/to/checkpoint_file 8 --eval bbox segm

This should give:

Evaluate annotation type *bbox*
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.584
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.771
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.642
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.441
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.622
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.725
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.742
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.742
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.742
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.615
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.775
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.864

Evaluate annotation type *segm*
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.508
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.750
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.556
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.331
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.542
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.687
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.645
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.645
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.645
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.503
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.681
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.780
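
To compare a run against the tables above, the summary can be parsed into numbers. The sketch below is an illustration only and assumes the log has exactly the line format shown above; `parse_coco_summary` and `headline_ap` are hypothetical helper names, not functions provided by this repository or by MMDetection.

```python
import re

# Matches one summary line of the evaluator output shown above.
SUMMARY_LINE = re.compile(
    r"Average (?P<kind>Precision|Recall)\s+\((?:AP|AR)\)\s+@\[\s*"
    r"IoU=(?P<iou>[\d.:]+)\s*\|\s*"
    r"area=\s*(?P<area>\w+)\s*\|\s*"
    r"maxDets=\s*(?P<max_dets>\d+)\s*\]\s*=\s*(?P<value>-?\d+\.\d+)"
)
SECTION_HEADER = re.compile(r"Evaluate annotation type \*(\w+)\*")


def parse_coco_summary(log_text):
    """Group summary lines by annotation type ("bbox", "segm")."""
    results = {}
    current = None
    for line in log_text.splitlines():
        header = SECTION_HEADER.search(line)
        if header:
            current = header.group(1)
            results[current] = []
            continue
        match = SUMMARY_LINE.search(line)
        if match and current is not None:
            results[current].append({
                "kind": match.group("kind"),
                "iou": match.group("iou"),
                "area": match.group("area"),
                "max_dets": int(match.group("max_dets")),
                "value": float(match.group("value")),
            })
    return results


def headline_ap(results, task):
    """Return AP at IoU=0.50:0.95 over all areas, the figure the tables report."""
    for entry in results[task]:
        if (entry["kind"], entry["iou"], entry["area"]) == ("Precision", "0.50:0.95", "all"):
            return entry["value"]
    raise KeyError(f"no headline AP found for {task!r}")
```

For the output above, `headline_ap(results, "bbox")` and `headline_ap(results, "segm")` correspond to the 58.4 box AP and 50.8 mask AP listed for ViT-Adapter-L in the HTC++ table.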

Training

To train ViT-Adapter-T + Mask R-CNN on COCO train2017 on a single node with 8 GPUs for 36 epochs, run:

sh dist_train.sh configs/mask_rcnn/mask_rcnn_deit_adapter_tiny_fpn_3x_coco.py 8
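
The 36 epochs correspond to the "3x" in the config name and in the "Lr schd" columns: in MMDetection naming, a 1x schedule is 12 epochs. The sketch below works through that arithmetic and a rough iteration count. The function names are hypothetical, and the dataset size and per-GPU batch size in the example call are assumptions for illustration, not values read from the config.

```python
import math

EPOCHS_PER_1X = 12  # MMDetection convention: a "1x" schedule is 12 epochs


def schedule_epochs(multiplier):
    """Epochs for an "Nx" schedule, e.g. 3 -> 36."""
    return multiplier * EPOCHS_PER_1X


def total_iterations(num_images, gpus, samples_per_gpu, epochs):
    """Optimizer steps for a run, counting a partial last batch per epoch."""
    per_epoch = math.ceil(num_images / (gpus * samples_per_gpu))
    return per_epoch * epochs


if __name__ == "__main__":
    epochs = schedule_epochs(3)
    # Assumed values: 118,287 images (the commonly cited size of COCO
    # train2017; the data loader may filter some) and 2 samples per GPU.
    print(epochs, total_iterations(118287, gpus=8, samples_per_gpu=2, epochs=epochs))
```

Changing the number of GPUs changes the effective batch size, so the learning rate in the config may need adjusting accordingly.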

Image Demo & Video Demo

Please see issue #23.
