GitHub - Westlake-AI/A2MIM at df68febcd611c29a9fd672a9e44d95758cbfc20d


Architecture-Agnostic Masked Image Modeling - From ViT back to CNN

Siyuan Li^*,1,2, Di Wu^*,1,2, Fang Wu^1,3, Zelin Zang^1,2, Stan Z. Li^†,1

¹Westlake University, ²Zhejiang University, ³Tsinghua University

Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers (ViT). Its underlying idea is simple: a portion of the input image is randomly masked out and then reconstructed via the pretext task. However, why MIM works well is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this paper, we first study interactions among patches to understand what knowledge is learned and how it is acquired via the MIM task. We observe that MIM essentially teaches the model to learn better middle-level interactions among patches and extract more generalized features. Based on this observation, we propose an Architecture-Agnostic Masked Image Modeling framework (A2MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A2MIM learns better representations and gives the backbone model a stronger capability to transfer to various downstream tasks, for both Transformers and CNNs.
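The generic mask-and-reconstruct idea described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration of random patch masking with a loss computed only on masked pixels, not the A2MIM implementation. The function names, the patch size of 32, and the mask ratio of 0.6 are assumptions chosen for the example.

```python
import random


def random_patch_mask(height, width, patch_size=32, mask_ratio=0.6, seed=0):
    """Return a grid of 0/1 flags, one per patch (1 = patch is masked).

    patch_size and mask_ratio are illustrative values, not A2MIM settings.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_masked = int(round(mask_ratio * num_patches))
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[1 if r * cols + c in masked else 0 for c in range(cols)]
            for r in range(rows)]


def apply_mask(image, mask, patch_size, fill=0.0):
    """Replace every pixel that falls inside a masked patch with `fill`."""
    return [[fill if mask[y // patch_size][x // patch_size] else value
             for x, value in enumerate(row)]
            for y, row in enumerate(image)]


def masked_l1_loss(pred, target, mask, patch_size):
    """Mean absolute error over masked pixels only (visible pixels are ignored)."""
    total, count = 0.0, 0
    for y, (pred_row, target_row) in enumerate(zip(pred, target)):
        for x, (p, t) in enumerate(zip(pred_row, target_row)):
            if mask[y // patch_size][x // patch_size]:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


# Toy usage: a 4x4 single-channel "image" split into 2x2 patches.
image = [[1.0] * 4 for _ in range(4)]
mask = random_patch_mask(4, 4, patch_size=2, mask_ratio=0.5, seed=0)
corrupted = apply_mask(image, mask, patch_size=2)
# A real model would predict the missing pixels; here the corrupted input
# stands in for an untrained prediction.
loss = masked_l1_loss(corrupted, image, mask, patch_size=2)
```

In a real pipeline the corrupted image is fed to the backbone and a decoder head predicts the masked region. The loss above only shows where the supervision signal comes from.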

Table of Contents

Catalog
License
Acknowledgement
Citation

Catalog

We have released implementations of A2MIM based on OpenMixup. In the future, we plan to add A2MIM implementations to MMPretrain. Pre-trained and fine-tuned models are available on GitHub / Baidu Cloud.

Update camera-ready version of A2MIM or arXiv.
ImageNet pre-training and fine-tuning with OpenMixup [config_pretrain] [config_finetune]
ImageNet pre-training and fine-tuning with MMPretrain
Downstream Transfer to Object Detection on COCO with MMDetection [config]
Downstream Transfer to Semantic Segmentation on ADE20K with MMSegmentation [config]
Analysis tools and results
Visualization of pre-training on Google Colab and Notebook Demo

Pre-training on ImageNet

1. Installation

Please refer to INSTALL.md for installation instructions.

2. Pre-training and fine-tuning

We provide scripts for multi-GPU pre-training with a specified CONFIG_FILE:

bash tools/dist_train.sh ${CONFIG_FILE} ${GPUS} [optional arguments]

For example, you can run the script below to pre-train ResNet-50 with A2MIM on ImageNet with 8 GPUs:

PORT=29500 bash tools/dist_train.sh configs/openmixup/pretrain/a2mim/imagenet/r50_l3_sz224_init_8xb256_cos_ep300.py 8

After pre-training, you can extract the backbone weights, then fine-tune and evaluate the model with the corresponding scripts:

python tools/model_converters/extract_backbone_weights.py work_dirs/openmixup/pretrain/a2mim/imagenet/r50_l3_sz224_init_8xb256_cos_ep300/latest.pth ${PATH_TO_CHECKPOINT}
PORT=29500 bash tools/dist_train_ft_8gpu.sh configs/openmixup/finetune/imagenet/r50_rsb_a3_ft_sz160_4xb512_cos_fp16_ep100.py ${PATH_TO_CHECKPOINT}
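In the example above, the checkpoint passed to extract_backbone_weights.py sits under work_dirs/ at a path that mirrors the config path (the configs/ prefix and the .py suffix are dropped). Assuming the same convention holds for other configs, which is inferred from this single example rather than stated by the repository, a small helper can derive the expected checkpoint location. The name `default_checkpoint_path` is hypothetical and not part of the codebase.

```python
from pathlib import PurePosixPath


def default_checkpoint_path(config_file, checkpoint_name="latest.pth"):
    """Map configs/<...>/<name>.py to work_dirs/<...>/<name>/<checkpoint_name>.

    Assumption: the work directory mirrors the config path, as in the
    ResNet-50 example above. Adjust if a custom work directory is used.
    """
    cfg = PurePosixPath(config_file)
    if not cfg.parts or cfg.parts[0] != "configs" or cfg.suffix != ".py":
        raise ValueError("expected a path of the form configs/<...>/<name>.py")
    # Keep the intermediate directories, replace the file with a directory
    # named after the config (without its .py suffix).
    return str(PurePosixPath("work_dirs", *cfg.parts[1:-1], cfg.stem, checkpoint_name))


config = "configs/openmixup/pretrain/a2mim/imagenet/r50_l3_sz224_init_8xb256_cos_ep300.py"
checkpoint = default_checkpoint_path(config)
print(checkpoint)
```

The printed path can be used as the first argument to the extraction script. The second argument, ${PATH_TO_CHECKPOINT}, is where the extracted backbone weights are written.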

License

This project is released under the Apache 2.0 license.

Acknowledgement

Our implementation is mainly based on the following codebases. We gratefully thank the authors for their wonderful work.

OpenMixup: Open-source toolbox for supervised and self-supervised visual representation learning.
pytorch-image-models: PyTorch image models, scripts, pretrained weights.
SimMIM: Official PyTorch implementation of SimMIM.
MMPretrain: OpenMMLab Pre-training Toolbox and Benchmark.
MMDetection: OpenMMLab Detection Toolbox and Benchmark.
MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark.

Citation

If you find this repository helpful, please consider citing our paper:

@inproceedings{zbontar2021barlow,
  title={Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN},
  author={Li, Siyuan and Wu, Di and Wu, Fang and Zang, Zelin and Li, Stan Z.},
  booktitle={International Conference on Machine Learning},
  year={2023},
}
