This is an official PyTorch implementation of **Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation** (ICCV 2023).
## Environment

- PyTorch (e.g., 1.8.1+cu111)
- Other dependencies listed in `requirements.txt`

```shell
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```
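After installing, a quick sanity check can confirm the versions and that CUDA is visible (a minimal sketch; the exact build string depends on your driver and install):

```python
import torch
import torchvision

# Verify the installed versions match the pinned ones above.
print(torch.__version__)          # expected: 1.8.1+cu111
print(torchvision.__version__)    # expected: 0.9.1+cu111

# Should print True on a CUDA-capable machine with a compatible driver.
print(torch.cuda.is_available())
```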
## Datasets

Detailed preparation instructions are in `prepare_datasets.md`.
## Pretrained weights

Download the pretrained weights of ResNet-50/101 and ViT-B to `pretrain`:

```shell
mkdir pretrain && cd pretrain

# ResNet-50
wget https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt

# ResNet-101
wget https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt

# ViT-B
wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
```
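As an optional check (not part of the official pipeline): OpenAI's CLIP checkpoints are distributed as TorchScript archives, so loading them with `torch.jit.load` on CPU is enough to confirm the downloads are intact. Paths assume the layout created above:

```python
import torch

# Each CLIP release is a TorchScript archive; a successful load and a
# non-empty state dict indicate the file downloaded completely.
for ckpt in ["pretrain/RN50.pt", "pretrain/RN101.pt", "pretrain/ViT-B-16.pt"]:
    model = torch.jit.load(ckpt, map_location="cpu")
    print(ckpt, "->", len(model.state_dict()), "tensors")
```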
## Quick Start

To train ETRIS, modify the script according to your requirements and run:

```shell
bash run_scripts/train.sh
```

To evaluate ETRIS, modify the script according to your requirements and run:

```shell
bash run_scripts/test.sh
```
## Model Zoo

The weights of our trained models are available at https://pan.baidu.com/s/1jaOJKdIg1t8wnWrxgCkkRA?pwd=vmyv (use the extraction code `vmyv` if prompted).
## Acknowledgements

The code is based on CRIS. We thank the authors for open-sourcing their code and encourage users to cite their work where applicable.
## Citation

If ETRIS is useful for your research, please consider citing:

```bibtex
@inproceedings{xu2023bridging,
  title={Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation},
  author={Xu, Zunnan and Chen, Zhihong and Zhang, Yong and Song, Yibing and Wan, Xiang and Li, Guanbin},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={17503--17512},
  year={2023}
}
```