We release the code and models of MLACLIP for multilingual image-text retrieval. The models are pretrained on CC300K and finetuned on Multi30K.
torch >= 1.7.1
transformers
opencv-python
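These dependencies can be installed with pip; aside from torch, no version pins are given in the source, so the versions of the other packages are left open:
pip install "torch>=1.7.1" transformers opencv-python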
The pretrained models (CLIP & M-BERT, used for initialization) can be downloaded here:
unzip pretrained_model.zip
Detailed configuration files and checkpoints can be found here:
unzip expr.zip
Download the annotations and unzip them to ./dataset/:
unzip dataset.zip
Conceptual Captions images can be crawled here. After crawling them from the web, place all images under dataset/ConceptualCaption/images.
The released models are trained on CC300K, a subset of Conceptual Captions defined in dataset/ConceptualCaption/cc300k.json.
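As a rough sanity check after crawling (the expected count simply reflects the subset's name, and some source URLs may be dead, so a somewhat smaller number is normal):
# Should be on the order of 300K images for CC300K
ls dataset/ConceptualCaption/images | wc -l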
Flickr30K images can be requested here. Untar them to dataset/Multi30k:
tar -xzvf flickr30k_images.tar.gz -C dataset/Multi30k
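Assuming the archive extracts to a flickr30k_images/ directory (the usual layout for this tarball), a quick count can confirm the extraction:
# Flickr30K contains 31,783 images
ls dataset/Multi30k/flickr30k_images | wc -l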
MSCOCO images can be downloaded and prepared with the following commands:
wget -c http://images.cocodataset.org/zips/train2014.zip
wget -c http://images.cocodataset.org/zips/val2014.zip
wget -c http://images.cocodataset.org/zips/test2014.zip
mkdir -p dataset/MSCOCO/images
unzip -d dataset/MSCOCO/images train2014.zip
unzip -d dataset/MSCOCO/images val2014.zip
unzip -d dataset/MSCOCO/images test2014.zip
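The zip archives can be removed once extraction succeeds:
rm train2014.zip val2014.zip test2014.zip

Training runs in two stages. Each call to train.sh takes a configuration file and a trailing device argument (the 0 below is presumably a GPU index; this is an assumption, as train.sh itself is not shown here):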
# NLT stage
bash train.sh \
expr/vitb32/NLT/config.json 0
# LE stage
bash train.sh \
expr/vitb32/LE/config.json 0
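Finetuning on Multi30K follows the same pattern, first on the English split and then on all languages (as the config names suggest):
# Finetune on English Multi30K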
bash train.sh \
expr/vitb32/finetune-en-m30k/config.json 0
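# Finetune on all Multi30K languages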
bash train.sh \
expr/vitb32/finetune-all-m30k/config.json 0
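For evaluation, inference.sh takes two model checkpoints, the evaluation dataset(s), and an output directory. The exact roles of the two checkpoints are defined by inference.sh itself; the examples below follow the released configurations.
# Evaluate the LE model on Multi30K and MSCOCO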
bash inference.sh \
expr/vitb32/LE/pytorch_model.bin.1 \
expr/vitb32/LE/pytorch_model.bin.1 \
m30k+coco \
expr/vitb32/LE/eval_m30k+coco
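# Evaluate the English-finetuned model on Multi30K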
bash inference.sh \
expr/vitb32/finetune-en-m30k/pytorch_model.bin.4 \
expr/vitb32/LE/pytorch_model.bin.1 \
m30k \
expr/vitb32/finetune-en-m30k/eval_m30k
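# Evaluate the model finetuned on all Multi30K languages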
bash inference.sh \
expr/vitb32/finetune-en-m30k/pytorch_model.bin.4 \
expr/vitb32/finetune-all-m30k/pytorch_model.bin.10000 \
m30k \
expr/vitb32/finetune-all-m30k/eval_m30k