This is LAVSE (pronounced /læːvɪs/), the official source code for the ICCV'19 paper Language-Agnostic Visual-Semantic Embeddings. This repository is inspired by VSEPP, SCAN, and BootstrapPytorch.
Project page with live demo, more details, and figures: https://jwehrmann.github.io/projects.lavse/.
Features:
- Training and validation of SOTA models on multiple datasets (COCO, Flickr30k, Multi30k, and YJ Captions).
- Single- and multi-language support (English, German, Japanese).
- Text encoders (GRU, LIWE, GloVe embeddings); new options, such as BERT, are easy to add.
- Image encoders (precomputed features and full ConvNet encoders).
- Similarity computation (easy to extend; attention layers are supported).
- Warm-up hinge-loss function (a sketch follows this list).
- Tensorboard logging.
- Novel retrieval splits of YJ Captions for retrieval evaluation.
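To make these moving parts concrete, here is a minimal, hedged sketch of the VSE-style pipeline the features above describe: a projection over precomputed image features, a GRU caption encoder, and the max-violation hinge (triplet ranking) loss. All class and function names are illustrative rather than the repo's actual API, and the warm-up schedule (e.g., training without max violation at first) is omitted:

```python
# Illustrative VSE-style pipeline: linear projection of precomputed image
# features, a GRU caption encoder, and the bidirectional hinge loss over
# in-batch negatives. Padding handling is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrecompImageEncoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=1024):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):                 # feats: (B, regions, feat_dim)
        emb = self.fc(feats).mean(dim=1)      # pool regions -> (B, embed_dim)
        return F.normalize(emb, dim=-1)       # L2-normalize for cosine sim

class GRUTextEncoder(nn.Module):
    def __init__(self, vocab_size, word_dim=300, embed_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, tokens):                # tokens: (B, seq_len) int64
        _, h = self.gru(self.embed(tokens))   # h: (1, B, embed_dim)
        return F.normalize(h.squeeze(0), dim=-1)

def hinge_loss(img_emb, cap_emb, margin=0.2, max_violation=True):
    """Bidirectional triplet ranking loss; diagonal pairs are positives."""
    scores = img_emb @ cap_emb.t()            # (B, B) cosine similarities
    diag = scores.diag().view(-1, 1)
    cost_cap = (margin + scores - diag).clamp(min=0)      # caption retrieval
    cost_img = (margin + scores - diag.t()).clamp(min=0)  # image retrieval
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    if max_violation:                         # VSE++-style hardest negatives
        return cost_cap.max(1)[0].sum() + cost_img.max(0)[0].sum()
    return cost_cap.sum() + cost_img.sum()
```

Given a batch of paired features and tokenized captions, `hinge_loss(img_enc(feats), txt_enc(tokens))` yields the training objective.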
Implemented models:
- LIWE
- CMLR
- SCAN (i2t, t2i)
- VSEPP
Results on COCO (1k test split):

Approach | Image Annotation R@1 | Image Retrieval R@1 | Test Time |
---|---|---|---|
LIWE+GloVe (ours) | 73.2 | 57.9 | 1s |
LIWE (ours) | 71.8 | 55.5 | 1s |
CMLR (ours) | 71.8 | 56.2 | 1s |
SCAN-t2i (ours) | 70.9 | 56.4 | 50s |
SCAN-t2i | 70.9 | 56.4 | 250s |
SCAN-i2t | 69.2 | 54.4 | 250s |
Note that our implementation of SCAN is 5x faster than the original code.
Results on Flickr30k:

Approach | Image Annotation R@1 | Image Retrieval R@1 | Test Time |
---|---|---|---|
LIWE+GloVe (ours) | 69.6 | 51.2 | 1s |
LIWE (ours) | 66.4 | 47.5 | 1s |
CMLR (ours) | 64.0 | 46.8 | 1s |
SCAN-i2t | 67.9 | 43.9 | 250s |
SCAN-t2i | 61.8 | 45.8 | 250s |
Note that our implementation of SCAN is 5x faster than the original code.
Results on Multi30k:

Approach | Annotation R@1 (en) | Retrieval R@1 (en) | Annotation R@1 (de) | Retrieval R@1 (de) | #Params |
---|---|---|---|---|---|
LIWE (ours) | 64.4 | 47.5 | 53.0 | 36.7 | 3M |
CMLR (ours) | 59.9 | 43.9 | 50.4 | 34.6 | 12M |
BERT-ML | 62.0 | 42.7 | 50.9 | 33.2 | 110M |
Results on YJ Captions:

Approach | Annotation R@1 (en) | Retrieval R@1 (en) | Annotation R@1 (jt) | Retrieval R@1 (jt) | Test Time |
---|---|---|---|---|---|
LIWE (ours) | 59.2 | 46.1 | 48.6 | 37.0 | 1s |
CMLR (ours) | 56.9 | 43.2 | 51.4 | 38.6 | 1s |
SCAN-t2i | 58.2 | 47.4 | 48.2 | 39.6 | 250s |
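The large gaps in the Test Time columns come from how similarities are computed: LIWE and CMLR embed images and captions independently, so an entire test split can be scored with a single matrix product, whereas SCAN recomputes cross-attention for every image-caption pair. A minimal sketch of the global-embedding case (assuming L2-normalized embeddings; the function name is illustrative):

```python
import torch

def global_scores(img_emb: torch.Tensor, cap_emb: torch.Tensor) -> torch.Tensor:
    """All pairwise similarities in one matrix product.

    img_emb: (n_images, d), cap_emb: (n_captions, d), both L2-normalized,
    so the dot product equals cosine similarity. One multiplication scores
    every pair at once, which is why global-embedding models evaluate in ~1s.
    """
    return img_emb @ cap_emb.t()  # (n_images, n_captions)
```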
You can download each dataset as follows:

- COCO+F30k data:

  ```bash
  wget https://scanproject.blob.core.windows.net/scan-data/data.zip
  ```

- Annotations only (COCO and F30k):

  ```bash
  wget https://scanproject.blob.core.windows.net/scan-data/data_no_feature.zip
  ```

- YJ Captions:

  ```bash
  wget https://wehrmann.s3-us-west-2.amazonaws.com/jap_precomp.tar
  ```

- Multi30k:

  ```bash
  wget https://wehrmann.s3-us-west-2.amazonaws.com/m30k_precomp.tar
  ```
IMPORTANT: set your data path using:

```bash
export DATA_PATH=/path/to/data
```

After downloading, extract each archive into $DATA_PATH so that it looks as follows:
```
$DATA_PATH
├── coco_precomp
│   ├── train_caps.en.txt
│   ├── train_ims.npy
│   ├── train_ids.npy
│   ├── dev_caps.txt
│   └── ...
├── f30k_precomp
│   ├── train_caps.en.txt
│   ├── train_ims.npy
│   ├── train_ids.npy
│   ├── dev_caps.txt
│   └── ...
├── m30k_precomp
│   ├── train_caps.en.txt
│   ├── train_caps.de.txt
│   ├── train_ims.npy
│   ├── train_ids.npy
│   ├── dev_caps.en.txt
│   ├── dev_caps.de.txt
│   └── ...
└── jap_precomp
    ├── train_caps.jp.txt
    ├── train_caps.jt.txt
    ├── train_caps.en.txt
    ├── train_ims.npy
    ├── train_ids.npy
    ├── dev_caps.jp.txt
    ├── dev_caps.jt.txt
    ├── dev_caps.en.txt
    └── ...
```
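A quick way to sanity-check a downloaded split (a hypothetical snippet; it only assumes the .npy files hold numpy arrays of precomputed features and the .txt files hold one caption per line):

```python
import os
import numpy as np

data_path = os.environ['DATA_PATH']
split_dir = os.path.join(data_path, 'f30k_precomp')

# Precomputed image features (shape depends on the feature extractor,
# e.g., (n, regions, dim) for region features).
feats = np.load(os.path.join(split_dir, 'train_ims.npy'))
with open(os.path.join(split_dir, 'train_caps.en.txt')) as f:
    caps = [line.strip() for line in f]

print('image features:', feats.shape)
print('captions:', len(caps), '- e.g.,', caps[0])
```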
To use full image encoders (only pure ConvNets are supported for now, i.e., no FasterRCNN-based encoders), you need to download the original images from their original sources.
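For reference, a full ConvNet encoder boils down to running a standard backbone over raw images and projecting the pooled features. A hedged sketch with torchvision (illustrative, not the repo's exact encoder):

```python
import torch.nn as nn
import torchvision.models as models

class ConvNetImageEncoder(nn.Module):
    """Global image features from a pretrained ResNet (illustrative)."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        backbone = models.resnet152(pretrained=True)
        # Drop the classification head; keep everything up to global pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                   # (B, 3, H, W), normalized
        x = self.features(images).flatten(1)     # (B, 2048)
        x = self.fc(x)
        return x / x.norm(dim=-1, keepdim=True)  # L2-normalize
```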
Dependencies:
- Python >= 3.6
- PyTorch >= 1.1
- addict
- PyYAML

The easiest way to set up your environment is by running:

```bash
conda env create -f lavse370.yaml
```
In addition, you can change DATA_PATH (the path to the downloaded data) at any time by running:

```bash
export DATA_PATH=/opt/jonatas/datasets/lavse/
```
Model configuration is done via yaml files, so training a model is a single command:

```bash
python run.py -o options/<yaml_path>.yaml
```

Inside options/ you can find all the configuration files needed to reproduce our results. The scripts used to train the models in our work are in options/liwe/train.sh.
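Since the options are plain yaml parsed into attribute-style dicts (the repo depends on PyYAML and addict), loading one programmatically looks roughly like this (the file path and key names are hypothetical examples):

```python
import yaml
from addict import Dict

# Hypothetical path; use any yaml file under options/.
with open('options/liwe/f30k.yaml') as f:
    opts = Dict(yaml.safe_load(f))

# addict.Dict allows attribute access to nested keys,
# e.g., opts.model or opts.dataset (key names depend on the file).
print(opts.keys())
```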
Evaluating models is also straightforward; everything is driven by the same yaml config file used for training:

```bash
python test.py options/<path>.yaml --data_split <train/dev/test>
```
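The R@K numbers reported in the tables above measure how often the correct match appears among the top-K retrieved items. A minimal sketch of the metric over a similarity matrix (illustrative; it assumes one ground-truth candidate per query on the diagonal, whereas the actual datasets pair each image with several captions):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, k: int = 1) -> float:
    """scores[i, j] = similarity between query i and candidate j;
    the ground-truth candidate for query i is assumed to be j == i."""
    order = np.argsort(-scores, axis=1)               # best candidates first
    ranks = np.array([np.where(order[i] == i)[0][0]   # rank of the truth
                      for i in range(scores.shape[0])])
    return float(np.mean(ranks < k))

# Example: 3 queries, correct candidate always ranked first -> R@1 = 1.0
sim = np.array([[0.9, 0.1, 0.2],
                [0.0, 0.8, 0.3],
                [0.1, 0.2, 0.7]])
print(recall_at_k(sim, k=1))
```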
To aggregate and print results from the generated log files:

```bash
cd tools
find ../logs/ -name '*json' -print0 | xargs -0 python print_result.py
```
If you find this code/paper useful, please consider citing our work:

```bibtex
@inproceedings{wehrmann2019iccv,
  title={Language-Agnostic Visual-Semantic Embeddings},
  author={Wehrmann, Jonatas and Souza, Douglas M. and Lopes, Mauricio A. and Barros, Rodrigo C.},
  booktitle={International Conference on Computer Vision},
  year={2019}
}

@inproceedings{wehrmann2018cvpr,
  title={Bidirectional retrieval made simple},
  author={Wehrmann, Jonatas and Barros, Rodrigo C.},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={7718--7726},
  year={2018}
}
```