This repository contains the official source code for our paper:
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Dongwon Kim, Namyup Kim, and Suha Kwak
POSTECH CSE
CVPR (Highlight), Vancouver, 2023.
Parts of our code are adapted from the following repositories:
- https://github.com/yalesong/pvse
- https://github.com/fartashf/vsepp
- https://github.com/lucidrains/perceiver-pytorch
data
├─ coco_download.sh
├─ coco # can be downloaded with coco_download.sh
│  ├─ images
│  │  └─ ......
│  └─ annotations
│     └─ ......
├─ coco_butd
│  └─ precomp
│     ├─ train_ids.txt
│     ├─ train_caps.txt
│     └─ ......
├─ f30k
│  ├─ images
│  │  └─ ......
│  ├─ dataset_flickr30k.json
│  └─ ......
└─ f30k_butd
   └─ precomp
      ├─ train_ids.txt
      ├─ train_caps.txt
      └─ ......

vocab # included in this repo
├─ coco_butd_vocab.pkl
└─ ......
- coco_butd and f30k_butd: datasets used for the Faster R-CNN image backbone. We use the pre-computed features provided by SCAN, which can be downloaded via https://github.com/kuanghuei/SCAN#download-data.
- coco and f30k: datasets used for the CNN backbones. Please refer to the COCO download script and the Flickr30K website (and the Flickr30K .json) to download the images and captions.
Note: Downloaded datasets should be placed according to the directory structure presented above.
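As a concrete, hedged example of arranging the data: the sketch below runs the provided COCO download script and unpacks the SCAN pre-computed features into place. The archive name (data.zip) and the inner folder names (coco_precomp, f30k_precomp) follow the SCAN repository's instructions and are assumptions here, not something this repo prescribes; adjust them to match what you actually downloaded.

```bash
# Sketch only; run from the repository root and adjust paths as needed.
cd data
sh coco_download.sh            # fetches COCO images and annotations

# Archive/folder names below are assumptions based on the SCAN repo.
unzip /path/to/data.zip -d scan_data
mkdir -p coco_butd f30k_butd
mv scan_data/coco_precomp coco_butd/precomp
mv scan_data/f30k_precomp f30k_butd/precomp
```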
You can install the required packages using conda:
conda create --name <env> --file requirements.txt
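A quick sanity check after creating the environment can save time. This assumes PyTorch is among the pinned requirements (which this codebase needs) and that `<env>` is whatever name you chose above:

```bash
conda activate <env>
# Prints the installed PyTorch version and whether CUDA is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```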
To train and evaluate on COCO, run:
sh train_eval_coco.sh
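If you need to restrict training to a particular GPU, the standard CUDA environment variable works; this is a generic sketch, not an option documented by the script itself:

```bash
# CUDA_VISIBLE_DEVICES is a generic CUDA/PyTorch mechanism, not a flag
# defined by train_eval_coco.sh; GPU index 0 is just an example.
CUDA_VISIBLE_DEVICES=0 sh train_eval_coco.sh
```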