# Official Repository for CIReVL [ICLR 2024]

### Vision-by-Language for Training-Free Compositional Image Retrieval

__Authors__: Shyamgopal Karthik*, Karsten Roth*, Massimiliano Mancini, Zeynep Akata

[arXiv Paper](https://arxiv.org/abs/2310.09291)

This repo extends the great code repository of [SEARLE](https://arxiv.org/abs/2303.15247), available [here](https://github.com/miccunifi/SEARLE).

---

## Overview

### Abstract

Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on annotating triplets (i.e., query image, textual modification, and target image), which is costly, recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple, yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via e.g. CLIP, we achieve modular language reasoning.

On four ZS-CIR benchmarks, we find competitive, in-part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to both investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to, in parts, more than double previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc.
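
To make the pipeline concrete, the snippet below is a minimal, illustrative sketch of the three CIReVL stages (caption the reference image, recompose the caption with an LLM, retrieve with CLIP). The captioning and LLM steps are stubbed-out placeholders and all function names are our own assumptions; only the retrieval step uses the real OpenCLIP API. See `src/main.py` for the actual implementation.

```python
# Conceptual sketch of the CIReVL pipeline; function names and the stubbed
# captioning/LLM steps are illustrative assumptions, not the repo's actual code.
from typing import List

import torch
import open_clip
from PIL import Image

clip_name = "ViT-B-32"  # any OpenCLIP backbone can be swapped in without re-training
model, _, preprocess = open_clip.create_model_and_transforms(clip_name, pretrained="openai")
tokenizer = open_clip.get_tokenizer(clip_name)
model.eval()


def caption_reference_image(image: Image.Image) -> str:
    # Placeholder: CIReVL uses a pre-trained generative VLM (e.g. BLIP-2) here.
    return "a photo of the eiffel tower surrounded by many people during the day"


def recompose_caption(caption: str, modification: str) -> str:
    # Placeholder: CIReVL prompts an LLM to rewrite the caption according to
    # the textual target modification.
    return f"{caption}, {modification}"


@torch.no_grad()
def retrieve(reference: Image.Image, modification: str, gallery: List[Image.Image]) -> int:
    """Return the index of the gallery image that best matches the recomposed caption."""
    target_caption = recompose_caption(caption_reference_image(reference), modification)
    text_feat = model.encode_text(tokenizer([target_caption]))
    image_feats = model.encode_image(torch.stack([preprocess(img) for img in gallery]))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feat.T).squeeze(1).argmax().item()
```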

---

## Table of Contents

- [Setting everything up](#setting-everything-up)
  - [Required Conda Environment](#required-conda-environment)
  - [Required Datasets](#required-datasets)
- [Running and Evaluating CIReVL](#running-and-evaluating-cirevl-on-all-datasets)
- [Citation](#citation)


---

## Setting Everything Up

### Required Conda Environment

After cloning this repository, install the relevant packages using:

```sh
conda create -n cirevl -y python=3.8
conda activate cirevl
pip install torch==1.11.0 torchvision==0.12.0 transformers==4.24.0 tqdm termcolor pandas==1.4.2 openai==0.28.0 salesforce-lavis open_clip_torch
pip install git+https://github.com/openai/CLIP.git
```

__Note:__ to use the default BLIP(-2) caption model, you need access to GPUs that support `bfloat16` (e.g. `A100`-class GPUs).
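
If you are unsure whether your GPU supports `bfloat16`, a quick check along these lines can help (our own snippet, not part of the repository); native `bfloat16` support typically requires CUDA compute capability 8.0 or higher (Ampere, e.g. A100):

```python
# Optional check (not part of the repo): bfloat16 needs compute capability >= 8.0.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
    print("bfloat16 supported:", (major, minor) >= (8, 0))
else:
    print("No CUDA device visible.")
```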

### Required Datasets

#### FashionIQ

Download the FashionIQ dataset following the instructions in
the [**official repository**](https://github.com/XiaoxiaoGuo/fashion-iq).
After downloading the dataset, ensure that the folder structure matches the following:

```
├── FASHIONIQ
│   ├── captions
|   |   ├── cap.dress.[train | val | test].json
|   |   ├── cap.toptee.[train | val | test].json
|   |   ├── cap.shirt.[train | val | test].json

│   ├── image_splits
|   |   ├── split.dress.[train | val | test].json
|   |   ├── split.toptee.[train | val | test].json
|   |   ├── split.shirt.[train | val | test].json

│   ├── images
|   |   ├── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
```
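
To verify the layout before running anything, a small sanity check like the following may help (our own sketch; the dataset root is an assumption and should point to your `--dataset-path`):

```python
# Optional sanity check (not part of the repo) for the expected FashionIQ layout.
from pathlib import Path

root = Path("/mnt/datasets_r/FASHIONIQ")  # adjust to your --dataset-path
for category in ("dress", "toptee", "shirt"):
    for split in ("train", "val", "test"):
        assert (root / "captions" / f"cap.{category}.{split}.json").is_file()
        assert (root / "image_splits" / f"split.{category}.{split}.json").is_file()
print("images found:", len(list((root / "images").glob("*.jpg"))))
```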

#### CIRR

Download the CIRR dataset following the instructions in the [**official repository**](https://github.com/Cuberick-Orion/CIRR).
After downloading the dataset, ensure that the folder structure matches the following:

```
├── CIRR
│   ├── train
|   |   ├── [0 | 1 | 2 | ...]
|   |   |   ├── [train-10108-0-img0.png | train-10108-0-img1.png | ...]

│   ├── dev
|   |   ├── [dev-0-0-img0.png | dev-0-0-img1.png | ...]

│   ├── test1
|   |   ├── [test1-0-0-img0.png | test1-0-0-img1.png | ...]

│   ├── cirr
|   |   ├── captions
|   |   |   ├── cap.rc2.[train | val | test1].json
|   |   ├── image_splits
|   |   |   ├── split.rc2.[train | val | test1].json
```

#### CIRCO

Download the CIRCO dataset following the instructions in the [**official repository**](https://github.com/miccunifi/CIRCO).
After downloading the dataset, ensure that the folder structure matches the following:

```
├── CIRCO
│   ├── annotations
|   |   ├── [val | test].json

│   ├── COCO2017_unlabeled
|   |   ├── annotations
|   |   |   ├── image_info_unlabeled2017.json
|   |   ├── unlabeled2017
|   |   |   ├── [000000243611.jpg | 000000535009.jpg | ...]
```
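
As with FashionIQ, a brief check of the CIRCO layout (again our own sketch, with an assumed dataset root) can catch download issues early:

```python
# Optional sanity check (not part of the repo) for the expected CIRCO layout.
from pathlib import Path

root = Path("/mnt/datasets_r/CIRCO")  # adjust to your --dataset-path
for split in ("val", "test"):
    assert (root / "annotations" / f"{split}.json").is_file()
coco = root / "COCO2017_unlabeled"
assert (coco / "annotations" / "image_info_unlabeled2017.json").is_file()
print("unlabeled images:", len(list((coco / "unlabeled2017").glob("*.jpg"))))
```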

#### GeneCIS

Download the GeneCIS benchmark following the instructions in the [**official repository**](https://github.com/facebookresearch/genecis).

---

## Running and Evaluating CIReVL on all Datasets

Example runs that compute all relevant evaluation metrics across the four benchmark datasets are provided in `example_benchmark_runs/example_benchmark_runs.sh`.

For example, to compute retrieval metrics on the FashionIQ Dress subset, simply run:

```sh
datapath=/mnt/datasets_r/FASHIONIQ
python src/main.py --dataset fashioniq_dress --split val --dataset-path $datapath --preload img_features captions mods --llm_prompt prompts.structural_modifier_prompt_fashion --clip ViT-B-32
```

This call to `src/main.py` covers the most relevant command-line arguments:

```sh
--dataset [name_of_dataset]  # Dataset to use, such as cirr, circo, fashioniq_dress, fashioniq_shirt (...)
--split [val_or_test]  # Compute validation metrics, or generate a test submission file where needed (cirr, circo).
--dataset-path [path_to_dataset_folder]
--preload [list_of_things_to_save_and_preload_if_available]  # Any of img_features, captions and mods (modified captions).
    # Whichever are passed, the correspondingly generated image features, BLIP captions and LLM-modified captions
    # are stored to disk. If the script is called again with the same parameters, the saved data is loaded instead,
    # which is much quicker. This is particularly useful when swapping individual models (such as the LLM for
    # different modified captions, or the retrieval model via img_features).
--llm_prompt [prompts.name_of_prompt_str]  # LLM prompt to use.
--clip [name_of_openclip_model]  # OpenCLIP model to use for cross-modal retrieval.
```
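
Thanks to `--preload` caching, sweeping over datasets or retrieval backbones largely reuses previously generated captions and modified captions. The sketch below is our own (not a script shipped with the repo); the subset name `fashioniq_toptee` and the `ViT-L-14` backbone are assumptions following the naming pattern above. It loops `src/main.py` over the FashionIQ subsets with a larger OpenCLIP model:

```python
# Illustrative sweep over FashionIQ subsets with a larger OpenCLIP backbone.
# Not part of the repository; subset names follow the pattern documented above.
import subprocess

datapath = "/mnt/datasets_r/FASHIONIQ"
for subset in ("fashioniq_dress", "fashioniq_shirt", "fashioniq_toptee"):
    subprocess.run(
        [
            "python", "src/main.py",
            "--dataset", subset,
            "--split", "val",
            "--dataset-path", datapath,
            "--preload", "img_features", "captions", "mods",
            "--llm_prompt", "prompts.structural_modifier_prompt_fashion",
            "--clip", "ViT-L-14",
        ],
        check=True,
    )
```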
| 149 | + |
| 150 | +--- |
| 151 | + |
| 152 | +## Citation |
| 153 | + |
| 154 | +```bibtex |
| 155 | +@misc{karthik2023visionbylanguage, |
| 156 | + title={Vision-by-Language for Training-Free Compositional Image Retrieval}, |
| 157 | + author={Shyamgopal Karthik and Karsten Roth and Massimiliano Mancini and Zeynep Akata}, |
| 158 | + year={2023}, |
| 159 | + eprint={2310.09291}, |
| 160 | + archivePrefix={arXiv}, |
| 161 | + primaryClass={cs.CV} |
| 162 | +} |
| 163 | +``` |
| 164 | + |