Commit f3b6851: first sync (0 parents)

File tree: 2 files changed (+167, -0 lines)

.gitignore (3 additions, 0 deletions)

precomputed
__pycache__
*pyc

README.md (164 additions, 0 deletions)
# Official Repository for CIReVL [ICLR 2024]

### Vision-by-Language for Training-Free Compositional Image Retrieval

__Authors__: Shyamgopal Karthik*, Karsten Roth*, Massimiliano Mancini, Zeynep Akata

[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.09291)
<!-- [![GitHub Stars](https://img.shields.io/github/stars/miccunifi/SEARLE?style=social)](https://github.com/miccunifi/SEARLE) -->
<!-- [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-shot-composed-image-retrieval-with/zero-shot-composed-image-retrieval-zs-cir-on)](https://paperswithcode.com/sota/zero-shot-composed-image-retrieval-zs-cir-on?p=zero-shot-composed-image-retrieval-with)\
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-shot-composed-image-retrieval-with/zero-shot-composed-image-retrieval-zs-cir-on-1)](https://paperswithcode.com/sota/zero-shot-composed-image-retrieval-zs-cir-on-1?p=zero-shot-composed-image-retrieval-with)\
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-shot-composed-image-retrieval-with/zero-shot-composed-image-retrieval-zs-cir-on-2)](https://paperswithcode.com/sota/zero-shot-composed-image-retrieval-zs-cir-on-2?p=zero-shot-composed-image-retrieval-with) -->

This repo extends the great code repository of [SEARLE](https://arxiv.org/abs/2303.15247), available [here](https://github.com/miccunifi/SEARLE).

---

## Overview

### Abstract

Given an image and a target modification (e.g. an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on annotating triplets (i.e. query image, textual modification, and target image), which is costly, recent research sidesteps this need by using large-scale vision-language models (VLMs) to perform Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via e.g. CLIP, we achieve modular language reasoning.

![](assets/arch.png "Pipeline for Training-Free CIR using Vision-by-Language")
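
In code, the pipeline boils down to three calls: caption the reference image, let an LLM rewrite that caption according to the modification text, and retrieve with a text-to-image model. Because every stage is modular, swapping the captioner, the LLM, or the retrieval model only changes a single call. The sketch below is a minimal illustration of this flow, not the interface of `src/main.py`: the helpers `caption_image` and `recompose_caption` are hypothetical placeholders, and only the OpenCLIP retrieval part reflects a concrete library API.

```python
# Minimal sketch of the CIReVL flow: caption -> LLM recomposition -> CLIP retrieval.
# caption_image / recompose_caption are hypothetical stubs; only the OpenCLIP part is concrete.
from typing import List

import torch
import open_clip
from PIL import Image


def caption_image(image: Image.Image) -> str:
    """Placeholder for a generative VLM captioner (e.g. BLIP-2 via LAVIS)."""
    raise NotImplementedError


def recompose_caption(caption: str, modification: str) -> str:
    """Placeholder for an LLM call that rewrites the caption according to the modification text."""
    raise NotImplementedError


@torch.no_grad()
def retrieve(query_text: str, gallery: List[Image.Image], clip_name: str = "ViT-B-32") -> torch.Tensor:
    """Rank gallery images by CLIP similarity to the recomposed caption."""
    model, _, preprocess = open_clip.create_model_and_transforms(clip_name, pretrained="openai")
    tokenizer = open_clip.get_tokenizer(clip_name)
    model.eval()

    text_feat = model.encode_text(tokenizer([query_text]))
    image_feats = model.encode_image(torch.stack([preprocess(img) for img in gallery]))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feat.T).squeeze(-1).argsort(descending=True)


# End-to-end, training-free:
# caption = caption_image(reference_image)
# query = recompose_caption(caption, "without people and at night-time")
# ranking = retrieve(query, gallery_images)
```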

On four ZS-CIR benchmarks, we find competitive and in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us both to investigate scaling laws and bottlenecks for ZS-CIR and to scale up, in parts more than doubling previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.

---

## Table of Contents

- [Setting everything up](#setting-everything-up)
- [Required Conda Environment](#required-conda-environment)
- [Required Datasets](#required-datasets)
- [Running and Evaluating CIReVL](#running-and-evaluating-cirevl-on-all-datasets)
- [Citation](#citation)

---

## Setting Everything Up

### Required Conda Environment

After cloning this repository, install the relevant packages using

```sh
conda create -n cirevl -y python=3.8
conda activate cirevl
pip install torch==1.11.0 torchvision==0.12.0 transformers==4.24.0 tqdm termcolor pandas==1.4.2 openai==0.28.0 salesforce-lavis open_clip_torch
pip install git+https://github.com/openai/CLIP.git
```

__Note__ that to use a BLIP(-2) caption model by default, you need access to GPUs that support `bfloat16` (e.g. `A100` types).
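
Before launching longer runs, a quick sanity check along the following lines can help confirm that the packages import and that the GPU supports `bfloat16`. This snippet is only an illustrative suggestion and is not part of the repository.

```python
# Quick environment sanity check (illustrative, not shipped with the repo).
import torch
import open_clip  # noqa: F401  -- fails if open_clip_torch is missing
import lavis      # noqa: F401  -- fails if salesforce-lavis is missing

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
# BLIP(-2) captioning in bfloat16 needs hardware bf16 support (e.g. A100).
print("bfloat16 supported:", torch.cuda.is_available() and torch.cuda.is_bf16_supported())
```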

### Required Datasets

#### FashionIQ

Download the FashionIQ dataset following the instructions in the [**official repository**](https://github.com/XiaoxiaoGuo/fashion-iq).
After downloading the dataset, ensure that the folder structure matches the following:

```
├── FASHIONIQ
│   ├── captions
|   |   ├── cap.dress.[train | val | test].json
|   |   ├── cap.toptee.[train | val | test].json
|   |   ├── cap.shirt.[train | val | test].json

│   ├── image_splits
|   |   ├── split.dress.[train | val | test].json
|   |   ├── split.toptee.[train | val | test].json
|   |   ├── split.shirt.[train | val | test].json

│   ├── images
|   |   ├── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
```
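
A small check like the one below can confirm the layout before running; it is purely illustrative and not part of the repository, and the same idea carries over to the CIRR and CIRCO layouts listed next.

```python
# Illustrative sanity check for the FashionIQ layout above (not part of the repo).
from pathlib import Path


def check_fashioniq(root: Path) -> None:
    for sub in ("captions", "image_splits", "images"):
        assert (root / sub).is_dir(), f"missing folder: {root / sub}"
    for cat in ("dress", "toptee", "shirt"):
        assert (root / "captions" / f"cap.{cat}.val.json").is_file(), f"missing captions for {cat}"
        assert (root / "image_splits" / f"split.{cat}.val.json").is_file(), f"missing split for {cat}"
    print("FashionIQ layout looks complete.")


check_fashioniq(Path("/mnt/datasets_r/FASHIONIQ"))  # path taken from the example run further below
```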

#### CIRR

Download the CIRR dataset following the instructions in the [**official repository**](https://github.com/Cuberick-Orion/CIRR).
After downloading the dataset, ensure that the folder structure matches the following:

```
├── CIRR
│   ├── train
|   |   ├── [0 | 1 | 2 | ...]
|   |   |   ├── [train-10108-0-img0.png | train-10108-0-img1.png | ...]

│   ├── dev
|   |   ├── [dev-0-0-img0.png | dev-0-0-img1.png | ...]

│   ├── test1
|   |   ├── [test1-0-0-img0.png | test1-0-0-img1.png | ...]

│   ├── cirr
|   |   ├── captions
|   |   |   ├── cap.rc2.[train | val | test1].json
|   |   ├── image_splits
|   |   |   ├── split.rc2.[train | val | test1].json
```

#### CIRCO

Download the CIRCO dataset following the instructions in the [**official repository**](https://github.com/miccunifi/CIRCO).
After downloading the dataset, ensure that the folder structure matches the following:

```
├── CIRCO
│   ├── annotations
|   |   ├── [val | test].json

│   ├── COCO2017_unlabeled
|   |   ├── annotations
|   |   |   ├── image_info_unlabeled2017.json
|   |   ├── unlabeled2017
|   |   |   ├── [000000243611.jpg | 000000535009.jpg | ...]
```

#### GeneCIS

---

## Running and Evaluating CIReVL on all Datasets

Exemplary runs to compute all relevant evaluation metrics across all four benchmark datasets are provided in `example_benchmark_runs/example_benchmark_runs.sh`.

For example, to compute retrieval metrics on the Fashion-IQ Dress subset, simply run:

```sh
datapath=/mnt/datasets_r/FASHIONIQ
python src/main.py --dataset fashioniq_dress --split val --dataset-path $datapath --preload img_features captions mods --llm_prompt prompts.structural_modifier_prompt_fashion --clip ViT-B-32
```

This call to `src/main.py` includes the majority of the relevant command-line handles:

```sh
--dataset [name_of_dataset] # Specific dataset to use, such as cirr, circo, fashioniq_dress, fashioniq_shirt (...)
--split [val_or_test] # Compute either validation metrics, or generate a test submission file where needed (cirr, circo).
--dataset-path [path_to_dataset_folder]
--preload [list_of_things_to_save_and_preload_if_available] # One can pass img_features, captions and mods (modified captions). Depending on which is passed, the correspondingly generated img_features, BLIP captions and LLM-modified captions will be stored. If the script is called again with the same parameters, the saved data is loaded instead, which is much quicker. This is particularly useful when switching models (such as the LLM for different modified captions, or the retrieval model via img_features).
--llm_prompt [prompts.name_of_prompt_str] # LLM prompt to use.
--clip [name_of_openclip_model] # OpenCLIP model to use for cross-modal retrieval.
```
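
The caching behaviour behind `--preload` is a compute-or-load pattern: artifacts (image features, captions, modified captions) are written to disk on the first run and read back on identical reruns. The sketch below only illustrates that idea and is not the repository's implementation; the `compute_or_load` helper and its file naming are assumptions, and the `precomputed/` directory name is merely suggested by the `.gitignore` entry in this commit.

```python
# Illustrative compute-or-load helper in the spirit of --preload (not the repo's actual code).
import pickle
from pathlib import Path

# Directory name is an assumption, hinted at by the "precomputed" entry in .gitignore.
CACHE_DIR = Path("precomputed")


def compute_or_load(key, compute_fn):
    """Return the cached artifact for `key` if present, otherwise compute and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{key}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)  # fast path: identical rerun
    result = compute_fn()          # slow path: e.g. encode images or query the LLM
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result


# Hypothetical usage (encode_gallery is a placeholder, not a repo function):
# img_features = compute_or_load("fashioniq_dress_val_ViT-B-32_img_features",
#                                lambda: encode_gallery(...))
```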

---

## Citation

```bibtex
@misc{karthik2023visionbylanguage,
      title={Vision-by-Language for Training-Free Compositional Image Retrieval},
      author={Shyamgopal Karthik and Karsten Roth and Massimiliano Mancini and Zeynep Akata},
      year={2023},
      eprint={2310.09291},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
