Pengfei Luo†, Jingbo Zhou‡, Tong Xu†, Yuan Xia‡, Linli Xu†, Enhong Chen†
† University of Science and Technology of China
‡ Baidu Inc
Create a virtual environment:
conda create -n ImageScope python=3.10.14
conda activate ImageScope

Install PyTorch:
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu118

Install other libraries:
pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu118
pip install https://github.com/vllm-project/vllm/releases/download/v0.5.4/vllm-0.5.4+cu118-cp310-cp310-manylinux1_x86_64.whl
pip install -r requirements.txt
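Optionally, you can verify that the CUDA builds were picked up (a quick sanity check, not part of the official setup; with the versions above it should report torch 2.4.0+cu118 with CUDA available and vllm 0.5.4):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import vllm; print(vllm.__version__)"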
Put all datasets in a folder named data, organized as follows:
./data
├── CIRCO
├── CIRR
├── FashionIQ
├── Flickr30K
├── MSCOCO
└── VisDial

Please follow the instructions in the CIRCO official repository miccunifi/CIRCO to prepare the dataset.
Move the unlabeled2017 folder into CIRCO; the CIRCO folder structure should look like:
./CIRCO
├── captions
│   ├── val.json
│   └── test.json
└── unlabeled2017
    ├── 000000572834.jpg
    ├── 000000088597.jpg
    ├── 000000386336.jpg
    └── ...

Please follow the instructions in the CIRR official repository Cuberick-Orion/CIRR to prepare the dataset.
Make the CIRR folder structure look as follows:
./CIRR
├── LICENSE
├── README.md
├── captions_ext
│   ├── cap.ext.rc2.test1.json
│   ├── cap.ext.rc2.train.json
│   └── cap.ext.rc2.val.json
├── image_splits
│   ├── split.rc2.val.json
│   ├── split.rc2.train.json
│   └── split.rc2.test1.json
├── dev
│   ├── dev-841-3-img0.png
│   ├── dev-30-2-img1.png
│   ├── dev-954-2-img1.png
│   └── ...
├── captions
│   ├── cap.rc2.train.json
│   ├── cap.rc2.val.json
│   └── ...
└── test1
    ├── test1-1005-3-img0.png
    ├── test1-400-0-img1.png
    ├── test1-718-0-img0.png
    └── ...

Download and extract files from 🤗 HuggingFace - Plachta/FashionIQ, and organize the FashionIQ folder like:
./FashionIQ
├── image_splits
│   ├── split.dress.val.json
│   ├── split.toptee.val.json
│   ├── split.dress.train.json
│   └── ...
├── captions
│   ├── cap.shirt.test.json
│   ├── cap.shirt.val.json
│   ├── cap.toptee.test.json
│   └── ...
└── images
    ├── B0088D23WY.png
    ├── B000QB12QY.png
    ├── B001I90CD2.png
    └── ...

Download and extract files from 🤗 HuggingFace - nlphuji/flickr_1k_test_image_text_retrieval, and organize the Flickr30K folder like:
./Flickr30K
├── README.md
├── test_1k_flickr.csv
├── images_flickr_1k_test.zip
└── images
    ├── 2847514745.jpg
    ├── 4689169924.jpg
    ├── 2088705195.jpg
    └── ...

Download and extract files from 🤗 HuggingFace - nlphuji/mscoco_2014_5k_test_image_text_retrieval, and organize the MSCOCO folder like:
./MSCOCO
├── README.md
├── test_5k_mscoco_2014.csv
├── mscoco_2014_5k_test_image_text_retrieval.py
├── images_mscoco_2014_5k_test.zip
├── .gitattributes
└── images
    ├── COCO_val2014_000000466052.jpg
    ├── COCO_val2014_000000335631.jpg
    ├── COCO_val2014_000000297972.jpg
    └── ...

Obtain Protocal/Search_Space_val_50k.json and dialogues/VisDial_v1.0_queries_val.json from the Saehyung-Lee/PlugIR repository, and download the COCO 2017 Unlabeled Images. Place the downloaded files in the VisDial folder and organize it as follows:
./VisDial
├── Search_Space_val_50k.json
├── VisDial_v1.0_queries_val.json
└── unlabeled2017
    ├── 000000572834.jpg
    ├── 000000088597.jpg
    ├── 000000386336.jpg
    └── ...
Once you have completed these steps, your datasets are ready for use.
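Optionally, you can sanity-check the layout before running anything. This is a minimal sketch assuming the trees above and that the data folder sits in the repository root; adjust the paths if you keep the datasets elsewhere:
ls data/CIRCO/captions/val.json data/CIRR/captions/cap.rc2.val.json data/FashionIQ/captions/cap.shirt.val.json
ls data/Flickr30K/images | wc -l    # expect 1,000 test images
ls data/MSCOCO/images | wc -l       # expect 5,000 test images
ls data/VisDial/unlabeled2017 | wc -l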
Download the pre-trained model weights from the links provided below.
| Role | Model | Link |
|---|---|---|
| Captioner | LLaVA-v1.6-vicuna-7B | 🤗 llava-hf/llava-v1.6-vicuna-7b-hf |
| Reasoner | LLaMA3-8B-Instruct | 🤗 meta-llama/Meta-Llama-3-8B-Instruct |
| Verifier | PaliGemma-3B-mix-224 | 🤗 google/paligemma-3b-mix-224 |
| Evaluator | InternVL2-8B | 🤗 OpenGVLab/InternVL2-8B |
| VLM | CLIP-ViT-B-32-laion2B-s34B-b79K, CLIP-ViT-L-14-laion2B-s32B-b82K | 🤗 laion/CLIP-ViT-B-32-laion2B-s34B-b79K, 🤗 laion/CLIP-ViT-L-14-laion2B-s32B-b82K |
You can place the downloaded weights in a directory of your choice, and specify the path to the models in the configuration or script when running the pipeline.
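For example, one way to fetch the weights is with the huggingface_hub CLI (a sketch only; the ./checkpoints target directory is an arbitrary choice, and gated models such as Meta-Llama-3-8B-Instruct and PaliGemma may require accepting their licenses and running huggingface-cli login first):
pip install -U "huggingface_hub[cli]"
huggingface-cli download llava-hf/llava-v1.6-vicuna-7b-hf --local-dir ./checkpoints/llava-v1.6-vicuna-7b-hf
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./checkpoints/Meta-Llama-3-8B-Instruct
huggingface-cli download google/paligemma-3b-mix-224 --local-dir ./checkpoints/paligemma-3b-mix-224
huggingface-cli download OpenGVLab/InternVL2-8B --local-dir ./checkpoints/InternVL2-8B
huggingface-cli download laion/CLIP-ViT-B-32-laion2B-s34B-b79K --local-dir ./checkpoints/CLIP-ViT-B-32-laion2B-s34B-b79K
huggingface-cli download laion/CLIP-ViT-L-14-laion2B-s32B-b82K --local-dir ./checkpoints/CLIP-ViT-L-14-laion2B-s32B-b82K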
To run inference on a specific dataset, edit the corresponding script script/run_{dataset_name}.sh and replace the model path placeholders with the actual paths to your downloaded weights. Once updated, execute bash script/run_{dataset_name}.sh to start inference.
By default, the script utilizes all available GPUs. If you wish to restrict GPU usage, manually configure the CUDA_VISIBLE_DEVICES environment variable. On the first run, the pipeline will automatically create an image_db directory to store image captions and embeddings for retrieval purposes.
Note: For the CIRR subset setting, you need to include the --subset flag in the command within the script.
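For example, a FashionIQ run restricted to two GPUs might look like the following (the script file name here is assumed from the run_{dataset_name}.sh pattern above; check the script/ folder for the actual names, and omit CUDA_VISIBLE_DEVICES to use all GPUs):
CUDA_VISIBLE_DEVICES=0,1 bash script/run_fashioniq.sh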
After completing the inference, evaluation metrics (for FashionIQ, Flickr30K, MSCOCO, and VisDial) or submission files (for CIRCO and CIRR) can be found in the runs folder. Metrics are logged in files located at runs/{dataset_name}/{runs_name}/{clip_version}-{timestamp}/output.log, while prediction results are saved as JSON files at runs/{dataset_name}/{runs_name}/{clip_version}-{timestamp}/{timestamp}_{dataset_name}_test_stage3_eval.json. You can submit these JSON files to the CIRR Evaluation Server or CIRCO Evaluation Server to obtain the final evaluation results.
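For instance, to print the metrics of the most recent run (a sketch following the path pattern above; the directory names depend on your dataset, run name, CLIP version, and timestamp):
ls -t runs/*/*/*/output.log | head -n 1 | xargs cat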
If you find our paper and code useful in your research, please cite us as follows:
@inproceedings{luo2025imagescope,
title={ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning},
author={Luo, Pengfei and Zhou, Jingbo and Xu, Tong and Xia, Yuan and Xu, Linli and Chen, Enhong},
booktitle={The Web Conference 2025},
year={2025}
}

