👀 ImageScope 👀

Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning

Accepted at WWW 2025

arXiv

Pengfei Luo†, Jingbo Zhou‡, Tong Xu†, Yuan Xia‡, Linli Xu†, Enhong Chen†

† University of Science and Technology of China
‡ Baidu Inc

(Figure: task overview)


🚀 Setup

Environment

Create a virtual environment:

conda create -n ImageScope python=3.10.14
conda activate ImageScope

Install PyTorch:

pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu118

Install other libraries:

pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu118
pip install https://github.com/vllm-project/vllm/releases/download/v0.5.4/vllm-0.5.4+cu118-cp310-cp310-manylinux1_x86_64.whl
pip install -r requirements.txt
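
Optionally, you can run a quick sanity check that the core libraries import and CUDA is visible (a suggestion, not part of the original setup steps):

# Verify that torch, xformers, and vllm import correctly and that a GPU is detected
python -c "import torch, xformers, vllm; print(torch.__version__, torch.cuda.is_available())"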

Datasets

Place all datasets in a folder named data, organized as follows:

./data
├── CIRCO
├── CIRR
├── FashionIQ
├── Flickr30K
├── MSCOCO
└── VisDial

CIRCO

Please follow the instructions in the official CIRCO repository miccunifi/CIRCO to prepare the dataset.

Move the unlabeled2017 folder into CIRCO so that the CIRCO folder structure looks like:

./CIRCO
├── captions
│   ├── val.json
│   └── test.json
└── unlabeled2017
    ├── 000000572834.jpg
    ├── 000000088597.jpg
    ├── 000000386336.jpg
    ├── ...
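
The unlabeled2017 images are the standard COCO 2017 unlabeled set. One possible way to fetch them is sketched below (the URL is the one published on the COCO website; adjust the target path to your setup):

# Download and extract the COCO 2017 unlabeled images into the CIRCO folder
wget http://images.cocodataset.org/zips/unlabeled2017.zip
unzip unlabeled2017.zip -d data/CIRCO/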

CIRR

Please follow the instructions in the official CIRR repository Cuberick-Orion/CIRR to prepare the dataset.

Organize the CIRR folder as follows:

./CIRR
├── LICENSE
├── README.md
├── captions_ext
│   ├── cap.ext.rc2.test1.json
│   ├── cap.ext.rc2.train.json
│   ├── cap.ext.rc2.val.json
├── image_splits
│   ├── split.rc2.val.json
│   ├── split.rc2.train.json
│   ├── split.rc2.test1.json
├── dev
│   ├── dev-841-3-img0.png
│   ├── dev-30-2-img1.png
│   ├── dev-954-2-img1.png
│   ├── ...
├── captions
│   ├── cap.rc2.train.json
│   ├── ._cap.rc2.val.json
│   ├── cap.rc2.val.json
│   ├── ...
└── test1
    ├── test1-1005-3-img0.png
    ├── test1-400-0-img1.png
    ├── test1-718-0-img0.png
    ├── ...

FashionIQ

Download and extract the files from 🤗 HuggingFace - Plachta/FashionIQ, and organize the FashionIQ folder like:

./FashionIQ
├── image_splits
│   ├── split.dress.val.json
│   ├── split.toptee.val.json
│   ├── split.dress.train.json
│   ├── ...
├── captions
│   ├── cap.shirt.test.json
│   ├── cap.shirt.val.json
│   ├── cap.toptee.test.json
│   ├── ...
└── images
    ├── B0088D23WY.png
    ├── B000QB12QY.png
    ├── B001I90CD2.png
    ├── ...
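
One way to fetch the files is with the huggingface_hub CLI, as sketched below (assumes huggingface-cli is installed and that the repository is hosted as a dataset repo; adjust --repo-type otherwise, and rearrange the extracted files to match the layout above). The same pattern applies to the Flickr30K and MSCOCO datasets below.

# Download the FashionIQ dataset repository into data/FashionIQ
huggingface-cli download Plachta/FashionIQ --repo-type dataset --local-dir data/FashionIQ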

Flickr30K

Download and extract the files from 🤗 HuggingFace - nlphuji/flickr_1k_test_image_text_retrieval, and organize the Flickr30K folder like:

./Flickr30K
├── README.md
├── test_1k_flickr.csv
├── images_flickr_1k_test.zip
└── images
    ├── 2847514745.jpg
    ├── 4689169924.jpg
    ├── 2088705195.jpg
    ├── ...

MSCOCO

Download and extract the files from 🤗 HuggingFace - nlphuji/mscoco_2014_5k_test_image_text_retrieval, and organize the MSCOCO folder like:

./MSCOCO
├── README.md
├── test_5k_mscoco_2014.csv
├── mscoco_2014_5k_test_image_text_retrieval.py
├── images_mscoco_2014_5k_test.zip
├── .gitattributes
└── images
    ├── COCO_val2014_000000466052.jpg
    ├── COCO_val2014_000000335631.jpg
    ├── COCO_val2014_000000297972.jpg
    ├── ...

VisDial

Obtain Protocal/Search_Space_val_50k.json and dialogues/VisDial_v1.0_queries_val.json from the Saehyung-Lee/PlugIR repository. Download the COCO 2017 Unlabeled Images, place the downloaded files in the VisDial folder, and organize it as follows:

./VisDial
├── Search_Space_val_50k.json
├── VisDial_v1.0_queries_val.json
└── unlabeled2017
    ├── 000000572834.jpg
    ├── 000000088597.jpg
    ├── 000000386336.jpg
    ├── ...

Once you have completed these steps, the datasets are ready for use.

Models

Download the pre-trained model weights from the links provided below.

Role       Model                            Link
Captioner  LLaVA-v1.6-vicuna-7B             🤗 llava-hf/llava-v1.6-vicuna-7b-hf
Reasoner   LLaMA3-8B-Instruct               🤗 meta-llama/Meta-Llama-3-8B-Instruct
Verifier   PaliGemma-3B-mix-224             🤗 google/paligemma-3b-mix-224
Evaluator  InternVL2-8B                     🤗 OpenGVLab/InternVL2-8B
VLM        CLIP-ViT-B-32-laion2B-s34B-b79K  🤗 laion/CLIP-ViT-B-32-laion2B-s34B-b79K
VLM        CLIP-ViT-L-14-laion2B-s32B-b82K  🤗 laion/CLIP-ViT-L-14-laion2B-s32B-b82K

You can place the downloaded weights in a directory of your choice, and specify the path to the models in the configuration or script when running the pipeline.
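
For example, one of the checkpoints can be fetched with the huggingface_hub CLI, as in the sketch below (the ./models/... target directory is just an illustrative choice):

# Download the captioner weights to a local directory of your choice
huggingface-cli download llava-hf/llava-v1.6-vicuna-7b-hf --local-dir ./models/llava-v1.6-vicuna-7b-hf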

πŸ“ Inference and Evaluation

Inference

To run inference on a specific dataset, edit the corresponding script script/run_{dataset_name}.sh and replace the model path placeholders with your actual model paths. Once updated, execute bash script/run_{dataset_name}.sh to start the inference process.

By default, the script utilizes all available GPUs. If you wish to restrict GPU usage, manually configure the CUDA_VISIBLE_DEVICES environment variable. On the first run, the pipeline will automatically create an image_db directory to store image captions and embeddings for retrieval purposes.
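
For example, to limit a run to the first two GPUs (assuming a CIRR script named script/run_cirr.sh, following the run_{dataset_name}.sh pattern above):

# Restrict the pipeline to GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 bash script/run_cirr.sh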

Note: For the CIRR subset setting, you need to include the --subset flag in the command within the script.

Evaluation

After completing the inference, evaluation metrics (for FashionIQ, Flickr30K, MSCOCO, and VisDial) or submission files (for CIRCO and CIRR) can be found in the runs folder. Metrics are logged in files located at runs/{dataset_name}/{runs_name}/{clip_version}-{timestamp}/output.log, while prediction results are saved as JSON files at runs/{dataset_name}/{runs_name}/{clip_version}-{timestamp}/{timestamp}_{dataset_name}_test_stage3_eval.json. You can submit these JSON files to the CIRR Evaluation Server or CIRCO Evaluation Server to obtain the final evaluation results.

Experimental Results

(Figure: experimental results on CIR benchmarks)

(Figure: experimental results on TIR benchmarks)

📚 Citation

If you find our paper and code useful in your research, please cite them as follows:

@inproceedings{luo2025imagescope,
  title={ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning},
  author={Luo, Pengfei and Zhou, Jingbo and Xu, Tong and Xia, Yuan and Xu, Linli and Chen, Enhong},
  booktitle={The Web Conference 2025},
  year={2025}
}
