i-CIR: Instance-Level Composed Image Retrieval (NeurIPS 2025)

Bill Psomas1†, George Retsinas2†, Nikos Efthymiadis1, Panagiotis Filntisis2,4
Yannis Avrithis, Petros Maragos2,3,4, Ondrej Chum1, Giorgos Tolias1

1Visual Recognition Group, FEE, Czech Technical University in Prague 2Robotics Institute, Athena Research Center
3National Technical University of Athens 4HERON - Hellenic Robotics Center of Excellence


Official implementation of our Baseline Approach for SurprIsingly strong Composition (BASIC) and the instance-level composed image retrieval (i-CIR) dataset.

TL;DR: We introduce BASIC, a training-free VLM-based method that centers and projects image embeddings, and i-CIR, a well-curated, instance-level composed image retrieval benchmark with rich hard negatives that is compact yet challenging.

Contents

  1. News
  2. Overview
  3. Download the i-CIR dataset
  4. Installation
  5. Quick Start
  6. Methods
  7. Key Parameters
  8. Corpus Files
  9. Output
  10. Results
  11. Project Structure
  12. Citation
  13. License
  14. Acknowledgments
  15. Contact

News

  • 20/12/2025: 🤗 Hugging Face WebDataset format is now supported. You can find i-CIR [here].
  • 5/12/2025: i-CIR is presented at NeurIPS 2025! 🎉 Check out the [poster].

Overview

This repository contains a clean implementation for performing composed image retrieval (CIR) on the i-CIR dataset using vision-language models (CLIP/SigLIP).

Method (BASIC)

Our BASIC method decomposes multimodal queries into object and style components through:

  1. Feature Standardization: Centering features using LAION-1M statistics
  2. Contrastive PCA Projection: Separating information using positive and negative text corpora
  3. Query Expansion: Refining queries with top-k similar database images
  4. Harris Corner Fusion: Combining image and text similarities with geometric weighting
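For intuition, the centering and contrastive-PCA steps (1–2) can be sketched as follows. This is an illustrative outline under our own assumptions: the variable names (laion_mean, pos_corpus_feats, neg_corpus_feats) and the exact eigendecomposition details are ours, not the repository's implementation (see utils_retrieval.py for the actual code).

# Illustrative sketch of steps 1-2 (centering + contrastive PCA), not the
# repository's exact implementation. All names below are assumptions.
import numpy as np

def contrastive_pca_projection(feats, pos_corpus_feats, neg_corpus_feats,
                               laion_mean, n_components=250, aa=0.2):
    # 1. Feature standardization: center everything with the LAION mean
    pos = pos_corpus_feats - laion_mean
    neg = neg_corpus_feats - laion_mean
    x = feats - laion_mean

    # 2. Contrastive PCA: keep directions that explain the positive (object)
    #    corpus more than the negative (style) corpus, weighted by aa
    c_pos = pos.T @ pos / len(pos)
    c_neg = neg.T @ neg / len(neg)
    eigvals, eigvecs = np.linalg.eigh(c_pos - aa * c_neg)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]

    # Project and re-normalize the query/database features
    proj = x @ top
    return proj / np.linalg.norm(proj, axis=1, keepdims=True)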


Dataset

Well-curated

i-CIR is an instance-level composed image retrieval benchmark where each instance is a specific object (e.g., the Temple of Poseidon) that must be distinguished from visually similar ones. Each query composes an image of the instance with a text modification. For every instance we curate a shared database and define composed positives plus a rich set of hard negatives: visual (same or similar object, wrong text), textual (correct text semantics, different instance, often of the same category), and composed (nearly matches both parts but fails one).


Compact but hard

Built by combining human curation with automated retrieval from LAION, followed by filtering (quality, duplicates, PII) and manual verification of positives and hard negatives, i-CIR is compact yet challenging: for simple baselines it is as hard as searching among >40M distractor images, while keeping per-query databases manageable. Key stats:

  • Instances: 202
  • Total images: ~750K
  • Composed queries: 1,883
  • Image queries / instance: 1–46
  • Text queries / instance: 1–5
  • Positives / composed query: 1–127
  • Hard negatives / instance: 951–10,045
  • Avg database size / query: ~3.7K images

Truly compositional

Performance peaks at interior text–image fusion weights ($\lambda$) and shows large composition gains over the best uni-modal baselines—evidence that both modalities must work together.


Download the i-CIR dataset

i-CIR is available in two equivalent formats:

Option A — Direct tarball (local folder layout)

i-CIR is stored here.

# Download 
wget https://vrg.fel.cvut.cz/icir/icir_v1.0.0.tar.gz -O icir_v1.0.0.tar.gz
# Extract
tar -xzf icir_v1.0.0.tar.gz
# Verify
sha256sum -c icir_v1.0.0.sha256   # should print OK

Resulting layout (folder-based):

icir/
├── database/
├── query/
├── database_files.csv
├── query_files.csv
├── VERSION.txt
├── LICENSE
└── checksums.sha256

Option B — Hugging Face Hub (WebDataset shards)

You can also download i-CIR directly from the Hugging Face Hub as WebDataset tar shards (recommended for more robust downloading).

CLI:

# Install HF tooling
pip install -U huggingface_hub

# (Optional) login if the repo is gated/private
huggingface-cli login

# Download the dataset snapshot locally
huggingface-cli download billpsomas/icir \
  --repo-type dataset \
  --local-dir ./data/icir \
  --revision main

Python (equivalent):

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="billpsomas/icir",
    repo_type="dataset",
    revision="main",
    local_dir="./data/icir",
)
print("Downloaded to:", local_dir)

Resulting layout (WebDataset-based):

icir/
├── webdataset/
│   ├── query/
│   │   ├── query-000000.tar
│   │   ├── query-000001.tar
│   │   └── ...
│   └── database/
│       ├── database-000000.tar
│       ├── database-000001.tar
│       └── ...
├── annotations/
│   ├── query_files.csv
│   └── database_files.csv
├── VERSION.txt
└── LICENSE

You do not need to extract images into database/ and query/ folders for this option; feature extraction reads directly from the WebDataset shards.
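If you want to inspect the shards directly, a minimal sketch with the webdataset package is shown below; the "jpg" field name is an assumption about how images are stored inside the shards.

# Peek at one shard with the webdataset package (illustrative; the "jpg"
# key is an assumption about the shard contents).
import webdataset as wds

shard = "./data/icir/webdataset/database/database-000000.tar"
dataset = wds.WebDataset(shard).decode("pil").to_tuple("jpg", "__key__")

for image, key in dataset:
    print(key, image.size)
    break  # just inspect the first sample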

Installation

Requirements

  • Python 3.9+
  • PyTorch 2.0+
  • CUDA-capable GPU (recommended)
  • (Optional, for Hugging Face / WebDataset mode) huggingface_hub + webdataset

Setup

# Clone the repository
git clone https://github.com/billpsomas/icir.git
cd icir

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Quick Start

1. Prepare Data

Ensure you have the following structure:

icir/
├── data/
│   ├── icir/                       # i-CIR dataset (local folder layout or WebDataset shards)
│   └── laion_mean/                 # Pre-computed LAION means
├── corpora/
│   ├── generic_subjects.csv        # Positive corpus (objects)
│   └── generic_styles.csv          # Negative corpus (styles)
└── synthetic_data/                 # Score normalization data
    ├── dataset_1_sd_clip.pkl.npy
    └── dataset_1_sd_siglip.pkl.npy

2. Extract Features

Extract features for the i-CIR dataset and text corpora:

# Extract i-CIR dataset features (local folder layout)
python3 create_features.py --dataset icir --icir_source folder --backbone clip --batch 512 --gpu 0

# Extract i-CIR dataset features (WebDataset shards)
python3 create_features.py --dataset icir --icir_source wds --backbone clip --batch 512 --gpu 0

# Extract corpus features
python3 create_features.py --dataset corpus --backbone clip --batch 512 --gpu 0

Features will be saved to features/{backbone}_features/.

3. Run Retrieval

The easiest way is to use method presets with --use_preset:

# Full BASIC method (recommended)
python3 run_retrieval.py --method basic --use_preset

# Baseline methods
python3 run_retrieval.py --method sum --use_preset
python3 run_retrieval.py --method product --use_preset
python3 run_retrieval.py --method image --use_preset
python3 run_retrieval.py --method text --use_preset

For advanced usage with custom parameters:

python3 run_retrieval.py \
  --method basic \
  --backbone clip \
  --dataset icir \
  --results_dir results/ \
  --specified_corpus generic_subjects \
  --specified_ncorpus generic_styles \
  --num_principal_components_for_projection 250 \
  --aa 0.2 \
  --standardize_features \
  --use_laion_mean \
  --project_features \
  --do_query_expansion \
  --contextualize \
  --normalize_similarities \
  --path_to_synthetic_data ./synthetic_data \
  --harris_lambda 0.1

Methods

The codebase implements several retrieval methods:

  • basic: Full decomposition method with all components (PCA projection, query expansion, Harris fusion)
  • sum: Simple sum of image and text similarities
  • product: Simple product of image and text similarities
  • image: Image-only retrieval (ignores text)
  • text: Text-only retrieval (ignores image)
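For the simple baselines, fusion boils down to elementwise operations on the image and text similarity matrices. The sketch below mirrors the method names above but is our own illustration: img_sim and txt_sim are assumed (num_queries, num_database) cosine-similarity arrays, and the shift in the product branch is our assumption, not the repository's exact code.

# Illustrative fusion of image and text similarities for the baseline methods.
import numpy as np

def fuse(img_sim, txt_sim, method="sum"):
    if method == "sum":
        return img_sim + txt_sim
    if method == "product":
        # shifting keeps both factors non-negative before multiplying
        return (img_sim + 1.0) * (txt_sim + 1.0)
    if method == "image":
        return img_sim
    if method == "text":
        return txt_sim
    raise ValueError(f"unknown method: {method}")

# toy usage
rng = np.random.default_rng(0)
scores = fuse(rng.random((2, 5)), rng.random((2, 5)), method="product")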

Key Parameters

  • --method: Retrieval method (basic, sum, product, image, text)
  • --backbone: Vision-language model (clip for ViT-L/14, siglip for ViT-L-16-SigLIP-256)
  • --use_preset: Use predefined method configurations (recommended)
  • --specified_corpus: Positive corpus for projection (default: generic_subjects)
  • --specified_ncorpus: Negative corpus for projection (default: generic_styles)
  • --num_principal_components_for_projection: PCA components, >1 for exact count or <1 for energy threshold (default: 250)
  • --aa: Negative corpus weight in contrastive PCA (default: 0.2)
  • --harris_lambda: Harris fusion parameter (default: 0.1)
  • --contextualize: Add corpus objects to the text query to contextualize the query
  • --standardize_features: Center features before projection
  • --use_laion_mean: Use pre-computed LAION mean for centering
  • --project_features: Apply PCA projection
  • --do_query_expansion: Expand queries with retrieved images
  • --normalize_similarities: Apply score normalization using synthetic data
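As a concrete reading of --num_principal_components_for_projection (values above 1 as an exact component count, values below 1 as an explained-energy threshold), here is a small hypothetical helper; it is our interpretation of the flag, not code from the repository.

# Hypothetical helper: interpret the component-count flag. > 1 means an exact
# number of components, < 1 means keep enough components to reach that
# fraction of the spectrum's energy.
import numpy as np

def resolve_num_components(eigvals, value=250):
    eigvals = np.sort(eigvals)[::-1]                 # largest first
    if value > 1:
        return int(value)
    energy = np.cumsum(eigvals) / np.sum(eigvals)    # cumulative energy
    return int(np.searchsorted(energy, value) + 1)

# toy usage: keep 90% of the energy of a small fake spectrum -> 3 components
print(resolve_num_components(np.array([5.0, 3.0, 1.0, 0.5, 0.5]), value=0.9))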

Corpus Files

Text corpora define semantic spaces for PCA projection:

  • generic_subjects.csv: General object/subject descriptions (positive corpus)
  • generic_styles.csv: General style/attribute descriptions (negative corpus)

Corpora are CSV files with a single column of text descriptions, loaded from the corpora/ directory.
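Loading a corpus is just reading that single column; a minimal sketch follows (whether the CSV has a header row is an assumption on our part).

# Minimal sketch: read one corpus file from corpora/ (single text column).
import csv

with open("corpora/generic_subjects.csv", newline="") as f:
    subjects = [row[0] for row in csv.reader(f) if row]

print(len(subjects), subjects[:3])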

Output

Results are saved to the specified results directory (default: results/):

results/
└── icir/
    └── {method_variant}/
        └── mAP_table.csv          # Mean Average Precision results

Each result file includes:

  • mAP score for the retrieval method
  • Configuration parameters used (for basic method only)
  • Timestamp of the experiment
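To inspect a run programmatically, you can read the CSV directly; the "basic" directory name below is a hypothetical stand-in for whatever {method_variant} your run produced.

# Read a results file; "basic" is a placeholder for your {method_variant}.
import csv

with open("results/icir/basic/mAP_table.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)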

Results (mAP %)

Method          ImageNet-R    NICO   Mini-DN    LTLL   i-CIR
Text                  0.74    1.09      0.57    5.72    3.01
Image                 3.84    6.32      6.66   16.49    3.04
Text + Image          6.21    9.30      9.33   17.86    8.20
Text × Image          7.83    9.79      9.86   23.16   17.48
WeiCom               10.47   10.54      8.52   26.60   18.03
PicWord               7.88    9.76     12.00   21.27   19.36
CompoDiff            12.88   10.32     22.95   21.61    9.63
CIReVL               18.11   17.80     26.20   32.60   18.66
Searle               14.04   15.13     21.78   25.46   19.90
MCL                   8.13   19.09     18.41   16.67   19.89
MagicLens             9.13   19.66     20.06   24.21   27.35
CoVR                 11.52   24.93     27.76   24.68   28.50
FREEDOM              29.91   26.10     37.27   33.24   17.24
FREEDOM†             25.81   23.24     32.14   30.82   15.76
BASIC                32.13   31.65     39.58   41.38   31.64
BASIC†               27.54   28.90     35.75   38.22   34.35

† Without query expansion.

Project Structure

icir/
├── run_retrieval.py           # Main retrieval script
├── create_features.py         # Feature extraction script
├── utils.py                   # General utilities (device setup, text processing, evaluation)
├── utils_features.py          # Feature I/O and model loading
├── utils_retrieval.py         # Core retrieval algorithms
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── LICENSE                    # MIT License
├── data/                      # Dataset and normalization data
├── corpora/                   # Text corpus files
├── features/                  # Extracted features (generated)
└── results/                   # Retrieval results (generated)

Citation

If you find BASIC and/or i-CIR useful in your research, please consider starring ⭐ the repository on GitHub and citing 📚 our paper!

@inproceedings{
    psomas2025instancelevel,
    title={Instance-Level Composed Image Retrieval},
    author={Bill Psomas and George Retsinas and Nikos Efthymiadis and Panagiotis Filntisis and Yannis Avrithis and Petros Maragos and Ondrej Chum and Giorgos Tolias},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025}
}

License

  • This code is licensed under the MIT License - see the LICENSE file for details.
  • The dataset is licensed under the CC BY-NC-SA License - see the dataset's LICENSE file for details.

Acknowledgments

  • Vision-language models via OpenCLIP
  • LAION-1M statistics for feature standardization

Contact

For questions or issues, please open an issue on GitHub or contact Bill at vasileios.psomas@fel.cvut.cz.
