Bill Psomas1†, George Retsinas2†, Nikos Efthymiadis1, Panagiotis Filntisis2,4
Yannis Avrithis, Petros Maragos2,3,4, Ondrej Chum1, Giorgos Tolias1
1Visual Recognition Group, FEE, Czech Technical University in Prague 2Robotics Institute, Athena Research Center
3National Technical University of Athens 4HERON - Hellenic Robotics Center of Excellence
Official implementation of our Baseline Approach for SurprIsingly strong Composition (BASIC) and the instance-level composed image retrieval (i-CIR) dataset.
TL;DR: We introduce BASIC, a training-free VLM-based method that centers and projects image embeddings, and i-CIR, a well-curated, instance-level composed image retrieval benchmark with rich hard negatives that is compact yet challenging.
- News
- Overview
- Download the i-CIR dataset
- Installation
- Quick Start
- Methods
- Key Parameters
- Corpus Files
- Output
- Results
- Project Structure
- Citation
- License
- Acknowledgments
- Contact
- 20/12/2025: 🤗 Hugging Face WebDataset format is now supported. You can now find i-CIR [here].
- 5/12/2025: i-CIR is presented at NeurIPS 2025! 🎉 Check out the [poster].
This repository contains a clean implementation for performing composed image retrieval (CIR) on the i-CIR dataset using vision-language models (CLIP/SigLIP).
Our BASIC method decomposes multimodal queries into object and style components through the following steps (a minimal sketch follows the list):
- Feature Standardization: Centering features using LAION-1M statistics
- Contrastive PCA Projection: Separating information using positive and negative text corpora
- Query Expansion: Refining queries with top-k similar database images
- Harris Corner Fusion: Combining image and text similarities with geometric weighting
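At a glance, these steps amount to the sketch below. It is purely illustrative and assumes pre-extracted unit-norm embeddings; the variable names, the exact contrastive-PCA formulation, and the Harris-style fusion expression are assumptions, not the code in utils_retrieval.py.

import numpy as np

def l2n(x):
    # L2-normalize along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_pca(pos_corpus, neg_corpus, k=250, aa=0.2):
    # Directions that explain the positive (subject) corpus more than the negative (style) one.
    c_pos = np.cov(pos_corpus, rowvar=False)   # (D, D) covariance of subject text embeddings
    c_neg = np.cov(neg_corpus, rowvar=False)   # (D, D) covariance of style text embeddings
    w, v = np.linalg.eigh(c_pos - aa * c_neg)  # symmetric eigendecomposition
    return v[:, np.argsort(w)[::-1][:k]]       # projection matrix P, shape (D, k)

def basic_scores(q_img, q_txt, db, laion_mean, P, harris_lambda=0.1):
    # 1) Feature standardization: center image features with LAION-1M statistics.
    q_img_c, db_c = l2n(q_img - laion_mean), l2n(db - laion_mean)
    # 2) Contrastive PCA projection: keep the object subspace of the image features.
    q_img_p, db_p = l2n(q_img_c @ P), l2n(db_c @ P)
    s_img = db_p @ q_img_p        # image-to-database similarities, shape (N,)
    s_txt = l2n(db) @ l2n(q_txt)  # text-to-database similarities, shape (N,)
    # 4) Harris-style fusion, by analogy with the corner response det - lambda * trace^2.
    return s_img * s_txt - harris_lambda * (s_img + s_txt) ** 2

# 3) Query expansion would average the query with its top-k retrieved database
#    features and re-score; omitted here for brevity.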
i-CIR is an instance-level composed image retrieval benchmark where each instance is a specific, visually indistinguishable object (e.g., Temple of Poseidon). Each query composes an image of the instance with a text modification. For every instance we curate a shared database and define composed positives plus a rich set of hard negatives—visual (same/similar object, wrong text), textual (right text semantics, different instance—often same category), and composed (nearly matches both parts but fails one).
Built by combining human curation with automated retrieval from LAION, followed by filtering (quality/duplicates/PII) and manual verification of positives and hard negatives, i-CIR is compact yet challenging: it rivals searching with >40M distractor images for simple baselines, while keeping per-query databases manageable. Key stats:
- Instances: 202
- Total images: ~750K
- Composed queries: 1,883
- Image queries / instance: 1–46
- Text queries / instance: 1–5
- Positives / composed query: 1–127
- Hard negatives / instance: 951–10,045
- Avg database size / query: ~3.7K images
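For intuition, one composed query can be pictured as the record below. This is purely illustrative; the actual annotation schema lives in query_files.csv and database_files.csv, and the field names here are assumptions.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ComposedQuery:
    # Illustrative view of one i-CIR composed query; field names are assumptions.
    instance: str     # e.g., "Temple of Poseidon"
    image_query: str  # path to a query image depicting the instance
    text_query: str   # textual modification applied to the instance
    positives: List[str] = field(default_factory=list)  # images matching both parts
    database: List[str] = field(default_factory=list)   # shared per-instance database,
                                                         # including visual, textual, and
                                                         # composed hard negatives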
Performance peaks at interior text–image fusion weights.
i-CIR is available in two equivalent formats: a directly downloadable tar archive (folder layout) and Hugging Face WebDataset shards.
The folder-layout archive is stored here.
# Download
wget https://vrg.fel.cvut.cz/icir/icir_v1.0.0.tar.gz -O icir_v1.0.0.tar.gz
# Extract
tar -xzf icir_v1.0.0.tar.gz
# Verify
sha256sum -c icir_v1.0.0.sha256  # should print OK
Resulting layout (folder-based):
icir/
├── database/
├── query/
├── database_files.csv
├── query_files.csv
├── VERSION.txt
├── LICENSE
└── checksums.sha256
You can also download i-CIR directly from the Hugging Face Hub as WebDataset tar shards (recommended for more robust downloading).
CLI:
# Install HF tooling
pip install -U huggingface_hub
# (Optional) login if the repo is gated/private
huggingface-cli login
# Download the dataset snapshot locally
huggingface-cli download billpsomas/icir \
--repo-type dataset \
--local-dir ./data/icir \
--revision main
Python (equivalent):
from huggingface_hub import snapshot_download
local_dir = snapshot_download(
repo_id="billpsomas/icir",
repo_type="dataset",
revision="main",
local_dir="./data/icir",
)
print("Downloaded to:", local_dir)
Resulting layout (WebDataset-based):
icir/
├── webdataset/
│ ├── query/
│ │ ├── query-000000.tar
│ │ ├── query-000001.tar
│ │ └── ...
│ └── database/
│ ├── database-000000.tar
│ ├── database-000001.tar
│ └── ...
├── annotations/
│ ├── query_files.csv
│   └── database_files.csv
├── VERSION.txt
└── LICENSE
You do not need to extract images to a database/ and query/ folder for this option; feature extraction reads directly from the WebDataset shards.
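If you want to peek inside the shards yourself (outside of create_features.py), a minimal sketch using the webdataset package is shown below; the per-sample keys depend on how the shards were packed, so print them before assuming specific field names.

import webdataset as wds

# Point at one local query shard; adjust the path to your download location.
shard = "data/icir/webdataset/query/query-000000.tar"

for sample in wds.WebDataset(shard):
    # Each sample is a dict whose keys mirror the file extensions inside the tar
    # (e.g., image bytes, metadata), plus the special "__key__" entry.
    print(sample["__key__"], list(sample.keys()))
    break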
- Python 3.9+
- PyTorch 2.0+
- CUDA-capable GPU (recommended)
- (Optional, for Hugging Face / WebDataset mode) huggingface_hub and webdataset
# Clone the repository
git clone https://github.com/billpsomas/icir.git
cd icir
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Ensure you have the following structure:
icir/
├── data/
│ ├── icir/ # i-CIR dataset (local folder layout or WebDataset shards)
│ └── laion_mean/ # Pre-computed LAION means
├── corpora/
│ ├── generic_subjects.csv # Positive corpus (objects)
│ └── generic_styles.csv # Negative corpus (styles)
└── synthetic_data/ # Score normalization data
├── dataset_1_sd_clip.pkl.npy
└── dataset_1_sd_siglip.pkl.npy
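Before extracting features, a quick sanity check like the one below can save a failed run; the paths follow the tree above, so adjust them if your data lives elsewhere.

from pathlib import Path

required = [
    "data/icir",
    "data/laion_mean",
    "corpora/generic_subjects.csv",
    "corpora/generic_styles.csv",
    "synthetic_data",
]
missing = [p for p in required if not Path(p).exists()]
print("All required paths found." if not missing else f"Missing: {missing}")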
Extract features for the i-CIR dataset and text corpora:
# Extract i-CIR dataset features (local folder layout)
python3 create_features.py --dataset icir --icir_source folder --backbone clip --batch 512 --gpu 0
# Extract i-CIR dataset features (WebDataset shards)
python3 create_features.py --dataset icir --icir_source wds --backbone clip --batch 512 --gpu 0
# Extract corpus features
python3 create_features.py --dataset corpus --backbone clip --batch 512 --gpu 0
Features will be saved to features/{backbone}_features/.
The easiest way is to use method presets with --use_preset:
# Full BASIC method (recommended)
python3 run_retrieval.py --method basic --use_preset
# Baseline methods
python3 run_retrieval.py --method sum --use_preset
python3 run_retrieval.py --method product --use_preset
python3 run_retrieval.py --method image --use_preset
python3 run_retrieval.py --method text --use_preset
For advanced usage with custom parameters:
python3 run_retrieval.py \
--method basic \
--backbone clip \
--dataset icir \
--results_dir results/ \
--specified_corpus generic_subjects \
--specified_ncorpus generic_styles \
--num_principal_components_for_projection 250 \
--aa 0.2 \
--standardize_features \
--use_laion_mean \
--project_features \
--do_query_expansion \
--contextualize \
--normalize_similarities \
--path_to_synthetic_data ./synthetic_data \
--harris_lambda 0.1
The codebase implements several retrieval methods (the sum and product baselines are sketched after this list):
- basic: Full decomposition method with all components (PCA projection, query expansion, Harris fusion)
- sum: Simple sum of image and text similarities
- product: Simple product of image and text similarities
- image: Image-only retrieval (ignores text)
- text: Text-only retrieval (ignores image)
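For reference, the simple baselines reduce to the following fusion of cosine similarities. This is an illustrative sketch with assumed variable names; the full BASIC pipeline is sketched in the Overview above.

import numpy as np

def cosine(db, q):
    # Cosine similarity between database rows and a single query vector.
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    return db @ q  # shape (N,)

def fuse(s_img, s_txt, method):
    if method == "sum":      # Text + Image
        return s_img + s_txt
    if method == "product":  # Text x Image
        return s_img * s_txt
    if method == "image":    # image-only
        return s_img
    if method == "text":     # text-only
        return s_txt
    raise ValueError(f"unknown method: {method}")

# Ranking example (db_feats, img_query, txt_query are pre-extracted embeddings):
# ranks = np.argsort(-fuse(cosine(db_feats, img_query), cosine(db_feats, txt_query), "product"))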
- --method: Retrieval method (basic, sum, product, image, text)
- --backbone: Vision-language model (clip for ViT-L/14, siglip for ViT-L-16-SigLIP-256)
- --use_preset: Use predefined method configurations (recommended)
- --specified_corpus: Positive corpus for projection (default: generic_subjects)
- --specified_ncorpus: Negative corpus for projection (default: generic_styles)
- --num_principal_components_for_projection: PCA components; >1 for an exact count, <1 for an energy threshold (default: 250)
- --aa: Negative corpus weight in contrastive PCA (default: 0.2)
- --harris_lambda: Harris fusion parameter (default: 0.1)
- --contextualize: Add corpus objects to the text query to contextualize it
- --standardize_features: Center features before projection
- --use_laion_mean: Use the pre-computed LAION mean for centering
- --project_features: Apply PCA projection
- --do_query_expansion: Expand queries with retrieved images
- --normalize_similarities: Apply score normalization using synthetic data
Text corpora define semantic spaces for PCA projection:
- generic_subjects.csv: General object/subject descriptions (positive corpus)
- generic_styles.csv: General style/attribute descriptions (negative corpus)
Corpora are CSV files with a single column of text descriptions, loaded from the corpora/ directory.
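To inspect a corpus, something like the snippet below works; whether the CSV has a header row is not guaranteed here, so adjust the header argument if needed.

import pandas as pd

# Load the positive corpus; pass header=None if the file has no header row.
subjects = pd.read_csv("corpora/generic_subjects.csv")
print(len(subjects), "entries")
print(subjects.iloc[:5, 0].tolist())  # first few text descriptions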
Results are saved to the specified results directory (default: results/):
results/
└── icir/
└── {method_variant}/
└── mAP_table.csv # Mean Average Precision results
Each result file includes:
- mAP score for the retrieval method (see the AP sketch below)
- Configuration parameters used (for basic method only)
- Timestamp of the experiment
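For reference, the reported mAP is the mean, over composed queries, of average precision. A generic AP over a ranked list (not the project's own evaluation code in utils.py) looks like this:

import numpy as np

def average_precision(ranked_ids, positive_ids):
    # AP of a ranked list of database ids given the set of positives for one query.
    positives = set(positive_ids)
    hits, precisions = 0, []
    for rank, db_id in enumerate(ranked_ids, start=1):
        if db_id in positives:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

# mAP = mean of average_precision over all composed queries.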
| Method | ImageNet-R | NICO | Mini-DN | LTLL | i-CIR |
|---|---|---|---|---|---|
| Text | 0.74 | 1.09 | 0.57 | 5.72 | 3.01 |
| Image | 3.84 | 6.32 | 6.66 | 16.49 | 3.04 |
| Text + Image | 6.21 | 9.30 | 9.33 | 17.86 | 8.20 |
| Text × Image | 7.83 | 9.79 | 9.86 | 23.16 | 17.48 |
| WeiCom | 10.47 | 10.54 | 8.52 | 26.60 | 18.03 |
| PicWord | 7.88 | 9.76 | 12.00 | 21.27 | 19.36 |
| CompoDiff | 12.88 | 10.32 | 22.95 | 21.61 | 9.63 |
| CIReVL | 18.11 | 17.80 | 26.20 | 32.60 | 18.66 |
| Searle | 14.04 | 15.13 | 21.78 | 25.46 | 19.90 |
| MCL | 8.13 | 19.09 | 18.41 | 16.67 | 19.89 |
| MagicLens | 9.13 | 19.66 | 20.06 | 24.21 | 27.35 |
| CoVR | 11.52 | 24.93 | 27.76 | 24.68 | 28.50 |
| FREEDOM | 29.91 | 26.10 | 37.27 | 33.24 | 17.24 |
| FREEDOM† | 25.81 | 23.24 | 32.14 | 30.82 | 15.76 |
| BASIC | 32.13 | 31.65 | 39.58 | 41.38 | 31.64 |
| BASIC† | 27.54 | 28.90 | 35.75 | 38.22 | 34.35 |
† Without query expansion.
icir/
├── run_retrieval.py # Main retrieval script
├── create_features.py # Feature extraction script
├── utils.py # General utilities (device setup, text processing, evaluation)
├── utils_features.py # Feature I/O and model loading
├── utils_retrieval.py # Core retrieval algorithms
├── requirements.txt # Python dependencies
├── README.md # This file
├── LICENSE # MIT License
├── data/ # Dataset and normalization data
├── corpora/ # Text corpus files
├── features/ # Extracted features (generated)
└── results/ # Retrieval results (generated)
If you find BASIC and/or i-CIR useful in your research, please consider starring ⭐ the repository on GitHub and citing 📚 our paper!
@inproceedings{
psomas2025instancelevel,
title={Instance-Level Composed Image Retrieval},
author={Bill Psomas and George Retsinas and Nikos Efthymiadis and Panagiotis Filntisis and Yannis Avrithis and Petros Maragos and Ondrej Chum and Giorgos Tolias},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025}
}
- This code is licensed under the MIT License - see the LICENSE file for details.
- This dataset is licensed under the CC-BY-NC-SA License - see the dataset's LICENSE file for details.
- Vision-language models via OpenCLIP
- LAION-1M statistics for feature standardization
For questions or issues, please open an issue on GitHub or contact Bill Psomas.



