Contributors: Ali Nafisi, Hossein Shakibania
This repository contains our solution for the Compositional Image Retrieval Challenge, part of the Rayan International AI Contest. The challenge asks participants to develop a system that retrieves the most relevant image from a database by jointly understanding a visual and a textual input.
The task requires building a system that can:
- Process input consisting of:
  - A reference image containing a visual scene
  - A text description providing additional context or modifications
- Identify and return the single most relevant matching image from a provided database.
The figure below serves as an example of this task:
- 📏 Maximum Model Size: 4 GB
- 🚫 Not Allowed in Final Model for Inference:
  - ❌ Large Language Models (LLMs)
  - ❌ Object detection models
  - ❌ Pre-trained models that directly solve the task without modifications
- ✅ Allowed:
  - ✔️ Pre-trained Vision-Language Models (e.g., CLIP), if fine-tuned for this task
Our approach leverages natural language processing and vision-language models to achieve compositional retrieval in an efficient and innovative manner, while adhering to contest constraints.
We wanted a model that can identify which objects in the query text should be added to or removed from the query image, since these objects directly determine the target image.
To create a dataset for our task, we utilized free versions of LLMs such as Gemini, GPT, and Claude. These models generated 635 unique templates resembling the Query Text, available in the prompt_templates.json file. Using predefined objects, we expanded these templates into 15 variations each, resulting in a dataset of 9,525 instances.
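The snippet below is an illustrative sketch of this expansion step; the placeholder syntax and the object list are assumptions for demonstration, not the exact contents of prompt_templates.json.

```python
# Illustrative sketch of the template-expansion step. The placeholder format
# ({pos}/{neg}) and the object list are assumptions, not the actual file contents.
import json
import random

with open("prompt_templates.json") as f:
    templates = json.load(f)  # e.g. "Take out the {neg}; add a {pos}."

objects = ["laptop", "basket", "jacket", "sketchpad", "watering can"]

dataset = []
for template in templates:
    for _ in range(15):  # 15 variations per template: 635 * 15 = 9,525 instances
        pos, neg = random.sample(objects, 2)
        dataset.append(template.format(pos=pos, neg=neg))
```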
We then fine-tuned the DistilBERT language model on this curated dataset as a token classifier. The model assigns each token in the query text to one of three categories:
- Positive (pos): Objects to be added to the query image.
- Negative (neg): Objects to be removed from the query image.
- Other: Articles, verbs, punctuation, and other irrelevant tokens.
An example output demonstrates the DistilBERT model’s capability to identify actionable tokens:
"Take out the jacket and the sketchpad; add a laptop, a watering can, and a basket."
other: take, other: out, other: the, neg: jacket, other: and, other: the, neg: sketchpad, other: ;, other: add, other: a, pos: laptop, other: ,, other: a, pos: watering, pos: can, other: ,, other: and, other: a, pos: basket, other: .
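For reference, the sketch below shows how such a fine-tuned token classifier could be queried with the Hugging Face transformers API; the checkpoint path is a hypothetical placeholder, and the actual pipeline lives in run.py.

```python
# Minimal inference sketch for the token classifier (not the repository's exact code).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "path/to/finetuned-distilbert"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)
model.eval()

query = "Take out the jacket and the sketchpad; add a laptop, a watering can, and a basket."
inputs = tokenizer(query, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)

predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Print each (sub)token with its predicted label: pos, neg, or other.
for token, pred in zip(tokens, predictions):
    if token in tokenizer.all_special_tokens:
        continue
    print(f"{model.config.id2label[pred]}: {token}")
```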
Using this classification, we generate an embedding for each positive and negative object with the template "a photo of a <object>.". These text embeddings are later used to refine the query image embedding.
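A rough sketch of this step with the OpenCLIP API is shown below; it uses a standard CLIP checkpoint as a stand-in for our fine-tuned ViTamin variant, so the model name and pretrained tag are assumptions.

```python
# Sketch of embedding objects via the prompt template, using OpenCLIP.
# The ViT-B-32 checkpoint is a stand-in; the repository fine-tunes a ViTamin variant.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def embed_objects(objects):
    """Encode each object with the 'a photo of a <object>.' template."""
    prompts = [f"a photo of a {obj}." for obj in objects]
    with torch.no_grad():
        emb = model.encode_text(tokenizer(prompts))
    return emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize

pos_emb = embed_objects(["laptop", "watering can", "basket"])
neg_emb = embed_objects(["jacket", "sketchpad"])
```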
We fine-tuned a variant of the ViTamin model from OpenCLIP, a robust vision-language model, as the backbone for extracting features from both the textual and visual modalities. The core innovation lies in modifying the query image embedding by:
- Adding embeddings of positive objects derived from the query text.
- Subtracting embeddings of negative objects.
This process dynamically adjusts the query embedding to closely represent the target image's characteristics. By avoiding object detection, image captioning, and LLMs at inference time, our approach remains computationally efficient while adhering to the contest constraints.
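Continuing the sketch above, the composed retrieval step could look roughly like the following; the equal-weight add/subtract and the placeholder file paths are assumptions rather than the repository's exact formulation.

```python
# Sketch of the composed retrieval step (continues the previous snippet:
# reuses torch, model, preprocess, pos_emb, and neg_emb). Paths are placeholders.
from PIL import Image

@torch.no_grad()
def encode_image(path):
    img = preprocess(Image.open(path)).unsqueeze(0)
    emb = model.encode_image(img)
    return emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize

query_emb = encode_image("query.jpg")

# Add positive-object embeddings, subtract negative ones, then re-normalize.
composed = query_emb + pos_emb.sum(dim=0) - neg_emb.sum(dim=0)
composed = composed / composed.norm(dim=-1, keepdim=True)

# Rank database images by cosine similarity to the composed embedding.
db_paths = ["db/img_001.jpg", "db/img_002.jpg"]  # placeholder database
db_embs = torch.cat([encode_image(p) for p in db_paths])
scores = (composed @ db_embs.T).squeeze(0)
print("Retrieved:", db_paths[scores.argmax().item()])
```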
Our solution to this challenge achieved outstanding results. The evaluation metric is accuracy, with models tested on a private test dataset. Our model achieved the highest score, outperforming the other teams by a noticeable margin.
The table below presents a summary of the Top 🔟 teams and their respective accuracy scores:
| Rank | Team | Accuracy (%) |
|---|---|---|
| 🥇 | No Trust Issues Here (Our Team) | 95.38 |
| 🥈 | Pileh | 84.61 |
| 🥉 | AI Guardians of Trust | 88.59 |
| 4 | AIUoK | 87.30 |
| 5 | red_serotonin | 86.90 |
| 6 | GGWP | 85.70 |
| 7 | Persistence | 85.20 |
| 8 | AlphaQ | 84.50 |
| 9 | Tempest | 83.90 |
| 10 | Scientific | 82.70 |
Follow these instructions to set up your environment and execute the training pipeline.
```bash
git clone git@github.com:safinal/compositional-image-retrieval.git
cd compositional-image-retrieval
```

We recommend using a virtual environment to manage dependencies.

Using venv:

```bash
python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate     # On Windows
```

Using conda:

```bash
conda create --name compositional-image-retrieval python=3.8 -y
conda activate compositional-image-retrieval
```

Install all required libraries from the requirements.txt file:

```bash
pip install -r requirements.txt
```

DistilBERT:

```bash
python run.py --model_type token_cls --config ./config/token_cls_cfg.yaml
```

Retrieval Model:

```bash
python run.py --model_type retrieval --config ./config/retrieval_cfg.yaml
```

We thank the authors of DistilBERT and ViTamin and the creators of OpenCLIP for their invaluable contributions to the development of vision-language models.
We welcome contributions from the community to make this repository better!
