Contributors: Ali Nafisi, Hossein Shakibania
This repository contains our solution for the Compositional Image Retrieval Challenge, part of the Rayan International AI Contest. The challenge asks participants to develop a system that retrieves the most relevant image from a database by jointly understanding a visual and a textual input.
The task requires building a system that can:
- Process input consisting of:
  - A reference image containing a visual scene
  - A text description providing additional context or modifications
- Identify and return the single most relevant matching image from a provided database.
The figure below serves as an example of this task:
- 📏 Maximum Model Size: 4 GB
- 🚫 Not Allowed in Final Model for Inference:
  - ❌ Large Language Models (LLMs)
  - ❌ Object detection models
  - ❌ Pre-trained models that directly solve the task without modifications
- ✅ Allowed:
  - ✔️ Pre-trained Vision-Language Models (e.g., CLIP), if fine-tuned for this task
Our approach leverages natural language processing and vision-language models to achieve compositional retrieval in an efficient and innovative manner, while adhering to contest constraints.
We wanted a model that can identify which objects in the query text should be added to or removed from the query image, since these objects directly determine the target image.
To create a dataset for our task, we utilized free versions of LLMs such as Gemini, GPT, and Claude. These models generated 635 unique templates resembling the Query Text, available in the prompt_templates.json file. Using predefined objects, we expanded these templates into 15 variations each, resulting in a dataset of 9,525 instances.
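The snippet below is an illustrative sketch of this expansion step; the placeholder syntax and the object list are assumptions for demonstration, not the exact contents of prompt_templates.json.

```python
# Illustrative sketch of the template-expansion step. The placeholder format
# ({pos}/{neg}) and the object list are assumptions, not the actual file contents.
import json
import random

with open("prompt_templates.json") as f:
    templates = json.load(f)  # e.g. "Take out the {neg}; add a {pos}."

objects = ["laptop", "basket", "jacket", "sketchpad", "watering can"]

dataset = []
for template in templates:
    for _ in range(15):  # 15 variations per template: 635 * 15 = 9,525 instances
        pos, neg = random.sample(objects, 2)
        dataset.append(template.format(pos=pos, neg=neg))
```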
We then fine-tuned the DistilBERT language model on this curated dataset as a token classifier. The model assigns each token in the query text to one of three categories:
- Positive (pos): Objects to be added to the query image.
- Negative (neg): Objects to be removed from the query image.
- Other: Articles, verbs, punctuation, and other irrelevant tokens.
An example output demonstrates the DistilBERT model’s capability to identify actionable tokens:
"Take out the jacket and the sketchpad; add a laptop, a watering can, and a basket."
other: take, other: out, other: the, neg: jacket, other: and, other: the, neg: sketchpad, other: ;, other: add, other: a, pos: laptop, other: ,, other: a, pos: watering, pos: can, other: ,, other: and, other: a, pos: basket, other: .
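For reference, the sketch below shows how such a fine-tuned token classifier could be queried with the Hugging Face transformers API; the checkpoint path is a hypothetical placeholder, and the actual pipeline lives in run.py.

```python
# Minimal inference sketch for the token classifier (not the repository's exact code).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "path/to/finetuned-distilbert"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)
model.eval()

query = "Take out the jacket and the sketchpad; add a laptop, a watering can, and a basket."
inputs = tokenizer(query, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)

predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Print each (sub)token with its predicted label: pos, neg, or other.
for token, pred in zip(tokens, predictions):
    if token in tokenizer.all_special_tokens:
        continue
    print(f"{model.config.id2label[pred]}: {token}")
```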
Using this classification, we generate an embedding for each positive and negative object with the template "a photo of a <object>.". These text embeddings are later used to refine the query image embedding.
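A rough sketch of this step with the OpenCLIP API is shown below; it uses a standard CLIP checkpoint as a stand-in for our fine-tuned ViTamin variant, so the model name and pretrained tag are assumptions.

```python
# Sketch of embedding objects via the prompt template, using OpenCLIP.
# The ViT-B-32 checkpoint is a stand-in; the repository fine-tunes a ViTamin variant.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def embed_objects(objects):
    """Encode each object with the 'a photo of a <object>.' template."""
    prompts = [f"a photo of a {obj}." for obj in objects]
    with torch.no_grad():
        emb = model.encode_text(tokenizer(prompts))
    return emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize

pos_emb = embed_objects(["laptop", "watering can", "basket"])
neg_emb = embed_objects(["jacket", "sketchpad"])
```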
We fine-tuned a variant of the ViTamin model from OpenCLIP, a robust vision-language model, as the backbone for extracting features from both the textual and visual modalities. The core innovation lies in modifying the query image embedding by:
- Adding embeddings of positive objects derived from the query text.
- Subtracting embeddings of negative objects.
This process dynamically adjusts the query embedding to closely represent the target image's characteristics. By avoiding object detection, image captioning, and LLMs at inference time, our approach remains computationally efficient while adhering to the contest constraints.
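Continuing the sketch above, the composed retrieval step could look roughly like the following; the equal-weight add/subtract and the placeholder file paths are assumptions rather than the repository's exact formulation.

```python
# Sketch of the composed retrieval step (continues the previous snippet:
# reuses torch, model, preprocess, pos_emb, and neg_emb). Paths are placeholders.
from PIL import Image

@torch.no_grad()
def encode_image(path):
    img = preprocess(Image.open(path)).unsqueeze(0)
    emb = model.encode_image(img)
    return emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize

query_emb = encode_image("query.jpg")

# Add positive-object embeddings, subtract negative ones, then re-normalize.
composed = query_emb + pos_emb.sum(dim=0) - neg_emb.sum(dim=0)
composed = composed / composed.norm(dim=-1, keepdim=True)

# Rank database images by cosine similarity to the composed embedding.
db_paths = ["db/img_001.jpg", "db/img_002.jpg"]  # placeholder database
db_embs = torch.cat([encode_image(p) for p in db_paths])
scores = (composed @ db_embs.T).squeeze(0)
print("Retrieved:", db_paths[scores.argmax().item()])
```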
Our solution to this challenge achieved outstanding results. The evaluation metric is accuracy, with models tested on a private test dataset. Our model achieved the highest score, outperforming the other teams by a noticeable margin.
The table below presents a summary of the Top 🔟 teams and their respective accuracy scores:
| Rank | Team | Accuracy (%) |
|---|---|---|
| 🥇 | No Trust Issues Here (Our Team) | 95.38 |
| 🥈 | Pileh | 84.61 |
| 🥉 | AI Guardians of Trust | 88.59 |
| 4 | AIUoK | 87.30 |
| 5 | red_serotonin | 86.90 |
| 6 | GGWP | 85.70 |
| 7 | Persistence | 85.20 |
| 8 | AlphaQ | 84.50 |
| 9 | Tempest | 83.90 |
| 10 | Scientific | 82.70 |
Follow these instructions to set up your environment and execute the training pipeline.
```bash
git clone git@github.com:safinal/compositional-image-retrieval.git
cd compositional-image-retrieval
```

We recommend using a virtual environment to manage dependencies.

Using venv:

```bash
python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate     # On Windows
```

Using conda:

```bash
conda create --name compositional-image-retrieval python=3.8 -y
conda activate compositional-image-retrieval
```

Install all required libraries from the requirements.txt file:

```bash
pip install -r requirements.txt
```

DistilBERT:

```bash
python run.py --model_type token_cls --config ./config/token_cls_cfg.yaml
```

Retrieval Model:

```bash
python run.py --model_type retrieval --config ./config/retrieval_cfg.yaml
```

We thank the authors of DistilBERT and ViTamin and the creators of OpenCLIP for their invaluable contributions to the development of vision-language models.
We welcome contributions from the community to make this repository better!
