```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

or, with uv:

```bash
uv sync
source .venv/bin/activate
```
Download the relevant datasets from the following links.
To build the textual databases to compute image-to-text and text-to-text retrieval, look at the following files.
- For CC12M, look at CC12M preprocessing SigLIP
- For COYO700M, look at COYO preprocessing SigLIP
- For CC12M, look at CC12M preprocessing Mistral
- For COYO700M, look at COYO preprocessing Mistral
Then look at the awesome guide by [Alessandro Conti](https://github.com/altndrr/vic/issues/8#issuecomment-1594732756) to create the fast FAISS index with the embeddings.
We will shortly release our pre-computed databases for the two datasets above.
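As a minimal sketch of the index-building step described in the guide above (the file paths and the embedding array are placeholders, not the repository's actual artifacts), a flat inner-product FAISS index over the pre-computed caption embeddings can be built like this:

```python
import faiss
import numpy as np

# Pre-computed caption embeddings (placeholder path): shape (num_captions, dim).
embeddings = np.load("cc12m_text_embeddings.npy").astype("float32")

# L2-normalize so that inner product equals cosine similarity.
faiss.normalize_L2(embeddings)

# A flat (exact) inner-product index; swap in an IVF/HNSW index for speed
# on larger databases, as discussed in the linked guide.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Persist the index to disk for the retrieval steps below.
faiss.write_index(index, "cc12m_siglip.index")
```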
You just need to create three CSV files like those in the artifacts folder: one mapping indices to class names, one listing the "train" samples, and one listing the test samples.
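The exact column layout is defined by the example files in the artifacts folder; as a purely illustrative sketch (the column names, paths, and class names below are assumptions), the three CSVs could be created like this:

```python
import pandas as pd

# Index-to-class-name mapping (column names are illustrative, check artifacts/).
pd.DataFrame(
    {"class_idx": [0, 1], "class_name": ["melanoma", "basal cell carcinoma"]}
).to_csv("my_dataset_classes.csv", index=False)

# "Train" and test splits: one row per image with its label index.
pd.DataFrame(
    {"image_path": ["images/0001.jpg"], "class_idx": [0]}
).to_csv("my_dataset_train.csv", index=False)

pd.DataFrame(
    {"image_path": ["images/0002.jpg"], "class_idx": [1]}
).to_csv("my_dataset_test.csv", index=False)
```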
Run image-to-text retrieval specifying the dataset, the dataset split, and which database you want to use (e.g. `coyo_siglip` or `cc12m_siglip`). The resulting embeddings will be stored in `retrieved_embeddings`.
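In case it helps to see the retrieval step spelled out, here is a minimal sketch of what image-to-text retrieval does: embed the image with the SigLIP image tower and query the FAISS caption index. The checkpoint, paths, `k`, and output format are assumptions, not the script's actual defaults.

```python
import os

import faiss
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

model = SiglipModel.from_pretrained("google/siglip-base-patch16-224").eval()
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
index = faiss.read_index("cc12m_siglip.index")  # placeholder path

# Embed a query image and L2-normalize it for cosine similarity search.
image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt")
    query = model.get_image_features(**inputs)
query = torch.nn.functional.normalize(query, dim=-1).numpy().astype("float32")

# Retrieve the k most similar captions and their stored embeddings.
k = 16
scores, ids = index.search(query, k)
retrieved = np.vstack([index.reconstruct(int(i)) for i in ids[0]])

# Save the per-image retrieved embeddings and similarities
# (the actual format used by the script may differ).
os.makedirs("retrieved_embeddings", exist_ok=True)
np.savez("retrieved_embeddings/example.npz", embeddings=retrieved, scores=scores[0])
```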
First run text-to-text retrieval, specifying the dataset, the database to use, and the embedding model. If you want to use a custom dataset with a custom retrieval prompt, modify `dataset_data` in this file. The resulting embeddings will be stored in `results`.
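For intuition, here is a minimal sketch of the text-to-text retrieval step using the SigLIP text tower: each class name is turned into a retrieval prompt, embedded, and used to query the same caption database. The prompt template, checkpoint, paths, and output format are assumptions.

```python
import os

import faiss
import numpy as np
import torch
from transformers import AutoProcessor, SiglipModel

model = SiglipModel.from_pretrained("google/siglip-base-patch16-224").eval()
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
index = faiss.read_index("cc12m_siglip.index")  # placeholder path

class_names = ["melanoma", "basal cell carcinoma"]  # illustrative classes
prompts = [f"a photo of a {name}." for name in class_names]

with torch.no_grad():
    inputs = processor(text=prompts, padding="max_length", return_tensors="pt")
    queries = model.get_text_features(**inputs)
queries = torch.nn.functional.normalize(queries, dim=-1).numpy().astype("float32")

# For each class prompt, fetch the k nearest caption embeddings.
k = 16
scores, ids = index.search(queries, k)
retrieved = np.stack(
    [np.vstack([index.reconstruct(int(i)) for i in row]) for row in ids]
)  # (num_classes, k, dim)

# Store the per-class retrieved embeddings (the actual file format may differ).
os.makedirs("results", exist_ok=True)
np.savez("results/text_retrieval.npz", embeddings=retrieved, scores=scores)
```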
Once you obtain the embeddings, you can build the retrieved zero-shot weights with different temperatures by running the create zero-shot script, specifying which model has been used (siglip/mistral), the dataset, and the suffix indicating whether you used the `common` object names (for Circuits and HAM10000) or the `premerged` (merged common/scientific) names for iNaturalist.
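As a hedged sketch of the aggregation (the exact formula used by the script may differ), the retrieved zero-shot weight of a class can be built as a temperature-softmax-weighted average of its retrieved caption embeddings:

```python
import numpy as np

def retrieved_zero_shot_weights(retrieved, scores, temperature=0.01):
    """Aggregate retrieved caption embeddings into per-class classifier weights.

    retrieved: (num_classes, k, dim) retrieved caption embeddings
    scores:    (num_classes, k) retrieval similarities
    """
    # Softmax over the k retrieved captions of each class, sharpened by the temperature.
    logits = scores / temperature
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Weighted average of the retrieved embeddings, then L2-normalize each class weight.
    weights = (probs[..., None] * retrieved).sum(axis=1)
    return weights / np.linalg.norm(weights, axis=1, keepdims=True)
```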
Lastly, you can combine the image-to-text and text-to-text retrieval results to run the enriched zero-shot predictions using this script.
You need to specify: the dataset, the database used for image-to-text retrieval (cc12m/coyo), the dataset split (train/test), the alpha and beta values to test for the merging (floats or lists), and the temperature for computing the image-to-text retrieval weight distribution.
For the text-to-text retrieval part, look at line 17 onwards to load your saved retrieved zero-shot weights; you have to specify the temperature used to extract them (WIP: infer it from the file name).
The output will be the parameters that led to the best Acc@1; if you run the script on the train split, you can then use the output parameters on your test split.
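To make the alpha/beta merging concrete, here is a hedged sketch of how the three signals could be combined (the actual merging formula lives in the script; all names below are placeholders):

```python
import numpy as np

def enriched_predictions(image_feats, name_weights, retrieved_weights,
                         i2t_probs, alpha=0.5, beta=0.5):
    """Combine the zero-shot signals with mixing coefficients alpha and beta.

    image_feats:       (num_images, dim) L2-normalized image embeddings
    name_weights:      (num_classes, dim) plain class-name zero-shot weights
    retrieved_weights: (num_classes, dim) text-to-text retrieved zero-shot weights
    i2t_probs:         (num_images, num_classes) image-to-text retrieval distribution
    """
    # Blend the two text-side classifiers, then renormalize each class weight.
    text_weights = alpha * name_weights + (1 - alpha) * retrieved_weights
    text_weights /= np.linalg.norm(text_weights, axis=1, keepdims=True)

    # Zero-shot scores from the blended classifier, mixed with the
    # image-to-text retrieval distribution via beta.
    scores = image_feats @ text_weights.T
    merged = beta * scores + (1 - beta) * i2t_probs
    return merged.argmax(axis=1)

def acc_at_1(preds, labels):
    return float((preds == np.asarray(labels)).mean())
```

Grid-searching alpha, beta, and the temperature on the train split and reusing the best values on the test split mirrors the protocol above.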
If you find our research useful, please cite us as:
```bibtex
@inproceedings{dallasen2024retrieval,
    title = "Retrieval-enriched zero-shot image classification in low-resource domains",
    author = "Dall{'}Asen, Nicola and
      Wang, Yiming and
      Fini, Enrico and
      Ricci, Elisa",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1186/",
    doi = "10.18653/v1/2024.emnlp-main.1186",
}
```
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.