This repository contains the dataset and evaluation script for CapRetrieval, introduced in the paper *Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings*.
The dataset is also available on Hugging Face.
CapRetrieval evaluates fine-grained embedding matching, tailored towards a practical image search scenario in Chinese via dense passage retrieval:
- Candidate passages are image captions, and queries are short phrases of entities or events reflected in the captions.
- Overall, the dataset comprises seemingly simple queries and captions; however, text encoders exhibit limitations in resolving these cases.
- The evaluation results call for attention to embedding training strategies at different levels of granularity.
CapRetrieval follows the same retrieval task format as in MTEB, with graded relevance labels for query-caption pairs.
A small number of queries do not have any relevant captions; they are excluded from the computation of retrieval metrics (e.g. nDCG), but can be useful for other analyses, e.g. in a classification setting, as sketched below.
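As a rough sketch of this MTEB-style layout, the snippet below loads the data and identifies the label-free queries. The file names and field names here are illustrative assumptions, not necessarily the repository's actual schema.

```python
# Minimal loading sketch; paths and field names are assumptions.
import json
from collections import defaultdict

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

queries = load_jsonl("queries.jsonl")  # hypothetical: [{"id": ..., "text": ...}]
qrels = defaultdict(dict)              # query id -> {caption id: graded label}
for row in load_jsonl("qrels.jsonl"):  # hypothetical file name
    qrels[row["qid"]][row["cid"]] = row["label"]

# Queries without any positively-labeled caption are skipped for retrieval
# metrics such as nDCG, but remain usable for classification-style analysis.
no_positive = [q["id"] for q in queries
               if not any(label > 0 for label in qrels[q["id"]].values())]
print(f"{len(no_positive)} of {len(queries)} queries have no relevant caption")
```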
`run.py` is a general script to evaluate embedding retrieval with various encoders. Results and embeddings are saved under a new `evaluation` directory.
Install PyTorch according to your local environment, then run `pip install -r requirements.txt`.

See all options with `python run.py --help`.
By default, the script automatically selects the most appropriate device; you can also set `--device_map` explicitly. Embeddings are cached and reused across runs.
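For reference, such device auto-selection commonly looks like the sketch below; this is a plausible reading of the behavior, not necessarily the exact logic in `run.py`.

```python
# Plausible device auto-selection in PyTorch; run.py may differ.
import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    if torch.backends.mps.is_available():
        return "mps"   # Apple Silicon GPU
    return "cpu"       # fallback
```

Passing `--device_map` simply overrides whatever the script would pick.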
Current Options

```
options:
  -h, --help            show this help message and exit
  --dataset DATASET     Dataset name
  --lang {en,zh}        Dataset language (for BM25)
  --mode {dense,bm25}   Search mode
  --model MODEL         HF model name or path
  --device_map DEVICE_MAP
                        Set model device map explicitly
  --max_len MAX_LEN     Max seq length
  --pooling {cls,mean,last,use_sentence_transformer}
                        Encoder pooling style
  --disable_normalization
                        Disable embedding normalization
  --query_template QUERY_TEMPLATE
                        Prompt template for query
  --candidate_template CANDIDATE_TEMPLATE
                        Prompt template for candidate
  --padding_side {left,right}
                        Tokenizer padding side
  --threshold THRESHOLD
                        Use results under distance threshold for evaluation
  --topk TOPK           Use top k results for evaluation
  --batch_size BATCH_SIZE
                        Eval batch size
  --result_path RESULT_PATH
                        Compute metrics of existing results directly
```
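To clarify how `--topk` and `--threshold` plausibly interact in dense mode, here is a minimal sketch; it assumes L2-normalized embeddings (so dot product equals cosine similarity) and treats distance as `1 - similarity`. Both are assumptions for illustration, not the script's confirmed behavior.

```python
# Illustrative top-k search with an optional distance threshold.
import numpy as np

def search(q_emb: np.ndarray, c_emb: np.ndarray, topk: int = 10,
           threshold: float | None = None):
    scores = q_emb @ c_emb.T  # (num_queries, num_candidates) cosine similarities
    results = []
    for row in scores:
        idx = np.argsort(-row)[:topk]  # highest-scoring candidates first
        if threshold is not None:
            # assumed convention: keep results whose distance 1 - s
            # falls under the threshold
            idx = [i for i in idx if 1.0 - row[i] < threshold]
        results.append([(int(i), float(row[i])) for i in idx])
    return results
```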
Output Example

```
Search: 100%|██████████| 404/404 [00:00<00:00, 5315.29it/s]
Metrics for dataset CapRetrieval:
Query evaluation: reciprocal_rank @top10 = 88.70
Query evaluation: average_precision @top10 = 82.91
Query evaluation: ndcg @top10 = 78.86
Query evaluation: hit @top10 = 92.08
Query evaluation: query_precision = 38.22
Query evaluation: query_recall = 68.71
Query evaluation: pair_precision = 38.22
Query evaluation: pair_recall = 32.97
Query evaluation: query_f1 = 49.12
Query evaluation: query_f2 = 59.25
Query evaluation: pair_f1 = 35.40
Query evaluation: pair_f2 = 33.90
Saved 404 query results to evaluation/results.CapRetrieval.bge-base-zh-v1.5.top10.json
Saved report to evaluation/report.CapRetrieval.bge-base-zh-v1.5.top10.json
```
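For readers unfamiliar with the headline metric, the sketch below shows one conventional way to compute nDCG@10 from graded relevance labels. It is only meant to clarify the reported number and is not copied from `run.py`, which may use a different gain convention.

```python
# nDCG@k with linear gain, one common convention.
import math

def ndcg_at_k(ranked_labels, all_labels, k=10):
    # ranked_labels: graded labels of retrieved captions, in rank order
    # all_labels: graded labels of all labeled captions for this query
    dcg = sum(l / math.log2(i + 2) for i, l in enumerate(ranked_labels[:k]))
    ideal = sorted(all_labels, reverse=True)[:k]
    idcg = sum(l / math.log2(i + 2) for i, l in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([2, 0, 1], [2, 1, 1, 0]))  # imperfect ranking -> below 1.0
```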
Evaluate BM25 (currently only language `zh` is supported):

```bash
python run.py --dataset CapRetrieval --topk 10 --mode bm25 --lang zh
```
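The `zh`-only restriction presumably comes down to tokenization. As a minimal illustration of BM25 over Chinese captions, the sketch below uses `jieba` and `rank_bm25` as stand-in libraries; these are assumptions for illustration, not necessarily what `run.py` uses.

```python
# BM25 over word-segmented Chinese text; jieba/rank_bm25 are stand-ins.
import jieba
from rank_bm25 import BM25Okapi

captions = ["一只猫在草地上晒太阳", "夜晚的城市天际线"]  # toy corpus
bm25 = BM25Okapi([list(jieba.cut(c)) for c in captions])

query = "猫晒太阳"
scores = bm25.get_scores(list(jieba.cut(query)))  # higher = better match
print(scores)
```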
Evaluate BGE encoders using CLS pooling (the default):

```bash
python run.py --dataset CapRetrieval --topk 10 --model BAAI/bge-base-zh-v1.5
```
Evaluate the GTE multilingual model using CLS pooling:

```bash
python run.py --dataset CapRetrieval --topk 10 --model Alibaba-NLP/gte-multilingual-base
```
Evaluate the Conan-v1 encoder using the default SentenceTransformers setup:

```bash
python run.py --dataset CapRetrieval --topk 10 --model TencentBAC/Conan-embedding-v1 --pooling use_sentence_transformer
```
Evaluate E5 encoders using mean pooling, with the suggested prompt templates:

```bash
python run.py --dataset CapRetrieval --topk 10 --model intfloat/multilingual-e5-base --pooling mean --max_len 512 --query_template "query: {text}" --candidate_template "passage: {text}"
```
Evaluate GTE-Qwen encoders using last-token pooling, with the corresponding prompt templates:

```bash
python run.py --dataset CapRetrieval --topk 10 --model Alibaba-NLP/gte-Qwen2-7B-instruct --pooling last --query_template "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}" --batch_size 8
```
Evaluate Qwen3 embedding models using last-token pooling, with the corresponding prompt templates:

```bash
python run.py --dataset CapRetrieval --topk 10 --model Qwen/Qwen3-Embedding-8B --padding_side left --pooling last --query_template "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}" --batch_size 8
```
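The `--pooling` styles used above reduce to a small amount of tensor logic. Below is a sketch of how `cls`, `mean`, and `last` pooling are conventionally derived from an encoder's last hidden states, including why `--padding_side left` pairs naturally with last-token pooling; it is illustrative and may differ from the exact `run.py` code.

```python
# Conventional pooling over (batch, seq_len, dim) hidden states.
import torch

def pool(hidden: torch.Tensor, mask: torch.Tensor, style: str) -> torch.Tensor:
    # mask: (batch, seq_len), 1 for real tokens, 0 for padding
    if style == "cls":
        return hidden[:, 0]  # first token ([CLS])
    if style == "mean":
        m = mask.unsqueeze(-1).float()
        return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)
    if style == "last":
        if bool(mask[:, -1].all()):
            return hidden[:, -1]   # left padding: last position is a real token
        idx = mask.sum(dim=1) - 1  # right padding: index of last real token
        return hidden[torch.arange(hidden.size(0)), idx]
    raise ValueError(f"unknown pooling style: {style}")

# Unless --disable_normalization is set, embeddings are then typically
# L2-normalized so dot products act as cosine similarities:
# emb = torch.nn.functional.normalize(pool(hidden, mask, "last"), dim=-1)
```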
| Type | Model | nDCG@10 |
|---|---|---|
| BM25 | Basic BM25 | 66.54 |
| 0.1B | bge-base-zh-v1.5 | 78.86 |
| | gte-multilingual-base | 79.67 |
| | multilingual-e5-base | 76.33 |
| 0.3B | bge-large-zh-v1.5 | 79.15 |
| | multilingual-e5-large | 81.01 |
| | Conan-embedding-v1 | 77.04 |
| 0.6B | Qwen3-Embedding-0.6B | 81.04 |
| >1B | gte-Qwen2-1.5B-instruct | 77.35 |
| | gte-Qwen2-7B-instruct | 86.55 |
| | e5-mistral-7b-instruct | 76.40 |
| | Qwen3-Embedding-8B | 84.61 |
| Trained | Out-of-Domain | 87.23 |
| | In-Domain | 91.83 |
The trained models (based on `bge-base-zh-v1.5`) are trained on queries produced by the data generation strategies described in the paper. The in-domain model can be downloaded from Google Drive.
The dataset and trained models are licensed under Apache 2.0.
```bibtex
@misc{xu2025denseretrieversfailsimple,
      title={Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings},
      author={Liyan Xu and Zhenlin Su and Mo Yu and Jiangnan Li and Fandong Meng and Jie Zhou},
      year={2025},
      eprint={2506.08592},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08592},
}
```