CapRetrieval

This repository contains the dataset and evaluation script for CapRetrieval, introduced in the paper Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings.

The dataset is also available on Hugging Face.

Dataset

CapRetrieval evaluates fine-grained embedding matching, tailored towards a practical image search scenario in Chinese via dense passage retrieval:

  • Candidate passages are image captions, and queries are short phrases of entities or events reflected in the captions.
  • Overall, the dataset comprises seemingly simple queries and captions; however, text encoders are shown to have limitations in resolving these cases.
  • Evaluation results call for attention to embedding training strategies at different granularities.

Format

CapRetrieval follows the same retrieval task format as MTEB, with a relevance label in {0, 1, 2} for each pair. Note that unlike prior datasets, we annotate full labels for every query-passage pair (1.3 million pairs in total), minimizing false negatives for more accurate evaluation.

A small number of queries do not have any relevant captions; they are excluded from the computation of retrieval metrics (e.g. nDCG), but can be useful for other analyses, e.g. in a classification setting.
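
Concretely, the MTEB-style retrieval format reduces to three mappings: a corpus of captions, a set of queries, and graded relevance judgments (qrels). A minimal sketch with invented ids and placeholder text, not actual dataset entries:

# Candidate passages: caption id -> caption text (illustrative only).
corpus = {
    "c0001": "一只金毛犬在沙滩上追逐飞盘",  # "a golden retriever chasing a frisbee on the beach"
    "c0002": "夕阳下的城市天际线",          # "a city skyline at sunset"
}

# Queries: query id -> short phrase of an entity or event.
queries = {
    "q0001": "狗玩飞盘",  # "dog playing frisbee"
}

# Graded relevance in {0, 1, 2}, annotated for every query-passage pair.
qrels = {
    "q0001": {"c0001": 2, "c0002": 0},
}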

Evaluation Script

run.py is a general script to evaluate embedding-based retrieval with various encoders.

Results and embeddings will be saved under a new evaluation directory.

Environment

Install PyTorch according to your local environment, then run pip install -r requirements.txt.

Usage

See the available options via python run.py --help

By default, the script automatically uses the most appropriate device; you can also set --device_map explicitly. Embeddings are cached and reused across runs.

Current Options

options:
  -h, --help            show this help message and exit
  --dataset DATASET     Dataset name
  --lang {en,zh}        Dataset language (for BM25)
  --mode {dense,bm25}   Search mode
  --model MODEL         HF model name or path
  --device_map DEVICE_MAP
                        Set model device map explicitly
  --max_len MAX_LEN     Max seq length
  --pooling {cls,mean,last,use_sentence_transformer}
                        Encoder pooling style
  --disable_normalization
                        Disable embedding normalization
  --query_template QUERY_TEMPLATE
                        Prompt template for query
  --candidate_template CANDIDATE_TEMPLATE
                        Prompt template for candidate
  --padding_side {left,right}
                        Tokenizer padding side
  --threshold THRESHOLD
                        Use results under distance threshold for evaluation
  --topk TOPK           Use top k results for evaluation
  --batch_size BATCH_SIZE
                        Eval batch size
  --result_path RESULT_PATH
                        Compute metrics of existing results directly
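
The pooling styles correspond to standard strategies over an encoder's final hidden states. A minimal PyTorch sketch of the three, for illustration only (not the script's actual implementation; assumes a Hugging Face encoder output and attention mask):

import torch

def pool(last_hidden, attention_mask, style="cls"):
    # last_hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    if style == "cls":
        return last_hidden[:, 0]  # first-token ([CLS]) embedding
    if style == "mean":
        mask = attention_mask.unsqueeze(-1).float()
        return (last_hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    if style == "last":
        # last non-padding token per sequence; with left padding this is
        # simply position -1 (hence --padding_side left for Qwen3 models)
        idx = attention_mask.sum(dim=1) - 1
        return last_hidden[torch.arange(last_hidden.size(0)), idx]
    raise ValueError(style)

# Unless --disable_normalization is set, embeddings are typically L2-normalized:
# emb = torch.nn.functional.normalize(pool(out.last_hidden_state, mask), dim=-1)
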
Output Example

Search: 100%|██████████| 404/404 [00:00<00:00, 5315.29it/s]
Metrics for dataset CapRetrieval:
Query evaluation: reciprocal_rank @top10 = 88.70
Query evaluation: average_precision @top10 = 82.91
Query evaluation: ndcg @top10 = 78.86
Query evaluation: hit @top10 = 92.08
Query evaluation: query_precision = 38.22
Query evaluation: query_recall = 68.71
Query evaluation: pair_precision = 38.22
Query evaluation: pair_recall = 32.97
Query evaluation: query_f1 = 49.12
Query evaluation: query_f2 = 59.25
Query evaluation: pair_f1 = 35.40
Query evaluation: pair_f2 = 33.90

Saved 404 query results to evaluation/results.CapRetrieval.bge-base-zh-v1.5.top10.json
Saved report to evaluation/report.CapRetrieval.bge-base-zh-v1.5.top10.json
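
With graded labels in {0, 1, 2}, nDCG@10 rewards ranking highly relevant captions early. A standalone sketch of one standard formulation of the metric (not the script's own implementation):

import math

def ndcg_at_k(ranked_labels, all_labels, k=10):
    # ranked_labels: relevance of the retrieved captions, in rank order.
    # all_labels: relevance labels of all candidates for this query.
    dcg = sum(rel / math.log2(rank + 2)
              for rank, rel in enumerate(ranked_labels[:k]))
    ideal = sorted(all_labels, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# e.g. top-3 retrieved labels [2, 0, 1] for a query with labels [2, 1, 0, 0]
print(round(100 * ndcg_at_k([2, 0, 1], [2, 1, 0, 0]), 2))  # ~95.02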

Usage Examples

Evaluate BM25 (currently only supports language zh):

  • python run.py --dataset CapRetrieval --topk 10 --mode bm25 --lang zh
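
BM25 here is purely lexical matching, which is why it needs a language setting for tokenization. A sketch of the classic Okapi BM25 score, not necessarily the script's exact variant:

import math

def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    # doc_freq: term -> number of documents containing the term.
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score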

Evaluate BGE encoders using CLS pooling (default pooling):

  • python run.py --dataset CapRetrieval --topk 10 --model BAAI/bge-base-zh-v1.5

Evaluate GTE multilingual model using CLS pooling:

  • python run.py --dataset CapRetrieval --topk 10 --model Alibaba-NLP/gte-multilingual-base

Evaluate Conan-v1 encoder using default SentenceTransformers setup:

  • python run.py --dataset CapRetrieval --topk 10 --model TencentBAC/Conan-embedding-v1 --pooling use_sentence_transformer

Evaluate E5 encoders using mean pooling, with suggested prompt templates:

  • python run.py --dataset CapRetrieval --topk 10 --model intfloat/multilingual-e5-base --pooling mean --max_len 512 --query_template "query: {text}" --candidate_template "passage: {text}"

Evaluate GTE-Qwen encoders using last token pooling, with the corresponding prompt templates:

  • python run.py --dataset CapRetrieval --topk 10 --model Alibaba-NLP/gte-Qwen2-7B-instruct --pooling last --query_template "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}" --batch_size 8

Evaluate Qwen3 embedding models using last token pooling, with the corresponding prompt templates:

  • python run.py --dataset CapRetrieval --topk 10 --model Qwen/Qwen3-Embedding-8B --padding_side left --pooling last --query_template "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}" --batch_size 8
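
The --query_template and --candidate_template flags presumably wrap each text before encoding via the {text} placeholder, as in a standard Python format string (the query below is just an example):

template = "Instruct: Given an image search query, retrieve relevant image captions\nQuery: {text}"
print(template.format(text="狗玩飞盘"))  # "dog playing frisbee"
# -> Instruct: Given an image search query, retrieve relevant image captions
#    Query: 狗玩飞盘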

Evaluation Scores

| Type    | Model                   | nDCG@10 |
|---------|-------------------------|---------|
| BM25    | Basic BM25              | 66.54   |
| 0.1B    | bge-base-zh-v1.5        | 78.86   |
| 0.1B    | gte-multilingual-base   | 79.67   |
| 0.1B    | multilingual-e5-base    | 76.33   |
| 0.3B    | bge-large-zh-v1.5       | 79.15   |
| 0.3B    | multilingual-e5-large   | 81.01   |
| 0.3B    | Conan-embedding-v1      | 77.04   |
| 0.6B    | Qwen3-Embedding-0.6B    | 81.04   |
| >1B     | gte-Qwen2-1.5B-instruct | 77.35   |
| >1B     | gte-Qwen2-7B-instruct   | 86.55   |
| >1B     | e5-mistral-7b-instruct  | 76.40   |
| >1B     | Qwen3-Embedding-8B      | 84.61   |
| Trained | Out-of-Domain           | 87.23   |
| Trained | In-Domain               | 91.83   |

The trained models (based on bge-base-zh-v1.5) are trained with queries generated by the data generation strategies described in the paper. The in-domain model can be downloaded from Google Drive.

License Agreement

The dataset and trained models are licensed under Apache 2.0.

Citation

@misc{xu2025denseretrieversfailsimple,
      title={Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings}, 
      author={Liyan Xu and Zhenlin Su and Mo Yu and Jiangnan Li and Fandong Meng and Jie Zhou},
      year={2025},
      eprint={2506.08592},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08592}, 
}
