
MultiLongDocRetrieval

MultiLongDocRetrieval (denoted as MLDR) is a multilingual long-document retrieval dataset. For more details, please refer to Shitao/MLDR.

Dense Retrieval

This task has been merged into MTEB, so you can use the mteb tool to run the evaluation directly.
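
For reference, the evaluation can also be driven from Python through the mteb API. A minimal sketch, assuming the task is registered as MultiLongDocRetrieval (the exact argument names vary between mteb versions; newer releases use mteb.get_tasks instead of passing task names to MTEB directly):

# Minimal sketch: evaluate BAAI/bge-m3 on MLDR via the mteb Python API.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
evaluation = MTEB(tasks=["MultiLongDocRetrieval"], task_langs=["en"])
evaluation.run(model, output_folder="./results")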

We also provide a script; you can run it with the following commands:

cd mteb_dense_eval

# Print and Save Evaluation Results with MTEB
python eval_MLDR.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--results_save_path ./results \
--max_query_length 512 \
--max_passage_length 8192 \
--batch_size 256 \
--corpus_batch_size 1 \
--pooling_method cls \
--normalize_embeddings True \
--add_instruction False \
--overwrite False

There are some important parameters:

  • encoder: Name or path of the model to evaluate.

  • languages: The languages you want to evaluate on. Available languages: ar de en es fr hi it ja ko pt ru th zh.

  • max_query_length & max_passage_length: Maximum query length and maximum passage length when encoding.

  • batch_size & corpus_batch_size: Batch sizes for queries and the corpus when encoding. If max_query_length == max_passage_length, you can ignore corpus_batch_size and only set batch_size for convenience. For faster evaluation, set batch_size and corpus_batch_size as large as your GPU memory allows.

  • pooling_method & normalize_embeddings: Follow the settings of the model you are evaluating. For example, BAAI/bge-m3 uses cls and True, intfloat/multilingual-e5-large uses mean and True, and intfloat/e5-mistral-7b-instruct uses last and True. (A small pooling sketch follows this list.)

  • add_instruction: Whether to add an instruction to the query or passage when evaluating. If you set add_instruction=True, you should also set the following parameters appropriately:

    • query_instruction_for_retrieval: the query instruction for retrieval
    • passage_instruction_for_retrieval: the passage instruction for retrieval

    If you only add a query instruction, simply omit the passage_instruction_for_retrieval parameter.

  • overwrite: Whether to overwrite evaluation results.
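
To make the pooling_method and normalize_embeddings options concrete, here is an illustrative sketch (not the actual eval_MLDR.py code) of how cls, mean, and last pooling plus L2 normalization are typically computed from a Hugging Face encoder's outputs:

# Illustrative only: cls / mean / last pooling over last_hidden_state, then optional L2 normalization.
import torch

def pool(last_hidden_state, attention_mask, pooling_method="cls"):
    if pooling_method == "cls":
        return last_hidden_state[:, 0]                           # first ([CLS]) token
    if pooling_method == "mean":
        mask = attention_mask.unsqueeze(-1).float()
        return (last_hidden_state * mask).sum(1) / mask.sum(1)   # mask-aware average
    if pooling_method == "last":
        last_idx = attention_mask.sum(1) - 1                     # index of last real token
        return last_hidden_state[torch.arange(last_hidden_state.size(0)), last_idx]
    raise ValueError(f"unknown pooling_method: {pooling_method}")

def finalize(embeddings, normalize_embeddings=True):
    return torch.nn.functional.normalize(embeddings, dim=-1) if normalize_embeddings else embeddings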

Hybrid Retrieval (Dense & Sparse)

If you want to perform hybrid retrieval with both dense and sparse methods, follow these steps:

  1. Install Java, Pyserini and Faiss (CPU version or GPU version):
# install java (Linux)
apt update
apt install openjdk-11-jdk

# install pyserini
pip install pyserini

# install faiss
## CPU version
conda install -c conda-forge faiss-cpu

## GPU version
conda install -c conda-forge faiss-gpu
  2. Download qrels from Shitao/MLDR (a small parsing sketch follows the download loop):
mkdir -p qrels
cd qrels

splits=(dev test)
langs=(ar de en es fr hi it ja ko pt ru th zh)
for split in ${splits[*]}; do for lang in ${langs[*]}; do wget "https://huggingface.co/datasets/Shitao/MLDR/resolve/main/qrels/qrels.mldr-v1.0-${lang}-${split}.tsv"; done; done;
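
The downloaded qrels follow the usual four-column TREC layout (query-id, an unused column, doc-id, relevance). If you want to inspect them, here is a small parsing sketch under that assumption (verify against the actual files):

# Hedged sketch: load one qrels file, assuming the standard TREC qrels layout.
from collections import defaultdict

def load_qrels(path):
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return qrels

qrels = load_qrels("qrels/qrels.mldr-v1.0-en-test.tsv")
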
  3. Dense retrieval (a Faiss search sketch follows the note below):
cd dense_retrieval

# 1. Generate Corpus Embedding
python step0-generate_embedding.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--index_save_dir ./corpus-index \
--max_passage_length 8192 \
--batch_size 4 \
--fp16 \
--pooling_method cls \
--normalize_embeddings True \
--add_instruction False

# 2. Search Results
python step1-search_results.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--index_save_dir ./corpus-index \
--result_save_dir ./search_results \
--threads 16 \
--hits 1000 \
--pooling_method cls \
--normalize_embeddings True \
--add_instruction False

# 3. Print and Save Evaluation Results
python step2-eval_dense_mldr.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ./search_results \
--qrels_dir ../qrels \
--eval_result_save_dir ./eval_results \
--metrics ndcg@10 \
--pooling_method cls \
--normalize_embeddings True

Note: The evaluation results of this method may differ slightly from those of the MTEB-based method described earlier; this is expected.
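
For intuition, the dense steps amount to embedding the corpus, building an inner-product index, and keeping the top hits per query. A stripped-down sketch with plain faiss and stand-in random vectors (the actual step0/step1 scripts build and query a Pyserini-compatible Faiss index):

# Simplified sketch of dense retrieval: normalized embeddings + inner-product index + top-k search.
import faiss
import numpy as np

dim = 1024                                                   # bge-m3 dense embedding size
corpus_emb = np.random.rand(1000, dim).astype("float32")     # stand-in for real corpus embeddings
query_emb = np.random.rand(8, dim).astype("float32")         # stand-in for real query embeddings
faiss.normalize_L2(corpus_emb)
faiss.normalize_L2(query_emb)

index = faiss.IndexFlatIP(dim)                               # inner product == cosine for normalized vectors
index.add(corpus_emb)
scores, doc_ids = index.search(query_emb, 100)               # the scripts use --hits 1000 here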

  4. Sparse Retrieval (a lexical-scoring sketch follows these commands):
cd sparse_retrieval

# 1. Generate Query and Corpus Sparse Vector
python step0-encode_query-and-corpus.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--save_dir ./encoded_query-and-corpus \
--max_query_length 512 \
--max_passage_length 8192 \
--batch_size 1024 \
--corpus_batch_size 4 \
--pooling_method cls \
--normalize_embeddings True

# 2. Output Search Results
python step1-search_results.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--encoded_query_and_corpus_save_dir ./encoded_query-and-corpus \
--result_save_dir ./search_results \
--threads 16 \
--hits 1000

# 3. Print and Save Evaluation Results
python step2-eval_sparse_mldr.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ./search_results \
--qrels_dir ../qrels \
--eval_result_save_dir ./eval_results \
--metrics ndcg@10 \
--pooling_method cls \
--normalize_embeddings True
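
On the sparse (lexical) side, BGE-M3 assigns a weight to each token of the query and the passage, and a pair is scored by summing the product of the two weights over the tokens they share; the scripts delegate indexing and search to Pyserini impact indexes. A minimal sketch of that scoring rule with made-up weights:

# Sketch of lexical matching: sum of query-weight * passage-weight over shared tokens.
def lexical_score(query_weights: dict, passage_weights: dict) -> float:
    return sum(w * passage_weights[t] for t, w in query_weights.items() if t in passage_weights)

q = {"solar": 1.2, "panel": 0.8, "efficiency": 0.5}   # made-up token weights
p = {"solar": 0.9, "cell": 0.7, "efficiency": 0.4}
print(lexical_score(q, p))                            # 1.2*0.9 + 0.5*0.4 = 1.28
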
  5. Hybrid Retrieval (a score-fusion sketch follows these commands):
cd hybrid_retrieval

# 1. Search Dense and Sparse Results
#    (run the Dense Retrieval and Sparse Retrieval steps above first to produce
#    ../dense_retrieval/search_results and ../sparse_retrieval/search_results)

# 2. Hybrid Dense and Sparse Search Results
python step0-hybrid_search_results.py \
--model_name_or_path BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--dense_search_result_save_dir ../dense_retrieval/search_results \
--sparse_search_result_save_dir ../sparse_retrieval/search_results \
--hybrid_result_save_dir ./search_results \
--top_k 1000 \
--dense_weight 0.2 --sparse_weight 0.8

# 3. Print and Save Evaluation Results
python step1-eval_hybrid_mldr.py \
--model_name_or_path BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ./search_results \
--qrels_dir ../qrels \
--eval_result_save_dir ./eval_results \
--metrics ndcg@10 \
--pooling_method cls \
--normalize_embeddings True
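
The hybrid step is a weighted linear combination of the two runs: for each query-document pair, hybrid_score = dense_weight * dense_score + sparse_weight * sparse_score, keeping the top_k highest-scoring documents. A hedged sketch of that fusion for a single query (the real script also handles reading and writing the search-result files):

# Sketch of dense/sparse score fusion for one query. Documents missing from one run
# are treated as scoring 0.0 there (an assumption; the actual script may differ).
def hybrid(dense_run: dict, sparse_run: dict, dense_weight=0.2, sparse_weight=0.8, top_k=1000):
    fused = {}
    for docid in set(dense_run) | set(sparse_run):
        fused[docid] = (dense_weight * dense_run.get(docid, 0.0)
                        + sparse_weight * sparse_run.get(docid, 0.0))
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]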

MultiVector and All Rerank

If you want to perform multi-vector reranking or all-score reranking on top of the dense retrieval results, follow these steps:

  1. Install Java, Pyserini and Faiss (CPU version or GPU version):
# install java (Linux)
apt update
apt install openjdk-11-jdk

# install pyserini
pip install pyserini

# install faiss
## CPU version
conda install -c conda-forge faiss-cpu

## GPU version
conda install -c conda-forge faiss-gpu
  2. Download qrels from Shitao/MLDR:
mkdir -p qrels
cd qrels

splits=(dev test)
langs=(ar de en es fr hi it ja ko pt ru th zh)
for split in ${splits[*]}; do for lang in ${langs[*]}; do wget "https://huggingface.co/datasets/Shitao/MLDR/resolve/main/qrels/qrels.mldr-v1.0-${lang}-${split}.tsv"; done; done;
  3. Dense retrieval:
cd dense_retrieval

# 1. Generate Corpus Embedding
python step0-generate_embedding.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--index_save_dir ./corpus-index \
--max_passage_length 8192 \
--batch_size 4 \
--fp16 \
--pooling_method cls \
--normalize_embeddings True \
--add_instruction False

# 2. Search Results
python step1-search_results.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--index_save_dir ./corpus-index \
--result_save_dir ./search_results \
--threads 16 \
--hits 1000 \
--pooling_method cls \
--normalize_embeddings True \
--add_instruction False

# 3. Print and Save Evaluation Results
python step2-eval_dense_mldr.py \
--encoder BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ./search_results \
--qrels_dir ../qrels \
--eval_result_save_dir ./eval_results \
--metrics ndcg@10 \
--pooling_method cls \
--normalize_embeddings True

Note: The evaluation results of this method may differ slightly from those of the MTEB-based method described earlier; this is expected.

  4. Rerank search results with multi-vector scores or all scores (a scoring sketch follows the notes below):
cd multi_vector_rerank

# 1. Rerank Search Results
python step0-rerank_results.py \
--encoder BAAI/bge-m3 \
--reranker BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ../dense_retrieval/search_results \
--rerank_result_save_dir ./rerank_results \
--top_k 200 \
--batch_size 4 \
--max_query_length 512 \
--max_passage_length 8192 \
--pooling_method cls \
--normalize_embeddings True \
--dense_weight 0.15 --sparse_weight 0.5 --colbert_weight 0.35 \
--num_shards 1 --shard_id 0 --cuda_id 0

# 2. Print and Save Evaluation Results
python step1-eval_rerank_mldr.py \
--encoder BAAI/bge-m3 \
--reranker BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ./rerank_results \
--qrels_dir ../qrels \
--eval_result_save_dir ./eval_results \
--metrics ndcg@10

Note:

  • You should set dense_weight, sparse_weight and colbert_weight based on the downstream task scenario. If the dense method performs well while the sparse method does not, you can lower sparse_weight and increase dense_weight accordingly.

  • Based on our experience, splitting the sentence pairs to be reranked into several shards and computing scores for each shard on a single GPU tends to be more efficient than using multiple GPUs to compute scores for all pairs at once. Therefore, if your machine has multiple GPUs, you can set num_shards to the number of GPUs and launch one terminal per shard to run the command, with shard_id equal to cuda_id. Make sure shard_id and cuda_id are set appropriately, and that scores for all shards have been computed before moving on to the second step.
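
For intuition, the multi-vector (ColBERT-style) component scores a pair by matching each query token embedding to its most similar passage token embedding and averaging those maxima; the final rerank score is then the weighted sum of the dense, sparse, and colbert scores using the weights above. A rough sketch (illustrative, not the FlagEmbedding implementation):

# Rough sketch of the ColBERT-style multi-vector (MaxSim) score used in reranking.
import torch

def colbert_score(q_vecs: torch.Tensor, p_vecs: torch.Tensor) -> torch.Tensor:
    # q_vecs: (num_query_tokens, dim), p_vecs: (num_passage_tokens, dim), both L2-normalized
    sim = q_vecs @ p_vecs.T              # token-level similarity matrix
    return sim.max(dim=1).values.mean()  # best passage match per query token, averaged

# Final rerank score (using the example weights above):
# score = 0.15 * dense_score + 0.5 * sparse_score + 0.35 * colbert_score
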

  5. (Optional) The 4th step produces all three kinds of scores, saved to rerank_result_save_dir/dense/{encoder}-{reranker}, rerank_result_save_dir/sparse/{encoder}-{reranker} and rerank_result_save_dir/colbert/{encoder}-{reranker}. If you want to try other weights, you don't need to rerun the 4th step; instead, you can use the following script to combine the three kinds of scores directly.
cd multi_vector_rerank

# 1. Hybrid All Search Results
python hybrid_all_results.py \
--encoder BAAI/bge-m3 \
--reranker BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--dense_search_result_save_dir ./rerank_results/dense \
--sparse_search_result_save_dir ./rerank_results/sparse \
--colbert_search_result_save_dir ./rerank_results/colbert \
--hybrid_result_save_dir ./hybrid_search_results \
--top_k 200 \
--dense_weight 0.2 --sparse_weight 0.4 --colbert_weight 0.4

# 2. Print and Save Evaluation Results
python step1-eval_rerank_mldr.py \
--encoder BAAI/bge-m3 \
--reranker BAAI/bge-m3 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ./hybrid_search_results \
--qrels_dir ../qrels \
--eval_result_save_dir ./eval_hybrid_results \
--metrics ndcg@10

BM25 Baseline

We provide two methods of evaluating the BM25 baseline:

  1. Use the same tokenizer as BAAI/bge-m3 (i.e., the XLM-RoBERTa tokenizer); a toy sketch follows these commands:
cd sparse_retrieval

# 1. Output Search Results with BM25 (same tokenizer)
python bm25_baseline_same_tokenizer.py

# 2. Print and Save Evaluation Results
python step2-eval_sparse_mldr.py \
--encoder bm25_same_tokenizer \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ./search_results \
--qrels_dir ../qrels \
--eval_result_save_dir ./eval_results \
--metrics ndcg@10
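
To see what "same tokenizer as BAAI/bge-m3" means in practice, here is a toy sketch that tokenizes text with the XLM-RoBERTa tokenizer and scores it with the third-party rank_bm25 package; the provided script instead pre-tokenizes the corpus this way and indexes it with Pyserini/Lucene:

# Toy illustration of BM25 over XLM-RoBERTa tokens (rank_bm25 is used only for illustration).
from transformers import AutoTokenizer
from rank_bm25 import BM25Okapi

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")     # XLM-RoBERTa tokenizer
corpus = ["a long document about solar panels", "another document about batteries"]
tokenized_corpus = [tokenizer.tokenize(doc) for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query_tokens = tokenizer.tokenize("solar panel efficiency")
print(bm25.get_scores(query_tokens))
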
  2. Use the language analyzer provided by Anserini (Lucene analyzer); a search-side sketch follows these commands:
cd sparse_retrieval

# 1. Output Search Results with BM25
python bm25_baseline.py

# 2. Print and Save Evaluation Results
python step2-eval_sparse_mldr.py \
--encoder bm25 \
--languages ar de en es fr hi it ja ko pt ru th zh \
--search_result_save_dir ./search_results \
--qrels_dir ../qrels \
--eval_result_save_dir ./eval_results \
--metrics ndcg@10
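
The Anserini-analyzer variant relies on Pyserini's per-language Lucene analyzers at both index and search time. A hedged sketch of the search side, assuming a local Lucene index (the path below is hypothetical; bm25_baseline.py manages its own index locations):

# Sketch: search a Lucene index with a language-specific analyzer via Pyserini.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/mldr-zh")   # hypothetical local index path
searcher.set_language("zh")                    # use the Lucene analyzer for the target language
hits = searcher.search("太阳能电池板效率", k=1000)
for hit in hits[:5]:
    print(hit.docid, hit.score)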