Pyserini: BM25 Baseline for MS MARCO Passage Retrieval

This guide contains instructions for running BM25 baselines on the MS MARCO passage ranking task, which is nearly identical to a similar guide in Anserini, except that everything is in Python here (no Java). Note that there is a separate guide for the MS MARCO document ranking task.

Data Prep

We're going to use collections/msmarco-passage/ as the working directory. First, we need to download and extract the MS MARCO passage dataset:

mkdir collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

To confirm, collectionandqueries.tar.gz should have MD5 checksum of 31644046b18952c1386cd4564ba2ae69.
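If you'd like to verify the checksum programmatically, here is a minimal sketch using Python's hashlib (the path assumes the download location above):

import hashlib

# Hash the downloaded archive in 1 MB chunks so we don't load the
# whole file into memory at once.
md5 = hashlib.md5()
with open('collections/msmarco-passage/collectionandqueries.tar.gz', 'rb') as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b''):
        md5.update(chunk)

assert md5.hexdigest() == '31644046b18952c1386cd4564ba2ae69', 'checksum mismatch!'
print('checksum OK')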

Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl

The above script should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl, each with 1M lines (except for the last one, which should have 841,823 lines).
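Each line in these files is a JSON object with an id field and a contents field holding the passage text (Anserini's JsonCollection convention). As a quick sanity check, here is a minimal sketch that inspects the first record; the shard filename docs00.json is assumed from the converter's default naming:

import json

# Peek at the first document of the first shard to confirm the format.
with open('collections/msmarco-passage/collection_jsonl/docs00.json') as f:
    doc = json.loads(f.readline())

print(doc['id'])             # passage id, e.g., '0'
print(doc['contents'][:80])  # first 80 characters of the passage text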

We can now index these docs as a JsonCollection using Anserini:

python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
 -threads 9 -input collections/msmarco-passage/collection_jsonl \
 -index indexes/msmarco-passage/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw

Note that the indexing program simply dispatches command-line arguments to an underlying Java program, and so we use the Java single dash convention, e.g., -index and not --index.

Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.
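To double-check the document count, here is a minimal sketch using Pyserini's IndexReader (assuming the pyserini.index.IndexReader entry point of the Pyserini version this guide targets):

from pyserini.index import IndexReader

# Open the index we just built and print its collection statistics;
# the 'documents' entry should be 8,841,823.
index_reader = IndexReader('indexes/msmarco-passage/lucene-index-msmarco')
print(index_reader.stats())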

Performing Retrieval on the Dev Queries

The dev set contains over 100k queries, so retrieving results for all of them would take a long time. To speed things up, we retrieve only the queries that appear in the qrels file:

python tools/scripts/msmarco/filter_queries.py \
 --qrels collections/msmarco-passage/qrels.dev.small.tsv \
 --queries collections/msmarco-passage/queries.dev.tsv \
 --output collections/msmarco-passage/queries.dev.small.tsv
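The filtering itself is simple: keep only the query lines whose query id appears in the qrels file. A minimal sketch of the same idea, assuming the standard tab-separated layouts (qrels: qid, 0, pid, 1; queries: qid, text):

# Collect the query ids that have at least one judged passage.
with open('collections/msmarco-passage/qrels.dev.small.tsv') as f:
    judged = {line.split()[0] for line in f}

# Keep only the judged queries from the full dev query file.
with open('collections/msmarco-passage/queries.dev.tsv') as fin, \
     open('collections/msmarco-passage/queries.dev.small.tsv', 'w') as fout:
    for line in fin:
        if line.split('\t')[0] in judged:
            fout.write(line)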

The output queries file should contain 6980 lines. We can now perform a retrieval run using this smaller set of queries:

python tools/scripts/msmarco/retrieve.py --hits 1000 --threads 1 \
 --index indexes/msmarco-passage/lucene-index-msmarco \
 --queries collections/msmarco-passage/queries.dev.small.tsv \
 --output runs/run.msmarco-passage.dev.small.tsv

Note that by default, the above script uses BM25 with tuned parameters k1=0.82, b=0.68. The option --hits specifies the number of documents to retrieve per query. Thus, the output file should have approximately 6980 × 1000 = 6.9M lines.

Retrieval speed will vary by machine: On a modern desktop with an SSD, we can get ~0.07 s/query, so the run should finish in under ten minutes. We can perform multi-threaded retrieval by changing the --threads argument.
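The same retrieval can also be run interactively through Pyserini's Python API. A minimal sketch, assuming the SimpleSearcher interface from pyserini.search (the query text below is just an illustration):

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher('indexes/msmarco-passage/lucene-index-msmarco')
# Use the same tuned BM25 parameters as the batch retrieval script.
searcher.set_bm25(k1=0.82, b=0.68)

hits = searcher.search('what is a lobster roll', k=1000)
for i, hit in enumerate(hits[:10]):
    print(f'{i + 1:2} {hit.docid:8} {hit.score:.5f}')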

Evaluating the Results

Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:

python tools/scripts/msmarco/msmarco_eval.py \
 collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.dev.small.tsv

The output should be:

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
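To make the metric concrete: MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage, or 0 if no relevant passage appears in the top 10. Here is a minimal sketch that recomputes it from the run and qrels files, assuming the tab-separated layouts used above (qrels: qid, 0, pid, 1; run: qid, pid, rank):

from collections import defaultdict

# Relevant passage ids for each query.
relevant = defaultdict(set)
with open('collections/msmarco-passage/qrels.dev.small.tsv') as f:
    for line in f:
        qid, _, pid, _ = line.split()
        relevant[qid].add(pid)

# Retrieved passage id at each rank for each query.
retrieved = defaultdict(dict)
with open('runs/run.msmarco-passage.dev.small.tsv') as f:
    for line in f:
        qid, pid, rank = line.split()
        retrieved[qid][int(rank)] = pid

# Reciprocal rank of the first relevant passage within the top 10, else 0.
mrr = 0.0
for qid, by_rank in retrieved.items():
    for rank in range(1, 11):
        if by_rank.get(rank) in relevant[qid]:
            mrr += 1.0 / rank
            break
print('MRR @10:', mrr / len(retrieved))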

We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@10. To do so, we first need to convert the run and qrels files into TREC format:

python tools/scripts/msmarco/convert_msmarco_to_trec_run.py \
 --input runs/run.msmarco-passage.dev.small.tsv \
 --output runs/run.msmarco-passage.dev.small.trec

python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
 --input collections/msmarco-passage/qrels.dev.small.tsv \
 --output collections/msmarco-passage/qrels.dev.small.trec
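For reference, a TREC run file has six whitespace-separated columns per line: query id, the literal Q0, document id, rank, score, and a run tag. Here is a minimal sketch of the run conversion; since the MS MARCO run carries no scores, 1/rank is used below as a monotonically decreasing stand-in, and the run tag is arbitrary:

# Convert a 3-column MS MARCO run (qid, pid, rank) into the 6-column
# TREC run format: qid, Q0, docid, rank, score, run tag.
with open('runs/run.msmarco-passage.dev.small.tsv') as fin, \
     open('runs/run.msmarco-passage.dev.small.trec', 'w') as fout:
    for line in fin:
        qid, pid, rank = line.split()
        fout.write(f'{qid} Q0 {pid} {rank} {1.0 / int(rank):.6f} bm25\n')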

We can then run the trec_eval tool:

tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
 collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec

The output should be:

map                   	all	0.1957
recall_1000           	all	0.8573

Average precision or AP (also called mean average precision, MAP) and recall@1000 (recall at rank 1000) are the two metrics we care about the most. AP captures aspects of both precision and recall in a single metric, and is the most common metric used by information retrieval researchers. On the other hand, recall@1000 provides the upper bound effectiveness of downstream reranking modules (i.e., rerankers are useless if there isn't a relevant document in the results).
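As a small worked example of the two metrics, here is a minimal sketch that computes average precision and recall@1000 for a single query, given its ranked list and its set of relevant passage ids (both made up for illustration):

def average_precision(ranking, relevant, k=1000):
    # Mean of precision@i at each rank i where a relevant document
    # appears, divided by the number of relevant documents.
    hits, precision_sum = 0, 0.0
    for i, docid in enumerate(ranking[:k], start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def recall_at_k(ranking, relevant, k=1000):
    # Fraction of the relevant documents that appear in the top k.
    found = sum(1 for docid in ranking[:k] if docid in relevant)
    return found / len(relevant) if relevant else 0.0

# Toy example: the single relevant passage is retrieved at rank 3.
ranking = ['p9', 'p4', 'p7', 'p2']
relevant = {'p7'}
print(average_precision(ranking, relevant))  # 1/3 ≈ 0.333
print(recall_at_k(ranking, relevant))        # 1.0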

Replication Log