diff --git a/README.md b/README.md
index 225cc43cd..cef9edb8f 100644
--- a/README.md
+++ b/README.md
@@ -393,7 +393,7 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe
 + Reproducing [BM25 baselines on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2.md)
 + Reproducing [DeepImpact experiments for MS MARCO (V1) Passage Ranking](docs/experiments-deepimpact.md)
 + Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil.md)
-+ Reproducing [uniCOIL experiments with TILDE document expansion for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
++ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
 + Reproducing [uniCOIL experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-unicoil.md)

 ### Dense Retrieval
diff --git a/docs/experiments-unicoil-tilde-expansion.md b/docs/experiments-unicoil-tilde-expansion.md
index e1cd2ad5d..2d1c078b5 100644
--- a/docs/experiments-unicoil-tilde-expansion.md
+++ b/docs/experiments-unicoil-tilde-expansion.md
@@ -1,20 +1,19 @@
-# Pyserini: uniCOIL for MS MARCO Passage Ranking with TILDE Passage Expansion
+# Pyserini: uniCOIL (w/ TILDE) for MS MARCO Passage Ranking

-This page describes how to reproduce the uniCOIL experiments in the following papers:
-
-> Jimmy Lin and Xueguang Ma. [A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_.
+This page describes how to reproduce experiments using uniCOIL with TILDE document expansion, as described in the following paper:

 > Shengyao Zhuang and Guido Zuccon. [Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion.](https://arxiv.org/pdf/2108.08513) _arXiv:2108.08513_.

-In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
-Thus, no neural inference is involved.
-For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
+The original uniCOIL model is described here:
+
+> Jimmy Lin and Xueguang Ma. [A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.](https://arxiv.org/abs/2106.14807) _arXiv:2106.14807_.

 Instead of using docTquery-T5 to perform document expansion which is slow and expensive, in this guide, the TILDE model is used to expand the corpus, resulting in a faster and cheaper document expansion process.
 For details of how to use TILDE to expand documents, please see [this guide](https://github.com/ielab/TILDE).

-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md) based on Java.
-Here, we can get _exactly_ the same results from Python.
+In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
+Thus, no neural inference is involved.
+For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

 ## Data Prep

@@ -25,12 +24,12 @@ First, we need to download and extract the MS MARCO passage dataset with uniCOIL
 wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-tilde-expansion-b8.tar -P collections/

 # Alternate mirror
-wget https://vault.cs.uwaterloo.ca/s/Rm6fknT432YdBts/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar
+wget https://vault.cs.uwaterloo.ca/s/6LECmLdiaBoPwrL/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar

 tar -xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/
 ```

-To confirm, `msmarco-passage-unicoil-tilde-expansion-b8.tar` should have MD5 checksum of `a506ef9315c933f9d2040ce3e7385cff`.
+To confirm, `msmarco-passage-unicoil-tilde-expansion-b8.tar` should have MD5 checksum of `be0a786033140ebb7a984a3e155c19ae`.

 ## Indexing

@@ -48,18 +47,7 @@ python -m pyserini.index -collection JsonVectorCollection \
 The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.

 Upon completion, we should have an index with 8,841,823 documents.
-The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around ten minutes.
-
-If you want to save time and skip the indexing step, download the prebuilt index directly:
-
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8.tar.gz -P indexes/
-
-# Alternate mirror
-# wget https://vault.cs.uwaterloo.ca/s/bKbHmN6CjRtmoJq/download -O indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8.tar.gz
-
-tar -xvf indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8.tar.gz -C indexes/
-```
+The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around half an hour.

 ## Retrieval

@@ -67,21 +55,21 @@ We can now run retrieval:

 ```bash
 python -m pyserini.search --topics msmarco-passage-dev-subset \
- --index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
- --encoder ielab/unicoil-tilde200-msmarco-passage \
- --output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
- --impact \
- --hits 1000 --batch 32 --threads 12 \
- --output-format msmarco
+ --index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
+ --encoder ielab/unicoil-tilde200-msmarco-passage \
+ --output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
+ --impact \
+ --hits 1000 --batch 32 --threads 12 \
+ --output-format msmarco
 ```

-Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 15 min.
+Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 20 minutes.
 Note that the important option here is `-impact`, where we specify impact scoring.

 The output is in MS MARCO output format, so we can directly evaluate:

 ```bash
-$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
+$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
 ```

 The results should be as follows:
@@ -94,3 +82,5 @@ QueriesRanked: 6980
 ```

 ## Reproduction Log[*](reproducibility.md)
+
++ Results reproduced by [@lintool](https://github.com/lintool) on 2021-09-08 (commit [`f026b87`](https://github.com/castorini/pyserini/commit/f026b871e0e581743fcb09d1eb309e9698767a8d))
diff --git a/docs/experiments-unicoil.md b/docs/experiments-unicoil.md
index a724583c5..7977903ec 100644
--- a/docs/experiments-unicoil.md
+++ b/docs/experiments-unicoil.md
@@ -1,4 +1,4 @@
-# Pyserini: uniCOIL for MS MARCO Passage Ranking
+# Pyserini: uniCOIL (w/ doc2query-T5) for MS MARCO Passage Ranking

 This page describes how to reproduce the uniCOIL experiments in the following paper:

@@ -43,7 +43,7 @@ python -m pyserini.index -collection JsonVectorCollection \
 The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (which is the default) and the second option says not to apply any additional tokenization on the uniCOIL tokens.

 Upon completion, we should have an index with 8,841,823 documents.
-The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around ten minutes.
+The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 20 minutes.

 ## Retrieval

@@ -71,7 +71,7 @@ $ python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subs
 --output-format msmarco
 ```

-Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 15 min.
+Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 15 minutes.
 Note that the important option here is `-impact`, where we specify impact scoring.

 The output is in MS MARCO output format, so we can directly evaluate: