diff --git a/README.md b/README.md index 09292558c6..7de1872fd8 100644 --- a/README.md +++ b/README.md @@ -57,18 +57,18 @@ See individual pages for details! |---|:---:|:----:|:----:| | **Unsupervised Lexical** | | BoW baselines | [+](docs/regressions-msmarco-passage.md) | [+](docs/regressions-dl19-passage.md) | [+](docs/regressions-dl20-passage.md) | -| Quantized BM25 | [+](docs/regressions-msmarco-passage-bm25-b8.md) | [+](docs/regressions-dl19-passage-bm25-b8.md) | [+](docs/regressions-dl20-passage-bm25-b8.md) | +| Quantized BM25 | [✓](docs/regressions-msmarco-passage-bm25-b8.md) | [✓](docs/regressions-dl19-passage-bm25-b8.md) | [✓](docs/regressions-dl20-passage-bm25-b8.md) | | WP baselines | [+](docs/regressions-msmarco-passage-wp.md) | [+](docs/regressions-dl19-passage-wp.md) | [+](docs/regressions-dl20-passage-wp.md) | | doc2query | [+](docs/regressions-msmarco-passage-doc2query.md) | | doc2query-T5 | [+](docs/regressions-msmarco-passage-docTTTTTquery.md) | [+](docs/regressions-dl19-passage-docTTTTTquery.md) | [+](docs/regressions-dl20-passage-docTTTTTquery.md) | | **Learned sparse lexical (uniCOIL family)** | | uniCOIL noexp | [✓](docs/regressions-msmarco-passage-unicoil-noexp.md) | [✓](docs/regressions-dl19-passage-unicoil-noexp.md) | [✓](docs/regressions-dl20-passage-unicoil-noexp.md) | | uniCOIL with doc2query-T5 | [✓](docs/regressions-msmarco-passage-unicoil.md) | [✓](docs/regressions-dl19-passage-unicoil.md) | [✓](docs/regressions-dl20-passage-unicoil.md) | -| uniCOIL with TILDE | [+](docs/regressions-msmarco-passage-unicoil-tilde-expansion.md) | +| uniCOIL with TILDE | [✓](docs/regressions-msmarco-passage-unicoil-tilde-expansion.md) | | **Learned sparse lexical (other)** | -| DeepImpact | [+](docs/regressions-msmarco-passage-deepimpact.md) | -| SPLADEv2 | [+](docs/regressions-msmarco-passage-distill-splade-max.md) | -| SPLADE-distill CoCodenser-medium | [+](docs/regressions-msmarco-passage-splade-distil-cocodenser-medium.md) | 
[+](docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md) | [+](docs/regressions-dl20-passage-splade-distil-cocodenser-medium.md) | +| DeepImpact | [✓](docs/regressions-msmarco-passage-deepimpact.md) | +| SPLADEv2 | [✓](docs/regressions-msmarco-passage-distill-splade-max.md) | +| SPLADE-distill CoCodenser-medium | [✓](docs/regressions-msmarco-passage-splade-distil-cocodenser-medium.md) | [✓](docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md) | [✓](docs/regressions-dl20-passage-splade-distil-cocodenser-medium.md) | ### MS MARCO V1 Document Corpus @@ -170,16 +170,21 @@ See individual pages for details! ### Available Corpora -| Corpora | Size | Checksum | -|:--------|-----:|:---------| -| [MS MARCO V1 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-noexp.tar) | 2.7 GB | `f17ddd8c7c00ff121c3c3b147d2e17d8` | -| [MS MARCO V1 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil.tar) | 3.4 GB | `78eef752c78c8691f7d61600ceed306f` | -| [MS MARCO V1 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar) | 11 GB | `11b226e1cacd9c8ae0a660fd14cdd710` | -| [MS MARCO V1 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar) | 19 GB | `6a00e2c0c375cb1e52c83ae5ac377ebb` | -| [MS MARCO V2 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar) | 24 GB | `d9cc1ed3049746e68a2c91bf90e5212d` | -| [MS MARCO V2 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar) | 41 GB | `1949a00bfd5e1f1a230a04bbc1f01539` | -| [MS MARCO V2 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar) | 55 GB | `97ba262c497164de1054f357caea0c63` | -| [MS MARCO V2 doc: uniCOIL 
(d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar) | 72 GB | `c5639748c2cbad0152e10b0ebde3b804` | +| Corpora | Size | Checksum | +|:------------------------------------------------------------------------------------------------------------------------------------------------|-------:|:-----------------------------------| +| [MS MARCO V1 passage: Quantized BM25](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar) | 1.2 GB | `0a623e2c97ac6b7e814bf1323a97b435` | +| [MS MARCO V1 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-noexp.tar) | 2.7 GB | `f17ddd8c7c00ff121c3c3b147d2e17d8` | +| [MS MARCO V1 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil.tar) | 3.4 GB | `78eef752c78c8691f7d61600ceed306f` | +| [MS MARCO V1 passage: uniCOIL (TILDE)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-tilde-expansion.tar) | 3.9 GB | `12a9c289d94e32fd63a7d39c9677d75c` | +| [MS MARCO V1 passage: DeepImpact](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar) | 3.6 GB | `73843885b503af3c8b3ee62e5f5a9900` | +| [MS MARCO V1 passage: SPLADEv2](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar) | 9.9 GB | `b5d126f5d9a8e1b3ef3f5cb0ba651725` | +| [MS MARCO V1 passage: SPLADE CoCodenser](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar) | 4.9 GB | `f77239a26d08856e6491a34062893b0c` | +| [MS MARCO V1 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar) | 11 GB | `11b226e1cacd9c8ae0a660fd14cdd710` | +| [MS MARCO V1 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar) | 19 GB | `6a00e2c0c375cb1e52c83ae5ac377ebb` | +| [MS MARCO V2 passage: uniCOIL 
(noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar) | 24 GB | `d9cc1ed3049746e68a2c91bf90e5212d` | +| [MS MARCO V2 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar) | 41 GB | `1949a00bfd5e1f1a230a04bbc1f01539` | +| [MS MARCO V2 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar) | 55 GB | `97ba262c497164de1054f357caea0c63` | +| [MS MARCO V2 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar) | 72 GB | `c5639748c2cbad0152e10b0ebde3b804` | ## Additional Documentation diff --git a/docs/regressions-dl19-passage-bm25-b8.md b/docs/regressions-dl19-passage-bm25-b8.md index 0123d5ee45..51d27c373c 100644 --- a/docs/regressions-dl19-passage-bm25-b8.md +++ b/docs/regressions-dl19-passage-bm25-b8.md @@ -12,25 +12,50 @@ Note that this page is automatically generated from [this template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-bm25-b8 ``` +From any machine, the following command will download the corpus (as quantized BM25 weights) and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression dl19-passage-bm25-b8 +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. 
+ +## Corpus Download + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar -P collections/ +tar xvf collections/msmarco-passage-bm25-b8.tar -C collections/ +``` + +To confirm, `msmarco-passage-bm25-b8.tar` is 1.2 GB and has MD5 checksum `0a623e2c97ac6b7e814bf1323a97b435`. +With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-bm25-b8 \ + --corpus-path collections/msmarco-passage-bm25-b8 +``` + ## Indexing Typical indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ - -input /path/to/msmarco-passage \ + -input /path/to/msmarco-passage-bm25-b8 \ -index indexes/lucene-index.msmarco-passage-bm25-b8/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -impact -pretokenized \ - >& logs/log.msmarco-passage & + >& logs/log.msmarco-passage-bm25-b8 & ``` -The directory `/path/to/msmarco-passage/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document +The directory `/path/to/msmarco-passage-bm25-b8/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -42,22 +67,22 @@ The original data can be found [here](https://trec.nist.gov/data/deep2019.html). 
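The "To confirm ... MD5 checksum" step in the Corpus Download section can be scripted rather than checked by eye. Below is a minimal sketch, assuming the tarball was saved to `collections/` as in the `wget` command above; the helper names `md5_of_file` and `verify_corpus` are hypothetical and not part of `run_regression.py`.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_corpus(path, expected_md5):
    """Return True if the file at `path` has the expected MD5 checksum."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Path and checksum as stated in the Corpus Download section.
    tarball = Path("collections/msmarco-passage-bm25-b8.tar")
    expected = "0a623e2c97ac6b7e814bf1323a97b435"
    if tarball.exists():
        print("checksum ok" if verify_corpus(tarball, expected) else "checksum MISMATCH")
    else:
        print(f"{tarball} not found; download it first")
```

Reading in fixed-size chunks keeps memory use flat, which matters for the larger corpora (up to 72 GB in the Available Corpora table). The same helper should work for any row of that table by swapping the path and checksum.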
After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-bm25-b8/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt \ -topicreader TsvInt \ - -output runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt \ + -output runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt \ -impact & ``` Evaluation can be performed using `trec_eval`: -``` -tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt -tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt -tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt -tools/eval/trec_eval.9.0.4/trec_eval -m recall.1000 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt +```bash +tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt +tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt +tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt +tools/eval/trec_eval.9.0.4/trec_eval -m recall.1000 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt ``` ## Effectiveness @@ -82,3 +107,7 @@ With the above commands, you 
should be able to reproduce the following results: | R@1000 | BM25 (default parameters, quantized 8 bits)| |:-------------------------------------------------------------------------------------------------------------|-----------| | [DL19 (Passage)](https://trec.nist.gov/data/deep2019.html) | 0.7639 | + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](../src/main/resources/docgen/templates/dl19-passage-bm25-b8.template) and run `bin/build.sh` to rebuild the documentation. diff --git a/docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md b/docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md index 71a45e0200..8a32999d52 100644 --- a/docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md +++ b/docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md @@ -13,41 +13,43 @@ Note that this page is automatically generated from [this template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-splade-distil-cocodenser-medium ``` -## Corpus - -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with infrerence applied to generate the lexical representations). +We make available a version of the MS MARCO passage corpus that has already been processed with SPLADE-distil CoCodenser Medium, i.e., performed model inference on every document and stored the output sparse vectors. Thus, no neural inference is involved. -For details on how to train SPLADE-distil CoCodenser Medium and perform inference, please see [guide provided by Naver Labs Europe](https://github.com/naver/splade/tree/main/anserini_evaluation). 
-Download the corpus and unpack into `collections/`: +From any machine, the following command will download the corpus and perform the complete regression, end to end: +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression dl19-passage-splade-distil-cocodenser-medium ``` -wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/ -tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ -``` +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. -To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `54a81e855a7678bc83ecb3ecf1ac5c1c`. +## Corpus Download -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: +Download the corpus and unpack into `collections/`: +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/ +tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ ``` -python src/main/python/run_regression.py --index --verify --search \ - --regression dl19-passage-splade-distil-cocodenser-medium \ + +To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `f77239a26d08856e6491a34062893b0c`. +With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-splade-distil-cocodenser-medium \ --corpus-path collections/msmarco-passage-splade_distil_cocodenser_medium ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. 
- ## Indexing Sample indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-splade_distil_cocodenser_medium \ @@ -72,7 +74,7 @@ The original data can be found [here](https://trec.nist.gov/data/deep2019.html). After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-splade_distil_cocodenser_medium/ \ -topics src/main/resources/topics-and-qrels/topics.dl19-passage.splade_distil_cocodenser_medium.tsv.gz \ @@ -83,7 +85,7 @@ target/appassembler/bin/SearchCollection \ Evaluation can be performed using `trec_eval`: -``` +```bash tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl19-passage.splade_distil_cocodenser_medium.txt tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl19-passage.splade_distil_cocodenser_medium.txt tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl19-passage.splade_distil_cocodenser_medium.txt diff --git a/docs/regressions-dl20-passage-bm25-b8.md b/docs/regressions-dl20-passage-bm25-b8.md index c6292895cb..f3ecb70376 100644 --- a/docs/regressions-dl20-passage-bm25-b8.md +++ b/docs/regressions-dl20-passage-bm25-b8.md @@ -12,25 +12,50 @@ Note that this page is automatically generated from [this template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python 
src/main/python/run_regression.py --index --verify --search --regression dl20-passage-bm25-b8 ``` +From any machine, the following command will download the corpus (as quantized BM25 weights) and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression dl20-passage-bm25-b8 +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar -P collections/ +tar xvf collections/msmarco-passage-bm25-b8.tar -C collections/ +``` + +To confirm, `msmarco-passage-bm25-b8.tar` is 1.2 GB and has MD5 checksum `0a623e2c97ac6b7e814bf1323a97b435`. +With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression dl20-passage-bm25-b8 \ + --corpus-path collections/msmarco-passage-bm25-b8 +``` + ## Indexing Typical indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ - -input /path/to/msmarco-passage \ + -input /path/to/msmarco-passage-bm25-b8 \ -index indexes/lucene-index.msmarco-passage-bm25-b8/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -impact -pretokenized \ - >& logs/log.msmarco-passage & + >& logs/log.msmarco-passage-bm25-b8 & ``` -The directory `/path/to/msmarco-passage/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document +The directory `/path/to/msmarco-passage-bm25-b8/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document For additional details, see explanation of [common indexing 
options](common-indexing-options.md). @@ -42,22 +67,22 @@ The original data can be found [here](https://trec.nist.gov/data/deep2020.html). After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-bm25-b8/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.txt \ -topicreader TsvInt \ - -output runs/run.msmarco-passage.bm25-b8.topics.dl20.txt \ + -output runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl20.txt \ -impact & ``` Evaluation can be performed using `trec_eval`: -``` -tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl20.txt -tools/eval/trec_eval.9.0.4/trec_eval -m recall.1000 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl20.txt +```bash +tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl20.txt +tools/eval/trec_eval.9.0.4/trec_eval -m recall.1000 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl20.txt ``` ## Effectiveness 
@@ -82,3 +107,7 @@ With the above commands, you should be able to reproduce the following results: | R@1000 | BM25 (default parameters, quantized 8 bits)| |:-------------------------------------------------------------------------------------------------------------|-----------| | [DL20 (Passage)](https://trec.nist.gov/data/deep2020.html) | 0.8119 | + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](../src/main/resources/docgen/templates/dl20-passage-bm25-b8.template) and run `bin/build.sh` to rebuild the documentation. diff --git a/docs/regressions-dl20-passage-splade-distil-cocodenser-medium.md b/docs/regressions-dl20-passage-splade-distil-cocodenser-medium.md index a645205984..8e8a4daae5 100644 --- a/docs/regressions-dl20-passage-splade-distil-cocodenser-medium.md +++ b/docs/regressions-dl20-passage-splade-distil-cocodenser-medium.md @@ -13,41 +13,43 @@ Note that this page is automatically generated from [this template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression dl20-passage-splade-distil-cocodenser-medium ``` -## Corpus - -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with infrerence applied to generate the lexical representations). +We make available a version of the MS MARCO passage corpus that has already been processed with SPLADE-distil CoCodenser Medium, i.e., performed model inference on every document and stored the output sparse vectors. Thus, no neural inference is involved. -For details on how to train SPLADE-distil CoCodenser Medium and perform inference, please see [guide provided by Naver Labs Europe](https://github.com/naver/splade/tree/main/anserini_evaluation). 
-Download the corpus and unpack into `collections/`: +From any machine, the following command will download the corpus and perform the complete regression, end to end: +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression dl20-passage-splade-distil-cocodenser-medium ``` -wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/ -tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ -``` +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. -To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `54a81e855a7678bc83ecb3ecf1ac5c1c`. +## Corpus Download -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: +Download the corpus and unpack into `collections/`: +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/ +tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ ``` -python src/main/python/run_regression.py --index --verify --search \ - --regression dl20-passage-splade-distil-cocodenser-medium \ + +To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `f77239a26d08856e6491a34062893b0c`. +With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression dl20-passage-splade-distil-cocodenser-medium \ --corpus-path collections/msmarco-passage-splade_distil_cocodenser_medium ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. 
- ## Indexing Sample indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-splade_distil_cocodenser_medium \ @@ -72,7 +74,7 @@ The original data can be found [here](https://trec.nist.gov/data/deep2020.html). After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-splade_distil_cocodenser_medium/ \ -topics src/main/resources/topics-and-qrels/topics.dl20.splade_distil_cocodenser_medium.tsv.gz \ @@ -83,7 +85,7 @@ target/appassembler/bin/SearchCollection \ Evaluation can be performed using `trec_eval`: -``` +```bash tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl20.splade_distil_cocodenser_medium.txt tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl20.splade_distil_cocodenser_medium.txt tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl20-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl20.splade_distil_cocodenser_medium.txt diff --git a/docs/regressions-msmarco-passage-bm25-b8.md b/docs/regressions-msmarco-passage-bm25-b8.md index abcdda0055..b3721a1629 100644 --- a/docs/regressions-msmarco-passage-bm25-b8.md +++ b/docs/regressions-msmarco-passage-bm25-b8.md @@ -10,25 +10,50 @@ Note that this page is automatically generated from [this template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python 
src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-bm25-b8 ``` +From any machine, the following command will download the corpus (as quantized BM25 weights) and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression msmarco-passage-bm25-b8 +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar -P collections/ +tar xvf collections/msmarco-passage-bm25-b8.tar -C collections/ +``` + +To confirm, `msmarco-passage-bm25-b8.tar` is 1.2 GB and has MD5 checksum `0a623e2c97ac6b7e814bf1323a97b435`. +With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-bm25-b8 \ + --corpus-path collections/msmarco-passage-bm25-b8 +``` + ## Indexing Typical indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ - -input /path/to/msmarco-passage \ + -input /path/to/msmarco-passage-bm25-b8 \ -index indexes/lucene-index.msmarco-passage-bm25-b8/ \ -generator DefaultLuceneDocumentGenerator \ -threads 9 -impact -pretokenized \ - >& logs/log.msmarco-passage & + >& logs/log.msmarco-passage-bm25-b8 & ``` -The directory `/path/to/msmarco-passage/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document +The directory `/path/to/msmarco-passage-bm25-b8/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document For additional details, see explanation of [common indexing 
options](common-indexing-options.md). @@ -39,22 +64,22 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-bm25-b8/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \ -topicreader TsvInt \ - -output runs/run.msmarco-passage.bm25-b8.topics.msmarco-passage.dev-subset.txt \ + -output runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.msmarco-passage.dev-subset.txt \ -impact & ``` Evaluation can be performed using `trec_eval`: -``` -tools/eval/trec_eval.9.0.4/trec_eval -c -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25-b8.topics.msmarco-passage.dev-subset.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25-b8.topics.msmarco-passage.dev-subset.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25-b8.topics.msmarco-passage.dev-subset.txt -tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25-b8.topics.msmarco-passage.dev-subset.txt +```bash +tools/eval/trec_eval.9.0.4/trec_eval -c -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.msmarco-passage.dev-subset.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.msmarco-passage.dev-subset.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 
src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.msmarco-passage.dev-subset.txt +tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.msmarco-passage.dev-subset.txt ``` ## Effectiveness @@ -79,3 +104,7 @@ With the above commands, you should be able to reproduce the following results: | R@1000 | BM25 (default parameters, quantized 8 bits)| |:-------------------------------------------------------------------------------------------------------------|-----------| | [MS MARCO Passage: Dev](https://github.com/microsoft/MSMARCO-Passage-Ranking) | 0.8562 | + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](../src/main/resources/docgen/templates/msmarco-passage-bm25-b8.template) and run `bin/build.sh` to rebuild the documentation. diff --git a/docs/regressions-msmarco-passage-deepimpact.md b/docs/regressions-msmarco-passage-deepimpact.md index 6cc12c2290..78e3c1592e 100644 --- a/docs/regressions-msmarco-passage-deepimpact.md +++ b/docs/regressions-msmarco-passage-deepimpact.md @@ -12,39 +12,42 @@ Note that this page is automatically generated from [this template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-deepimpact ``` -## Corpus +We make available a version of the MS MARCO passage corpus that has already been processed with DeepImpact, i.e., we have applied neural inference and stored the output sparse vectors. 
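To make "stored the output sparse vectors" concrete, here is a rough sketch of inspecting one line of a `jsonl` file of the kind `JsonVectorCollection` reads. The record is invented for illustration: the field names (`id`, `contents`, `vector`) and the term weights are assumptions rather than values copied from the downloadable corpus, so check an actual file before relying on them.

```python
import json

# Hypothetical record, for illustration only: the field names ("id", "contents",
# "vector") and the weights are assumptions, not copied from the real corpus.
SAMPLE_LINE = '{"id": "0", "contents": "", "vector": {"manhattan": 12, "project": 9, "atomic": 7}}'


def top_terms(jsonl_line, k=2):
    """Parse one JSON line and return its k highest-weighted (term, weight) pairs."""
    record = json.loads(jsonl_line)
    weights = record["vector"]
    # Sort terms by descending weight and keep the first k.
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]


print(top_terms(SAMPLE_LINE))
```

Because the weights are precomputed and stored as integers, indexing with `-impact -pretokenized` can use them directly, which fits the removed note that no neural inference is involved at indexing time.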
+ +From any machine, the following command will download the corpus and perform the complete regression, end to end: -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). -Thus, no neural inference is involved. +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression msmarco-passage-deepimpact +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download Download the corpus and unpack into `collections/`: ```bash wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar -P collections/ - tar xvf collections/msmarco-passage-deepimpact.tar -C collections/ ``` -To confirm, `msmarco-passage-deepimpact.tar` is 3.6 GB and has MD5 checksum `fe827eb13ca3270bebe26b3f6b99f550`. - -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: +To confirm, `msmarco-passage-deepimpact.tar` is 3.6 GB and has MD5 checksum `73843885b503af3c8b3ee62e5f5a9900`. +With the corpus downloaded, the following command will perform the remaining steps below: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-deepimpact \ --corpus-path collections/msmarco-passage-deepimpact ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. 
- ## Indexing Sample indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-deepimpact \ @@ -68,7 +71,7 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-deepimpact/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.deepimpact.tsv.gz \ @@ -79,7 +82,7 @@ target/appassembler/bin/SearchCollection \ Evaluation can be performed using `trec_eval`: -``` +```bash tools/eval/trec_eval.9.0.4/trec_eval -c -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.txt tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.txt tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-deepimpact.deepimpact.topics.msmarco-passage.dev-subset.deepimpact.txt diff --git a/docs/regressions-msmarco-passage-distill-splade-max.md b/docs/regressions-msmarco-passage-distill-splade-max.md index 1f2d50b27c..4811202c72 100644 --- a/docs/regressions-msmarco-passage-distill-splade-max.md +++ b/docs/regressions-msmarco-passage-distill-splade-max.md @@ -12,42 +12,43 @@ Note that this page is automatically generated from [this template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-distill-splade-max ``` 
-## Corpus - -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +We make available a version of the MS MARCO passage corpus that has already been processed with DistilSPLADE-max, i.e., we have performed model inference on every document and stored the output sparse vectors. Thus, no neural inference is involved. -Download the corpus and unpack into `collections/`: +From any machine, the following command will download the corpus and perform the complete regression, end to end: ```bash -wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/ - -# Alternate mirror: -# wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar - -tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/ +python src/main/python/run_regression.py --download --index --verify --search --regression msmarco-passage-distill-splade-max ``` -To confirm, `msmarco-passage-distill-splade-max.tar` is 9.9 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`. +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: +Download the corpus and unpack into `collections/`: +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/ +tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/ ``` + +To confirm, `msmarco-passage-distill-splade-max.tar` is 9.9 GB and has MD5 checksum `b5d126f5d9a8e1b3ef3f5cb0ba651725`.
+With the corpus downloaded, the following command will perform the remaining steps below: + +```bash python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-distill-splade-max \ --corpus-path collections/msmarco-passage-distill-splade-max ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. - ## Indexing Sample indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-distill-splade-max \ @@ -71,7 +72,7 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-distill-splade-max/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz \ @@ -82,7 +83,7 @@ target/appassembler/bin/SearchCollection \ Evaluation can be performed using `trec_eval`: -``` +```bash tools/eval/trec_eval.9.0.4/trec_eval -c -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.txt tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.txt tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-distill-splade-max.distill-splade-max.topics.msmarco-passage.dev-subset.distill-splade-max.txt diff --git a/docs/regressions-msmarco-passage-splade-distil-cocodenser-medium.md b/docs/regressions-msmarco-passage-splade-distil-cocodenser-medium.md index 
9a1433403d..fed1666eba 100644 --- a/docs/regressions-msmarco-passage-splade-distil-cocodenser-medium.md +++ b/docs/regressions-msmarco-passage-splade-distil-cocodenser-medium.md @@ -10,16 +10,22 @@ Note that this page is automatically generated from [this template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: +```bash +python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-splade-distil-cocodenser-medium ``` -python src/main/python/run_regression.py --index --verify --search \ - --regression msmarco-passage-splade-distil-cocodenser-medium + +We make available a version of the MS MARCO passage corpus that has already been processed with SPLADE-distil CoCodenser Medium, i.e., we have performed model inference on every document and stored the output sparse vectors. +Thus, no neural inference is involved. + +From any machine, the following command will download the corpus and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression msmarco-passage-splade-distil-cocodenser-medium ``` -## Corpus +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). -Thus, no neural inference is involved. -For details on how to train SPLADE-distil CoCodenser Medium and perform inference, please see [guide provided by Naver Labs Europe](https://github.com/naver/splade/tree/main/anserini_evaluation).
+## Corpus Download Download the corpus and unpack into `collections/`: @@ -28,23 +34,19 @@ wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_di tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ ``` -To confirm, `msmarco.tar` is 4.9 GB and has MD5 checksum `54a81e855a7678bc83ecb3ecf1ac5c1c`. +To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `f77239a26d08856e6491a34062893b0c`. +With the corpus downloaded, the following command will perform the remaining steps below: -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: - -``` -python src/main/python/run_regression.py --index --verify --search \ - --regression msmarco-passage-splade-distil-cocodenser-medium \ +```bash +python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-splade-distil-cocodenser-medium \ --corpus-path collections/msmarco-passage-splade_distil_cocodenser_medium ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. 
- ## Indexing Sample indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-splade_distil_cocodenser_medium \ @@ -68,7 +70,7 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-splade_distil_cocodenser_medium/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.splade_distil_cocodenser_medium.tsv.gz \ @@ -79,7 +81,7 @@ target/appassembler/bin/SearchCollection \ Evaluation can be performed using `trec_eval`: -``` +```bash tools/eval/trec_eval.9.0.4/trec_eval -c -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.msmarco-passage.dev-subset.splade_distil_cocodenser_medium.txt tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.msmarco-passage.dev-subset.splade_distil_cocodenser_medium.txt tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.msmarco-passage.dev-subset.splade_distil_cocodenser_medium.txt diff --git a/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md b/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md index aeef346948..acf2a98ecb 100644 --- a/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md +++ b/docs/regressions-msmarco-passage-unicoil-tilde-expansion.md @@ -12,39 +12,43 @@ Note that this page is automatically generated from [this 
template](../src/main/ From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-unicoil-tilde-expansion ``` -## Corpus - -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL + TILDE expansions, i.e., we have performed model inference on every document and stored the output sparse vectors. Thus, no neural inference is involved. +From any machine, the following command will download the corpus and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression msmarco-passage-unicoil-tilde-expansion +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download + Download the corpus and unpack into `collections/`: ```bash wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-tilde-expansion.tar -P collections/ - tar xvf collections/msmarco-passage-unicoil-tilde-expansion.tar -C collections/ ``` -To confirm, `msmarco-passage-unicoil-tilde-expansion.tar` is 3.9 GB and has MD5 checksum `1685aee10071441987ad87f2e91f1706`. +To confirm, `msmarco-passage-unicoil-tilde-expansion.tar` is 3.9 GB and has MD5 checksum `12a9c289d94e32fd63a7d39c9677d75c`.
+With the corpus downloaded, the following command will perform the remaining steps below: -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: - -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression msmarco-passage-unicoil-tilde-expansion \ --corpus-path collections/msmarco-passage-unicoil-tilde-expansion ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. - ## Indexing Sample indexing command: -``` +```bash target/appassembler/bin/IndexCollection \ -collection JsonVectorCollection \ -input /path/to/msmarco-passage-unicoil-tilde-expansion \ @@ -68,7 +72,7 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash target/appassembler/bin/SearchCollection \ -index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion/ \ -topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \ @@ -79,7 +83,7 @@ target/appassembler/bin/SearchCollection \ Evaluation can be performed using `trec_eval`: -``` +```bash tools/eval/trec_eval.9.0.4/trec_eval -c -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.txt tools/eval/trec_eval.9.0.4/trec_eval -c -M 10 -m recip_rank src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.txt tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.100 src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt 
runs/run.msmarco-passage-unicoil-tilde-expansion.unicoil-tilde-expansion.topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.txt diff --git a/src/main/resources/docgen/templates/dl19-passage-bm25-b8.template b/src/main/resources/docgen/templates/dl19-passage-bm25-b8.template index 68a7358090..f37c443ed4 100644 --- a/src/main/resources/docgen/templates/dl19-passage-bm25-b8.template +++ b/src/main/resources/docgen/templates/dl19-passage-bm25-b8.template @@ -12,19 +12,44 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` +From any machine, the following command will download the corpus (as quantized BM25 weights) and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar -P collections/ +tar xvf collections/msmarco-passage-bm25-b8.tar -C collections/ +``` + +To confirm, `msmarco-passage-bm25-b8.tar` is 1.2 GB and has MD5 checksum `0a623e2c97ac6b7e814bf1323a97b435`. 
+With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ + --corpus-path collections/${corpus} +``` + ## Indexing Typical indexing command: -``` +```bash ${index_cmds} ``` -The directory `/path/to/msmarco-passage/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document +The directory `/path/to/${corpus}/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -36,13 +61,13 @@ The original data can be found [here](https://trec.nist.gov/data/deep2019.html). After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` @@ -51,3 +76,7 @@ ${eval_cmds} With the above commands, you should be able to reproduce the following results: ${effectiveness} + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](${template}) and run `bin/build.sh` to rebuild the documentation. 
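The "Corpus Download" section added above asks readers to confirm the tarball against a stated MD5 checksum. A minimal sketch of that check in Python follows, assuming the tarball was saved under `collections/` as in the `wget` command. The helper name `md5_of_file` is hypothetical and is not part of Anserini or `run_regression.py`.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte tarballs.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # File name and expected checksum as stated in the template above.
    tarball = Path("collections/msmarco-passage-bm25-b8.tar")
    expected = "0a623e2c97ac6b7e814bf1323a97b435"
    if tarball.exists():
        actual = md5_of_file(tarball)
        print("checksum OK" if actual == expected else f"checksum MISMATCH: {actual}")
    else:
        print(f"{tarball} not found; download it first")
```

The same helper should work for the other tarballs in this change by swapping in their file names and checksums.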
diff --git a/src/main/resources/docgen/templates/dl19-passage-splade-distil-cocodenser-medium.template b/src/main/resources/docgen/templates/dl19-passage-splade-distil-cocodenser-medium.template index 6fa6201b9a..d368478bf9 100644 --- a/src/main/resources/docgen/templates/dl19-passage-splade-distil-cocodenser-medium.template +++ b/src/main/resources/docgen/templates/dl19-passage-splade-distil-cocodenser-medium.template @@ -13,41 +13,43 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` -## Corpus - -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with infrerence applied to generate the lexical representations). +We make available a version of the MS MARCO passage corpus that has already been processed with SPLADE-distil CoCodenser Medium, i.e., we have performed model inference on every document and stored the output sparse vectors. Thus, no neural inference is involved. -For details on how to train SPLADE-distil CoCodenser Medium and perform inference, please see [guide provided by Naver Labs Europe](https://github.com/naver/splade/tree/main/anserini_evaluation).
-Download the corpus and unpack into `collections/`: +From any machine, the following command will download the corpus and perform the complete regression, end to end: +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} ``` -wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/ -tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ -``` +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. -To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `54a81e855a7678bc83ecb3ecf1ac5c1c`. +## Corpus Download -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: +Download the corpus and unpack into `collections/`: +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/ +tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ ``` -python src/main/python/run_regression.py --index --verify --search \ - --regression ${test_name} \ + +To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `f77239a26d08856e6491a34062893b0c`. +With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ --corpus-path collections/${corpus} ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. - ## Indexing Sample indexing command: -``` +```bash ${index_cmds} ``` @@ -66,13 +68,13 @@ The original data can be found [here](https://trec.nist.gov/data/deep2019.html). 
After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` diff --git a/src/main/resources/docgen/templates/dl20-passage-bm25-b8.template b/src/main/resources/docgen/templates/dl20-passage-bm25-b8.template index bd5190b080..3e5510e725 100644 --- a/src/main/resources/docgen/templates/dl20-passage-bm25-b8.template +++ b/src/main/resources/docgen/templates/dl20-passage-bm25-b8.template @@ -12,19 +12,44 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` +From any machine, the following command will download the corpus (as quantized BM25 weights) and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar -P collections/ +tar xvf collections/msmarco-passage-bm25-b8.tar -C collections/ +``` + +To confirm, `msmarco-passage-bm25-b8.tar` is 1.2 GB and has MD5 checksum `0a623e2c97ac6b7e814bf1323a97b435`. 
+With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ + --corpus-path collections/${corpus} +``` + ## Indexing Typical indexing command: -``` +```bash ${index_cmds} ``` -The directory `/path/to/msmarco-passage/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document +The directory `/path/to/${corpus}/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -36,13 +61,13 @@ The original data can be found [here](https://trec.nist.gov/data/deep2020.html). After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` @@ -51,3 +76,7 @@ ${eval_cmds} With the above commands, you should be able to reproduce the following results: ${effectiveness} + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](${template}) and run `bin/build.sh` to rebuild the documentation. 
diff --git a/src/main/resources/docgen/templates/dl20-passage-splade-distil-cocodenser-medium.template b/src/main/resources/docgen/templates/dl20-passage-splade-distil-cocodenser-medium.template index ec107ac111..fbc2e1ed45 100644 --- a/src/main/resources/docgen/templates/dl20-passage-splade-distil-cocodenser-medium.template +++ b/src/main/resources/docgen/templates/dl20-passage-splade-distil-cocodenser-medium.template @@ -13,41 +13,43 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` -## Corpus - -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with infrerence applied to generate the lexical representations). +We make available a version of the MS MARCO passage corpus that has already been processed with SPLADE-distil CoCodenser Medium, i.e., we have performed model inference on every document and stored the output sparse vectors. Thus, no neural inference is involved. -For details on how to train SPLADE-distil CoCodenser Medium and perform inference, please see [guide provided by Naver Labs Europe](https://github.com/naver/splade/tree/main/anserini_evaluation).
-Download the corpus and unpack into `collections/`: +From any machine, the following command will download the corpus and perform the complete regression, end to end: +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} ``` -wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/ -tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ -``` +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. -To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `54a81e855a7678bc83ecb3ecf1ac5c1c`. +## Corpus Download -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: +Download the corpus and unpack into `collections/`: +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/ +tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ ``` -python src/main/python/run_regression.py --index --verify --search \ - --regression ${test_name} \ + +To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `f77239a26d08856e6491a34062893b0c`. +With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ --corpus-path collections/${corpus} ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. - ## Indexing Sample indexing command: -``` +```bash ${index_cmds} ``` @@ -66,13 +68,13 @@ The original data can be found [here](https://trec.nist.gov/data/deep2020.html). 
After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` diff --git a/src/main/resources/docgen/templates/msmarco-passage-bm25-b8.template b/src/main/resources/docgen/templates/msmarco-passage-bm25-b8.template index 218782bc2d..43eeadacf8 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-bm25-b8.template +++ b/src/main/resources/docgen/templates/msmarco-passage-bm25-b8.template @@ -10,19 +10,44 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` +From any machine, the following command will download the corpus (as quantized BM25 weights) and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download + +Download the corpus and unpack into `collections/`: + +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar -P collections/ +tar xvf collections/msmarco-passage-bm25-b8.tar -C collections/ +``` + +To confirm, `msmarco-passage-bm25-b8.tar` is 1.2 GB and has MD5 checksum `0a623e2c97ac6b7e814bf1323a97b435`. 
+With the corpus downloaded, the following command will perform the remaining steps below: + +```bash +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ + --corpus-path collections/${corpus} +``` + ## Indexing Typical indexing command: -``` +```bash ${index_cmds} ``` -The directory `/path/to/msmarco-passage/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document +The directory `/path/to/${corpus}/` should be a directory containing `jsonl` files containing quantized BM25 vectors for every document For additional details, see explanation of [common indexing options](common-indexing-options.md). @@ -33,13 +58,13 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` @@ -48,3 +73,7 @@ ${eval_cmds} With the above commands, you should be able to reproduce the following results: ${effectiveness} + +## Reproduction Log[*](reproducibility.md) + +To add to this reproduction log, modify [this template](${template}) and run `bin/build.sh` to rebuild the documentation. 
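The templates in this change rely on `${test_name}` and `${corpus}` placeholders that are filled in when the documentation is rebuilt. Purely as an illustration of that substitution, here is a sketch using Python's `string.Template`, which happens to share the `${name}` syntax. It is not Anserini's actual doc generator, and the command is collapsed onto one line for brevity. The values are the ones visible in the generated DeepImpact page.

```python
from string import Template

# Command line as it appears in the templates (line continuation removed).
template_line = Template(
    "python src/main/python/run_regression.py --index --verify --search "
    "--regression ${test_name} --corpus-path collections/${corpus}"
)

# Placeholder values as they appear in the generated DeepImpact page.
values = {
    "test_name": "msmarco-passage-deepimpact",
    "corpus": "msmarco-passage-deepimpact",
}

# substitute() raises KeyError if a placeholder has no value, which makes
# a missing variable visible instead of leaving "${...}" in the output.
rendered = template_line.substitute(values)
print(rendered)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: an unfilled placeholder in a generated page is more likely a mistake than something to pass through silently.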
diff --git a/src/main/resources/docgen/templates/msmarco-passage-deepimpact.template b/src/main/resources/docgen/templates/msmarco-passage-deepimpact.template index 53286ead77..db6f41ea9d 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-deepimpact.template +++ b/src/main/resources/docgen/templates/msmarco-passage-deepimpact.template @@ -12,39 +12,42 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` -## Corpus +We make available a version of the MS MARCO passage corpus that has already been processed with DeepImpact, i.e., we have applied neural inference and stored the output sparse vectors. + +From any machine, the following command will download the corpus and perform the complete regression, end to end: -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). -Thus, no neural inference is involved. +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download Download the corpus and unpack into `collections/`: ```bash wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar -P collections/ - tar xvf collections/msmarco-passage-deepimpact.tar -C collections/ ``` -To confirm, `msmarco-passage-deepimpact.tar` is 3.6 GB and has MD5 checksum `fe827eb13ca3270bebe26b3f6b99f550`. 
- -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: +To confirm, `msmarco-passage-deepimpact.tar` is 3.6 GB and has MD5 checksum `73843885b503af3c8b3ee62e5f5a9900`. +With the corpus downloaded, the following command will perform the remaining steps below: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ --corpus-path collections/${corpus} ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. - ## Indexing Sample indexing command: -``` +```bash ${index_cmds} ``` @@ -62,13 +65,13 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` diff --git a/src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template b/src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template index 58e67fab5a..e8568268dc 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template +++ b/src/main/resources/docgen/templates/msmarco-passage-distill-splade-max.template @@ -12,42 +12,43 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` -## Corpus - -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). 
+We make available a version of the MS MARCO passage corpus that has already been processed with DistilSPLADE-max, i.e., we have performed model inference on every document and stored the output sparse vectors. Thus, no neural inference is involved. -Download the corpus and unpack into `collections/`: +From any machine, the following command will download the corpus and perform the complete regression, end to end: ```bash -wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/ - -# Alternate mirror: -# wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar - -tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/ +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} ``` -To confirm, `msmarco-passage-distill-splade-max.tar` is 9.9 GB and has MD5 checksum `95b89a7dfd88f3685edcc2d1ffb120d1`. +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. + +## Corpus Download -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: +Download the corpus and unpack into `collections/`: +```bash +wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/ +tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/ ``` + +To confirm, `msmarco-passage-distill-splade-max.tar` is 9.9 GB and has MD5 checksum `b5d126f5d9a8e1b3ef3f5cb0ba651725`.
+With the corpus downloaded, the following command will perform the remaining steps below: + +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ --corpus-path collections/${corpus} ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. - ## Indexing Sample indexing command: -``` +```bash ${index_cmds} ``` @@ -65,13 +66,13 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` diff --git a/src/main/resources/docgen/templates/msmarco-passage-splade-distil-cocodenser-medium.template b/src/main/resources/docgen/templates/msmarco-passage-splade-distil-cocodenser-medium.template index faa004dc33..cf2561114d 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-splade-distil-cocodenser-medium.template +++ b/src/main/resources/docgen/templates/msmarco-passage-splade-distil-cocodenser-medium.template @@ -10,16 +10,22 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: +```bash +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` -python src/main/python/run_regression.py --index --verify --search \ - --regression ${test_name} + +We make available a version of the MS MARCO passage corpus that has already been processed with SPLADE-distil CoCodenser Medium, i.e., we have performed model inference on every document and stored the output sparse vectors. +Thus, no neural inference is involved.
+ +From any machine, the following command will download the corpus and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} ``` -## Corpus +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results. -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). -Thus, no neural inference is involved. -For details on how to train SPLADE-distil CoCodenser Medium and perform inference, please see [guide provided by Naver Labs Europe](https://github.com/naver/splade/tree/main/anserini_evaluation). +## Corpus Download Download the corpus and unpack into `collections/`: @@ -28,23 +34,19 @@ wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_di tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/ ``` -To confirm, `msmarco.tar` is 4.9 GB and has MD5 checksum `54a81e855a7678bc83ecb3ecf1ac5c1c`. +To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `f77239a26d08856e6491a34062893b0c`. +With the corpus downloaded, the following command will perform the remaining steps below: -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: - -``` -python src/main/python/run_regression.py --index --verify --search \ - --regression ${test_name} \ - --corpus-path collections/msmarco-passage-splade_distil_cocodenser_medium +```bash +python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ + --corpus-path collections/${corpus} ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. 
- ## Indexing Sample indexing command: -``` +```bash ${index_cmds} ``` @@ -62,13 +64,13 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` diff --git a/src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template b/src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template index 3c1f7756a8..b2eeacd464 100644 --- a/src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template +++ b/src/main/resources/docgen/templates/msmarco-passage-unicoil-tilde-expansion.template @@ -12,39 +12,43 @@ Note that this page is automatically generated from [this template](${template}) From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end: -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} ``` -## Corpus - -We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with inference applied to generate the lexical representations). +We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL + TILDE expansions, i.e., we have performed model inference on every document and stored the output sparse vectors. Thus, no neural inference is involved. +From any machine, the following command will download the corpus and perform the complete regression, end to end: + +```bash +python src/main/python/run_regression.py --download --index --verify --search --regression ${test_name} +``` + +The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.
+ +## Corpus Download + Download the corpus and unpack into `collections/`: ```bash wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-tilde-expansion.tar -P collections/ - tar xvf collections/msmarco-passage-unicoil-tilde-expansion.tar -C collections/ ``` -To confirm, `msmarco-passage-unicoil-tilde-expansion.tar` is 3.9 GB and has MD5 checksum `1685aee10071441987ad87f2e91f1706`. +To confirm, `msmarco-passage-unicoil-tilde-expansion.tar` is 3.9 GB and has MD5 checksum `12a9c289d94e32fd63a7d39c9677d75c`. +With the corpus downloaded, the following command will perform the remaining steps below: -With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine: - -``` +```bash python src/main/python/run_regression.py --index --verify --search --regression ${test_name} \ --corpus-path collections/${corpus} ``` -Alternatively, you can simply copy/paste from the commands below and obtain the same results. - ## Indexing Sample indexing command: -``` +```bash ${index_cmds} ``` @@ -62,13 +66,13 @@ The regression experiments here evaluate on the 6980 dev set questions; see [thi After indexing has completed, you should be able to perform retrieval as follows: -``` +```bash ${ranking_cmds} ``` Evaluation can be performed using `trec_eval`: -``` +```bash ${eval_cmds} ``` diff --git a/src/main/resources/regression/dl19-passage-bm25-b8.yaml b/src/main/resources/regression/dl19-passage-bm25-b8.yaml index 6268c8d4e0..2c351c5293 100644 --- a/src/main/resources/regression/dl19-passage-bm25-b8.yaml +++ b/src/main/resources/regression/dl19-passage-bm25-b8.yaml @@ -1,7 +1,10 @@ --- -corpus: msmarco-passage +corpus: msmarco-passage-bm25-b8 corpus_path: collections/msmarco/msmarco-passage-bm25-b8/ +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar +download_checksum: 0a623e2c97ac6b7e814bf1323a97b435 + index_path: indexes/lucene-index.msmarco-passage-bm25-b8/ 
collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator diff --git a/src/main/resources/regression/dl19-passage-splade-distil-cocodenser-medium.yaml b/src/main/resources/regression/dl19-passage-splade-distil-cocodenser-medium.yaml index 5a399a98ba..93052d9f1e 100644 --- a/src/main/resources/regression/dl19-passage-splade-distil-cocodenser-medium.yaml +++ b/src/main/resources/regression/dl19-passage-splade-distil-cocodenser-medium.yaml @@ -2,6 +2,9 @@ corpus: msmarco-passage-splade_distil_cocodenser_medium corpus_path: collections/msmarco/msmarco-passage-splade_distil_cocodenser_medium +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar +download_checksum: f77239a26d08856e6491a34062893b0c + index_path: indexes/lucene-index.msmarco-passage-splade_distil_cocodenser_medium/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator diff --git a/src/main/resources/regression/dl20-passage-bm25-b8.yaml b/src/main/resources/regression/dl20-passage-bm25-b8.yaml index 8891b3dd76..af10a3b1ec 100644 --- a/src/main/resources/regression/dl20-passage-bm25-b8.yaml +++ b/src/main/resources/regression/dl20-passage-bm25-b8.yaml @@ -1,7 +1,10 @@ --- -corpus: msmarco-passage +corpus: msmarco-passage-bm25-b8 corpus_path: collections/msmarco/msmarco-passage-bm25-b8/ +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar +download_checksum: 0a623e2c97ac6b7e814bf1323a97b435 + index_path: indexes/lucene-index.msmarco-passage-bm25-b8/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator diff --git a/src/main/resources/regression/dl20-passage-splade-distil-cocodenser-medium.yaml b/src/main/resources/regression/dl20-passage-splade-distil-cocodenser-medium.yaml index 753905f704..3dd0392beb 100644 --- a/src/main/resources/regression/dl20-passage-splade-distil-cocodenser-medium.yaml +++ 
b/src/main/resources/regression/dl20-passage-splade-distil-cocodenser-medium.yaml @@ -2,6 +2,9 @@ corpus: msmarco-passage-splade_distil_cocodenser_medium corpus_path: collections/msmarco/msmarco-passage-splade_distil_cocodenser_medium +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar +download_checksum: f77239a26d08856e6491a34062893b0c + index_path: indexes/lucene-index.msmarco-passage-splade_distil_cocodenser_medium/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator diff --git a/src/main/resources/regression/msmarco-passage-bm25-b8.yaml b/src/main/resources/regression/msmarco-passage-bm25-b8.yaml index 43ef536094..ddf313dca1 100644 --- a/src/main/resources/regression/msmarco-passage-bm25-b8.yaml +++ b/src/main/resources/regression/msmarco-passage-bm25-b8.yaml @@ -1,7 +1,10 @@ --- -corpus: msmarco-passage +corpus: msmarco-passage-bm25-b8 corpus_path: collections/msmarco/msmarco-passage-bm25-b8/ +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar +download_checksum: 0a623e2c97ac6b7e814bf1323a97b435 + index_path: indexes/lucene-index.msmarco-passage-bm25-b8/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator diff --git a/src/main/resources/regression/msmarco-passage-deepimpact.yaml b/src/main/resources/regression/msmarco-passage-deepimpact.yaml index 8096356e6c..0ee3bc6444 100644 --- a/src/main/resources/regression/msmarco-passage-deepimpact.yaml +++ b/src/main/resources/regression/msmarco-passage-deepimpact.yaml @@ -2,6 +2,9 @@ corpus: msmarco-passage-deepimpact corpus_path: collections/msmarco/msmarco-passage-deepimpact/ +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar +download_checksum: 73843885b503af3c8b3ee62e5f5a9900 + index_path: indexes/lucene-index.msmarco-passage-deepimpact/ collection_class: JsonVectorCollection generator_class: 
DefaultLuceneDocumentGenerator diff --git a/src/main/resources/regression/msmarco-passage-distill-splade-max.yaml b/src/main/resources/regression/msmarco-passage-distill-splade-max.yaml index a2b9ba4aa7..b62e7b5f27 100644 --- a/src/main/resources/regression/msmarco-passage-distill-splade-max.yaml +++ b/src/main/resources/regression/msmarco-passage-distill-splade-max.yaml @@ -2,6 +2,9 @@ corpus: msmarco-passage-distill-splade-max corpus_path: collections/msmarco/msmarco-passage-distill-splade-max/ +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar +download_checksum: b5d126f5d9a8e1b3ef3f5cb0ba651725 + index_path: indexes/lucene-index.msmarco-passage-distill-splade-max/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator diff --git a/src/main/resources/regression/msmarco-passage-splade-distil-cocodenser-medium.yaml b/src/main/resources/regression/msmarco-passage-splade-distil-cocodenser-medium.yaml index 6dee08a2c1..adbc5065f1 100644 --- a/src/main/resources/regression/msmarco-passage-splade-distil-cocodenser-medium.yaml +++ b/src/main/resources/regression/msmarco-passage-splade-distil-cocodenser-medium.yaml @@ -2,6 +2,9 @@ corpus: msmarco-passage-splade_distil_cocodenser_medium corpus_path: collections/msmarco/msmarco-passage-splade_distil_cocodenser_medium +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar +download_checksum: f77239a26d08856e6491a34062893b0c + index_path: indexes/lucene-index.msmarco-passage-splade_distil_cocodenser_medium/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator diff --git a/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml b/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml index 5e56cbfff6..19b9ee2988 100644 --- a/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml +++ 
b/src/main/resources/regression/msmarco-passage-unicoil-tilde-expansion.yaml @@ -2,6 +2,9 @@ corpus: msmarco-passage-unicoil-tilde-expansion corpus_path: collections/msmarco/msmarco-passage-unicoil-tilde-expansion/ +download_url: https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-tilde-expansion.tar +download_checksum: 12a9c289d94e32fd63a7d39c9677d75c + index_path: indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion/ collection_class: JsonVectorCollection generator_class: DefaultLuceneDocumentGenerator