Add --download option in run_regression.py for more MS MARCO V1 passage conditions (#1901)
lintool authored Jun 13, 2022
1 parent e528417 commit dc07344
Showing 28 changed files with 502 additions and 268 deletions.
35 changes: 20 additions & 15 deletions README.md
@@ -57,18 +57,18 @@ See individual pages for details!
|---|:---:|:----:|:----:|
| **Unsupervised Lexical** |
| BoW baselines | [+](docs/regressions-msmarco-passage.md) | [+](docs/regressions-dl19-passage.md) | [+](docs/regressions-dl20-passage.md) |
| Quantized BM25 | [+](docs/regressions-msmarco-passage-bm25-b8.md) | [+](docs/regressions-dl19-passage-bm25-b8.md) | [+](docs/regressions-dl20-passage-bm25-b8.md) |
| Quantized BM25 | [](docs/regressions-msmarco-passage-bm25-b8.md) | [](docs/regressions-dl19-passage-bm25-b8.md) | [](docs/regressions-dl20-passage-bm25-b8.md) |
| WP baselines | [+](docs/regressions-msmarco-passage-wp.md) | [+](docs/regressions-dl19-passage-wp.md) | [+](docs/regressions-dl20-passage-wp.md) |
| doc2query | [+](docs/regressions-msmarco-passage-doc2query.md) |
| doc2query-T5 | [+](docs/regressions-msmarco-passage-docTTTTTquery.md) | [+](docs/regressions-dl19-passage-docTTTTTquery.md) | [+](docs/regressions-dl20-passage-docTTTTTquery.md) |
| **Learned sparse lexical (uniCOIL family)** |
| uniCOIL noexp | [](docs/regressions-msmarco-passage-unicoil-noexp.md) | [](docs/regressions-dl19-passage-unicoil-noexp.md) | [](docs/regressions-dl20-passage-unicoil-noexp.md) |
| uniCOIL with doc2query-T5 | [](docs/regressions-msmarco-passage-unicoil.md) | [](docs/regressions-dl19-passage-unicoil.md) | [](docs/regressions-dl20-passage-unicoil.md) |
| uniCOIL with TILDE | [+](docs/regressions-msmarco-passage-unicoil-tilde-expansion.md) |
| uniCOIL with TILDE | [](docs/regressions-msmarco-passage-unicoil-tilde-expansion.md) |
| **Learned sparse lexical (other)** |
| DeepImpact | [+](docs/regressions-msmarco-passage-deepimpact.md) |
| SPLADEv2 | [+](docs/regressions-msmarco-passage-distill-splade-max.md) |
| SPLADE-distill CoCodenser-medium | [+](docs/regressions-msmarco-passage-splade-distil-cocodenser-medium.md) | [+](docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md) | [+](docs/regressions-dl20-passage-splade-distil-cocodenser-medium.md) |
| DeepImpact | [](docs/regressions-msmarco-passage-deepimpact.md) |
| SPLADEv2 | [](docs/regressions-msmarco-passage-distill-splade-max.md) |
| SPLADE-distill CoCodenser-medium | [](docs/regressions-msmarco-passage-splade-distil-cocodenser-medium.md) | [](docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md) | [](docs/regressions-dl20-passage-splade-distil-cocodenser-medium.md) |

### MS MARCO V1 Document Corpus

@@ -170,16 +170,21 @@ See individual pages for details!

### Available Corpora

| Corpora | Size | Checksum |
|:--------|-----:|:---------|
| [MS MARCO V1 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-noexp.tar) | 2.7 GB | `f17ddd8c7c00ff121c3c3b147d2e17d8` |
| [MS MARCO V1 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil.tar) | 3.4 GB | `78eef752c78c8691f7d61600ceed306f` |
| [MS MARCO V1 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar) | 11 GB | `11b226e1cacd9c8ae0a660fd14cdd710` |
| [MS MARCO V1 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar) | 19 GB | `6a00e2c0c375cb1e52c83ae5ac377ebb` |
| [MS MARCO V2 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar) | 24 GB | `d9cc1ed3049746e68a2c91bf90e5212d` |
| [MS MARCO V2 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar) | 41 GB | `1949a00bfd5e1f1a230a04bbc1f01539` |
| [MS MARCO V2 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar) | 55 GB | `97ba262c497164de1054f357caea0c63` |
| [MS MARCO V2 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar) | 72 GB | `c5639748c2cbad0152e10b0ebde3b804` |
| Corpora | Size | Checksum |
|:------------------------------------------------------------------------------------------------------------------------------------------------|-------:|:-----------------------------------|
| [MS MARCO V1 passage: Quantized BM25](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar) | 1.2 GB | `0a623e2c97ac6b7e814bf1323a97b435` |
| [MS MARCO V1 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-noexp.tar) | 2.7 GB | `f17ddd8c7c00ff121c3c3b147d2e17d8` |
| [MS MARCO V1 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil.tar) | 3.4 GB | `78eef752c78c8691f7d61600ceed306f` |
| [MS MARCO V1 passage: uniCOIL (TILDE)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-tilde-expansion.tar) | 3.9 GB | `12a9c289d94e32fd63a7d39c9677d75c` |
| [MS MARCO V1 passage: DeepImpact](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-deepimpact.tar) | 3.6 GB | `73843885b503af3c8b3ee62e5f5a9900` |
| [MS MARCO V1 passage: SPLADEv2](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar) | 9.9 GB | `b5d126f5d9a8e1b3ef3f5cb0ba651725` |
| [MS MARCO V1 passage: SPLADE CoCodenser](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar) | 4.9 GB | `f77239a26d08856e6491a34062893b0c` |
| [MS MARCO V1 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil-noexp.tar) | 11 GB | `11b226e1cacd9c8ae0a660fd14cdd710` |
| [MS MARCO V1 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-segmented-unicoil.tar) | 19 GB | `6a00e2c0c375cb1e52c83ae5ac377ebb` |
| [MS MARCO V2 passage: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar) | 24 GB | `d9cc1ed3049746e68a2c91bf90e5212d` |
| [MS MARCO V2 passage: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar) | 41 GB | `1949a00bfd5e1f1a230a04bbc1f01539` |
| [MS MARCO V2 doc: uniCOIL (noexp)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot_v2.tar) | 55 GB | `97ba262c497164de1054f357caea0c63` |
| [MS MARCO V2 doc: uniCOIL (d2q-T5)](https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar) | 72 GB | `c5639748c2cbad0152e10b0ebde3b804` |
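
The checksums in the table can be checked against a downloaded tarball. The following is a minimal sketch, not part of Anserini: the helper names `md5_of_file` and `verify` are hypothetical, and the two `EXPECTED_MD5` entries are copied from rows of the table above.

```python
import hashlib

# Expected MD5 checksums, copied from two rows of the table above.
EXPECTED_MD5 = {
    "msmarco-passage-bm25-b8.tar": "0a623e2c97ac6b7e814bf1323a97b435",
    "msmarco-passage-unicoil-noexp.tar": "f17ddd8c7c00ff121c3c3b147d2e17d8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so multi-GB tarballs never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify("collections/msmarco-passage-bm25-b8.tar", EXPECTED_MD5["msmarco-passage-bm25-b8.tar"])` would check the Quantized BM25 tarball, assuming it was saved under `collections/`.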

## Additional Documentation

53 changes: 41 additions & 12 deletions docs/regressions-dl19-passage-bm25-b8.md
@@ -12,25 +12,50 @@ Note that this page is automatically generated from [this template](../src/main/

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
```bash
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-bm25-b8
```

From any machine, the following command will download the corpus (as quantized BM25 weights) and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl19-passage-bm25-b8
```

The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

## Corpus Download

Download the corpus and unpack into `collections/`:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-bm25-b8.tar -P collections/
tar xvf collections/msmarco-passage-bm25-b8.tar -C collections/
```

To confirm, `msmarco-passage-bm25-b8.tar` is 1.2 GB and has MD5 checksum `0a623e2c97ac6b7e814bf1323a97b435`.
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-bm25-b8 \
--corpus-path collections/msmarco-passage-bm25-b8
```

## Indexing

Typical indexing command:

```
```bash
target/appassembler/bin/IndexCollection \
-collection JsonVectorCollection \
-input /path/to/msmarco-passage \
-input /path/to/msmarco-passage-bm25-b8 \
-index indexes/lucene-index.msmarco-passage-bm25-b8/ \
-generator DefaultLuceneDocumentGenerator \
-threads 9 -impact -pretokenized \
>& logs/log.msmarco-passage &
>& logs/log.msmarco-passage-bm25-b8 &
```

The directory `/path/to/msmarco-passage/` should contain `jsonl` files with quantized BM25 vectors for every document.
The directory `/path/to/msmarco-passage-bm25-b8/` should contain `jsonl` files with quantized BM25 vectors for every document.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

@@ -42,22 +67,22 @@ The original data can be found [here](https://trec.nist.gov/data/deep2019.html).

After indexing has completed, you should be able to perform retrieval as follows:

```
```bash
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-passage-bm25-b8/ \
-topics src/main/resources/topics-and-qrels/topics.dl19-passage.txt \
-topicreader TsvInt \
-output runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt \
-output runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt \
-impact &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt
tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt
tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt
tools/eval/trec_eval.9.0.4/trec_eval -m recall.1000 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage.bm25-b8.topics.dl19-passage.txt
```bash
tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt
tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt
tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt
tools/eval/trec_eval.9.0.4/trec_eval -m recall.1000 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt
```
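
To collect the scores programmatically, the `trec_eval` calls can be wrapped in a small script. The sketch below is illustrative only: `parse_trec_eval` and `run_trec_eval` are hypothetical helpers, and the parser assumes `trec_eval` prints one summary line per measure with three whitespace-separated fields (measure name, the query id `all`, and the score).

```python
import subprocess

# Path to the trec_eval binary, as used in the commands above.
TREC_EVAL = "tools/eval/trec_eval.9.0.4/trec_eval"


def parse_trec_eval(output):
    """Parse trec_eval summary lines into {measure: value}.

    Assumes each summary line has three whitespace-separated fields:
    measure name, query id (`all` for the summary), and the score.
    """
    scores = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[1] != "all":
            continue
        try:
            scores[fields[0]] = float(fields[2])
        except ValueError:
            # Skip non-numeric summary lines such as the run id.
            continue
    return scores


def run_trec_eval(metric_args, qrels, run_file):
    """Invoke trec_eval with the given flags and return the parsed scores."""
    cmd = [TREC_EVAL, *metric_args, qrels, run_file]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_trec_eval(result.stdout)
```

A call such as `run_trec_eval(["-m", "recall.1000", "-c", "-l", "2"], "src/main/resources/topics-and-qrels/qrels.dl19-passage.txt", "runs/run.msmarco-passage-bm25-b8.bm25-b8.topics.dl19-passage.txt")` mirrors the last command above and returns a dictionary keyed by measure name.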

## Effectiveness
@@ -82,3 +107,7 @@ With the above commands, you should be able to reproduce the following results:
| R@1000 | BM25 (default parameters, quantized 8 bits)|
|:-------------------------------------------------------------------------------------------------------------|-----------|
| [DL19 (Passage)](https://trec.nist.gov/data/deep2019.html) | 0.7639 |

## Reproduction Log[*](reproducibility.md)

To add to this reproduction log, modify [this template](../src/main/resources/docgen/templates/dl19-passage-bm25-b8.template) and run `bin/build.sh` to rebuild the documentation.
38 changes: 20 additions & 18 deletions docs/regressions-dl19-passage-splade-distil-cocodenser-medium.md
@@ -13,41 +13,43 @@ Note that this page is automatically generated from [this template](../src/main/

From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:

```
```bash
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-splade-distil-cocodenser-medium
```

## Corpus

We make available a version of the MS MARCO passage corpus that has already been processed with the model (i.e., with infrerence applied to generate the lexical representations).
We make available a version of the MS MARCO passage corpus that has already been processed with SPLADE-distil CoCodenser Medium, i.e., we have performed model inference on every document and stored the output sparse vectors.
Thus, no neural inference is involved.
For details on how to train SPLADE-distil CoCodenser Medium and perform inference, please see the [guide provided by Naver Labs Europe](https://github.com/naver/splade/tree/main/anserini_evaluation).

Download the corpus and unpack into `collections/`:
From any machine, the following command will download the corpus and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl19-passage-splade-distil-cocodenser-medium
```
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/

tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/
```
The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `54a81e855a7678bc83ecb3ecf1ac5c1c`.
## Corpus Download

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
Download the corpus and unpack into `collections/`:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-splade_distil_cocodenser_medium.tar -P collections/
tar xvf collections/msmarco-passage-splade_distil_cocodenser_medium.tar -C collections/
```
python src/main/python/run_regression.py --index --verify --search \
--regression dl19-passage-splade-distil-cocodenser-medium \

To confirm, `msmarco-passage-splade_distil_cocodenser_medium.tar` is 4.9 GB and has MD5 checksum `f77239a26d08856e6491a34062893b0c`.
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-splade-distil-cocodenser-medium \
--corpus-path collections/msmarco-passage-splade_distil_cocodenser_medium
```

Alternatively, you can simply copy/paste from the commands below and obtain the same results.

## Indexing

Sample indexing command:

```
```bash
target/appassembler/bin/IndexCollection \
-collection JsonVectorCollection \
-input /path/to/msmarco-passage-splade_distil_cocodenser_medium \
@@ -72,7 +74,7 @@ The original data can be found [here](https://trec.nist.gov/data/deep2019.html).

After indexing has completed, you should be able to perform retrieval as follows:

```
```bash
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.msmarco-passage-splade_distil_cocodenser_medium/ \
-topics src/main/resources/topics-and-qrels/topics.dl19-passage.splade_distil_cocodenser_medium.tsv.gz \
@@ -83,7 +85,7 @@ target/appassembler/bin/SearchCollection \

Evaluation can be performed using `trec_eval`:

```
```bash
tools/eval/trec_eval.9.0.4/trec_eval -m map -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl19-passage.splade_distil_cocodenser_medium.txt
tools/eval/trec_eval.9.0.4/trec_eval -m ndcg_cut.10 -c src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl19-passage.splade_distil_cocodenser_medium.txt
tools/eval/trec_eval.9.0.4/trec_eval -m recall.100 -c -l 2 src/main/resources/topics-and-qrels/qrels.dl19-passage.txt runs/run.msmarco-passage-splade_distil_cocodenser_medium.splade_distil_cocodenser_medium.topics.dl19-passage.splade_distil_cocodenser_medium.txt