Add regression for MS MARCO doc with per-passage docTTTTTquery expansions (castorini#1414)
lintool authored Nov 16, 2020
1 parent b012311 commit f87c945
Showing 7 changed files with 171 additions and 3 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -69,6 +69,7 @@ For the most part, these runs are based on [_default_ parameter settings](https:
+ [Regressions for MS MARCO Passage Ranking with docTTTTTquery expansion](docs/regressions-msmarco-passage-docTTTTTquery.md)
+ [Regressions for MS MARCO Document Ranking ](docs/regressions-msmarco-doc.md)
+ [Regressions for MS MARCO Document Ranking with per-doc docTTTTTquery expansion](docs/regressions-msmarco-doc-docTTTTTquery-per-doc.md)
+ [Regressions for MS MARCO Document Ranking with per-passage docTTTTTquery expansion](docs/regressions-msmarco-doc-docTTTTTquery-per-passage.md)
+ [Regressions for the TREC 2019 Deep Learning Track (Passage Ranking Task)](docs/regressions-dl19-passage.md)
+ [Regressions for the TREC 2019 Deep Learning Track (Document Ranking Task)](docs/regressions-dl19-doc.md)
+ [Regressions for the TREC 2018 News Track (Background Linking Task)](docs/regressions-backgroundlinking18.md)
2 changes: 1 addition & 1 deletion docs/regressions-msmarco-doc-docTTTTTquery-per-doc.md
@@ -19,7 +19,7 @@ nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
>& logs/log.msmarco-doc-docTTTTTquery-per-doc &
```

The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the official document collection (a single file), in TREC format.
The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#replicating-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

60 changes: 60 additions & 0 deletions docs/regressions-msmarco-doc-docTTTTTquery-per-passage.md
@@ -0,0 +1,60 @@
# Anserini: Regressions for MS MARCO Document Ranking

This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking) with per-passage docTTTTTquery document expansion, which is integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to [this page](https://github.com/castorini/docTTTTTquery#Replicating-MS-MARCO-Document-Ranking-Results-with-Anserini).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-input /path/to/msmarco-doc-docTTTTTquery-per-passage \
-index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage.pos+docvectors+raw \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw \
>& logs/log.msmarco-doc-docTTTTTquery-per-passage &
```

The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#replicating-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/).
The regression experiments here evaluate on the 5193 dev set questions.

After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage.pos+docvectors+raw \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default.topics.msmarco-doc.dev.txt \
-bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000 &
```
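
The `-selectMaxPassage` options above can be read as a max-passage aggregation step: retrieve passage-level hits, then keep only the best-scoring passage for each document. The following Python sketch illustrates that idea on toy data. It is not Anserini's implementation; the function name `select_max_passage` and the passage id format `docid#segment` are assumptions made for illustration.

```python
def select_max_passage(hits, delimiter="#", max_docs=1000):
    """Collapse passage-level hits into document-level hits.

    hits: list of (passage_id, score) pairs, already sorted by descending score.
    Returns at most max_docs (doc_id, score) pairs, one per document, where the
    score is that of the document's highest-scoring passage.
    """
    best = {}  # insertion order is preserved, so the ranking is kept
    for passage_id, score in hits:
        # Assumed id layout: everything before the delimiter is the document id.
        doc_id = passage_id.split(delimiter, 1)[0]
        if doc_id not in best:
            # The first time a document appears is its best passage.
            best[doc_id] = score
            if len(best) == max_docs:
                break
    return list(best.items())


# Toy passage ranking (hypothetical ids and scores).
hits = [("D1#3", 12.5), ("D2#0", 11.0), ("D1#0", 9.2), ("D3#1", 8.7)]
ranked = select_max_passage(hits, max_docs=2)
print(ranked)  # [('D1', 12.5), ('D2', 11.0)]
```

Under this reading, `-hits 10000` asks for more passage hits than the final document ranking needs, so that enough distinct documents remain after collapsing to fill 1000 slots.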

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -m map -c -m recall.1000 -c src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-default.topics.msmarco-doc.dev.txt
```

## Effectiveness

With the above commands, you should be able to replicate the following results:

MAP | BM25 (Default)|
:---------------------------------------|-----------|
[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.3182 |


R@1000 | BM25 (Default)|
:---------------------------------------|-----------|
[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.9490 |

See [this page](https://github.com/castorini/docTTTTTquery#Replicating-MS-MARCO-Document-Ranking-Results-with-Anserini) for more details.
Note that here we are using `trec_eval` to evaluate the top 1000 hits for each query; beware, the runs provided by MS MARCO organizers for reranking have only 100 hits per query.
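
To see why the evaluation depth matters, here is a minimal Python sketch of recall at a cutoff `k`, averaged over every query in the qrels (which is what the `-c` flag of `trec_eval` requests). The function `recall_at_k` and the toy data are hypothetical; this is not how `trec_eval` is implemented.

```python
def recall_at_k(run, qrels, k):
    """Mean recall at depth k over all judged queries.

    run:   {qid: [docid, ...]} ranked lists, best first.
    qrels: {qid: set of relevant docids}.
    Queries missing from the run contribute a recall of 0.
    """
    judged = {qid: rel for qid, rel in qrels.items() if rel}
    total = 0.0
    for qid, relevant in judged.items():
        retrieved = set(run.get(qid, [])[:k])
        total += len(retrieved & relevant) / len(relevant)
    return total / len(judged)


# Toy data: q3 has a relevant document but no hits in the run.
run = {"q1": ["D1", "D2", "D3"], "q2": ["D9", "D4"]}
qrels = {"q1": {"D3"}, "q2": {"D4"}, "q3": {"D7"}}

for k in (1, 2, 3):
    print(k, recall_at_k(run, qrels, k))
```

In this toy example recall rises as the cutoff deepens, which is why a run truncated to 100 hits per query is not directly comparable to one evaluated at 1000.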
4 changes: 3 additions & 1 deletion docs/regressions.md
@@ -57,6 +57,7 @@ nohup python src/main/python/run_regression.py --collection msmarco-passage-doc2
nohup python src/main/python/run_regression.py --collection msmarco-passage-docTTTTTquery >& logs/log.msmarco-passage-docTTTTTquery &
nohup python src/main/python/run_regression.py --collection msmarco-doc >& logs/log.msmarco-doc &
nohup python src/main/python/run_regression.py --collection msmarco-doc-docTTTTTquery-per-doc >& logs/log.msmarco-doc-docTTTTTquery-per-doc &
nohup python src/main/python/run_regression.py --collection msmarco-doc-docTTTTTquery-per-passage >& logs/log.msmarco-doc-docTTTTTquery-per-passage &
nohup python src/main/python/run_regression.py --collection dl19-passage >& logs/log.dl19-passage &
nohup python src/main/python/run_regression.py --collection dl19-doc >& logs/log.dl19-doc &
@@ -98,7 +99,8 @@ nohup python src/main/python/run_regression.py --index --collection msmarco-pass
nohup python src/main/python/run_regression.py --index --collection msmarco-passage-doc2query >& logs/log.msmarco-passage-doc2query &
nohup python src/main/python/run_regression.py --index --collection msmarco-passage-docTTTTTquery >& logs/log.msmarco-passage-docTTTTTquery &
nohup python src/main/python/run_regression.py --index --collection msmarco-doc >& logs/log.msmarco-doc &
nohup python src/main/python/run_regression.py --index --collection msmarco-doc-docTTTTTquery-per-doc >& logs/logs/log.msmarco-doc-docTTTTTquery-per-doc &
nohup python src/main/python/run_regression.py --index --collection msmarco-doc-docTTTTTquery-per-doc >& logs/log.msmarco-doc-docTTTTTquery-per-doc &
nohup python src/main/python/run_regression.py --index --collection msmarco-doc-docTTTTTquery-per-passage >& logs/log.msmarco-doc-docTTTTTquery-per-passage &
nohup python src/main/python/run_regression.py --index --collection dl19-passage >& logs/log.dl19-passage &
nohup python src/main/python/run_regression.py --index --collection dl19-doc >& logs/log.dl19-doc &
@@ -14,7 +14,7 @@ Typical indexing command:
${index_cmds}
```

The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the official document collection (a single file), in TREC format.
The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#replicating-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

@@ -0,0 +1,45 @@
# Anserini: Regressions for MS MARCO Document Ranking

This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking) with per-passage docTTTTTquery document expansion, which is integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to [this page](https://github.com/castorini/docTTTTTquery#Replicating-MS-MARCO-Document-Ranking-Results-with-Anserini).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
${index_cmds}
```

The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should be a directory containing the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#replicating-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/).
The regression experiments here evaluate on the 5193 dev set questions.

After indexing has completed, you should be able to perform retrieval as follows:

```
${ranking_cmds}
```

Evaluation can be performed using `trec_eval`:

```
${eval_cmds}
```

## Effectiveness

With the above commands, you should be able to replicate the following results:

${effectiveness}

See [this page](https://github.com/castorini/docTTTTTquery#Replicating-MS-MARCO-Document-Ranking-Results-with-Anserini) for more details.
Note that here we are using `trec_eval` to evaluate the top 1000 hits for each query; beware, the runs provided by MS MARCO organizers for reranking have only 100 hits per query.
@@ -0,0 +1,60 @@
---
name: msmarco-doc-docTTTTTquery-per-passage
index_command: target/appassembler/bin/IndexCollection
index_utils_command: target/appassembler/bin/IndexReaderUtils
search_command: target/appassembler/bin/SearchCollection
topic_root: src/main/resources/topics-and-qrels/
qrels_root: src/main/resources/topics-and-qrels/
index_root:
ranking_root:
collection: JsonCollection
generator: DefaultLuceneDocumentGenerator
threads: 1
index_options:
  - -storePositions
  - -storeDocvectors
  - -storeRaw
topic_reader: TsvInt
evals:
  - command: tools/eval/trec_eval.9.0.4/trec_eval
    params:
      - -m map
      - -c
    separator: "\t"
    parse_index: 2
    metric: map
    metric_precision: 4
    can_combine: true
  - command: tools/eval/trec_eval.9.0.4/trec_eval
    params:
      - -m recall.1000
      - -c
    separator: "\t"
    parse_index: 2
    metric: R@1000
    metric_precision: 4
    can_combine: true
input_roots:
  - /tuna1/ # on tuna
  - /store/ # on orca
  - /scratch2/ # on damiano
input: collections/msmarco/doc-docTTTTTquery-per-passage
index_path: indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage.pos+docvectors+raw
index_stats:
  documents: 20544550
  documents (non-empty): 20544550
  total terms: 4203956960
topics:
  - name: "[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Document-Ranking)"
    path: topics.msmarco-doc.dev.txt
    qrel: qrels.msmarco-doc.dev.txt
models:
  - name: bm25-default
    display: BM25 (Default)
    params:
      - -bm25 -hits 10000 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
    results:
      map:
        - 0.3182
      R@1000:
        - 0.9490
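
The `separator`, `parse_index`, and `metric_precision` fields in each `evals` entry suggest how a score is pulled out of `trec_eval` output: split a tab-separated line and take the third field. The Python sketch below illustrates that reading. The function `parse_metric` is hypothetical, and taking the last output line is an assumption; the actual logic in `run_regression.py` may differ.

```python
def parse_metric(stdout, separator="\t", parse_index=2, precision=4):
    """Extract a metric value from trec_eval-style output.

    trec_eval prints one line per measure as: name <tab> query-id <tab> value.
    Assumption: the value of interest is on the last non-empty line.
    """
    last_line = stdout.strip().split("\n")[-1]
    value = float(last_line.split(separator)[parse_index])
    return round(value, precision)


# Illustrative output line in trec_eval's usual layout, using the MAP value
# recorded under `results` in this YAML file.
sample = "map                   \tall\t0.3182\n"
print(parse_metric(sample))  # 0.3182
```

A regression check could then compare the parsed value against the expected entry under `results` for the same metric and model.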
