Commit: Refactor IRST script (castorini#1130)
* change bm25 to batch search and include auto ibm model downloading

* update unittest and documentation
yuki617 authored May 3, 2022
1 parent 5707776 commit 1facc72
Showing 4 changed files with 114 additions and 177 deletions.
59 changes: 7 additions & 52 deletions docs/experiments-msmarco-irst.md
@@ -19,42 +19,25 @@ As an alternative, we also make available pre-built indexes (in which case the i
The IBM model we used in this experiment is referenced in the Boytsov et al. [paper](https://arxiv.org/pdf/2102.06815.pdf).
Note that there is a separate guide for training the IBM Model on [FlexNeuART](https://github.com/oaqa/FlexNeuART/tree/master/demo).

-Download trained IBM model:
-```bash
-mkdir irst_test/
-
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
-tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
-```
-
-Download term freq statistics for wp collection:
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-passage.20220411.pickle -P irst_test/
-```
-
-Next we can run our script to get our end-to-end results.
+We can run our script to get our end-to-end results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--topics topics \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-v1-passage \
--output irst_test/regression_test_sum.irst_topics.trec \
--alpha 0.1 \
--wp-stat irst_test/bert_wp_term_freq.msmarco-passage.20220411.pickle
```
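The `--alpha` flag weights the translation-model evidence against the exact-match evidence when scores are combined. As a rough illustration only — the linear-interpolation form below is an assumption, and the exact IRST scoring formula is the one implemented in `pyserini.search.lucene.irst`:

```python
def combine_scores(translation_score, exact_match_score, alpha=0.1):
    """Illustrative linear interpolation controlled by an alpha weight.

    Assumed form, not the exact IRST formula; see the
    pyserini.search.lucene.irst implementation for the real scoring.
    """
    return alpha * translation_score + (1 - alpha) * exact_match_score

print(combine_scores(0.8, 0.5, alpha=0.1))
```

With `alpha=0.1`, as in the IRST (Sum) command above, the exact-match score dominates; the doc-level runs below use `alpha=0.3`.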

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--topics topics \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-v1-passage \
--output irst_test/regression_test_max.irst_topics.trec \
--alpha 0.3 \
--max-sim \
--wp-stat irst_test/bert_wp_term_freq.msmarco-passage.20220411.pickle
```
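The `--wp-stat` argument points at a pickled term-frequency file. A minimal sketch for inspecting such a file, assuming (this is not confirmed by the guide) that the pickle stores a plain mapping from wordpiece tokens to collection frequencies; the tiny stats file written here is a hypothetical stand-in for the real download:

```python
import pickle
import tempfile

def load_wp_stats(path):
    # Load a --wp-stat pickle; assumed (not confirmed by this guide)
    # to be a dict mapping wordpiece tokens to collection frequencies.
    with open(path, "rb") as f:
        return pickle.load(f)

# Demo with a tiny hypothetical stats file standing in for
# bert_wp_term_freq.msmarco-passage.20220411.pickle.
with tempfile.NamedTemporaryFile(suffix=".pickle", delete=False) as tmp:
    pickle.dump({"##ing": 12345, "search": 6789}, tmp)

stats = load_wp_stats(tmp.name)
print(len(stats), stats["search"])
```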

For different topic sets, the `--topics` and `--irst_topics` arguments differ; since Pyserini has all these topics available, we can pass in
@@ -97,43 +80,29 @@ python -m pyserini.eval.trec_eval -c -M 10 -m ndcg_cut.10 -m map -m recip_rank m
Now, we perform the experiment on full documents.
### Performing End-to-End Retrieval Using Already Trained Model

-Download trained IBM models. Please note that we did not have time to train a new IBM model on MS MARCO doc data, we used the trained MS MARCO passage IBM Model1 instead.
-
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
-tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
-```
-
-Download term freq statistics for wp collection:
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-doc.20220411.pickle -P irst_test/
-```
+Please note that we did not have time to train a new IBM model on MS MARCO doc data, so we used the trained MS MARCO passage IBM Model1 instead.

Next we can run our script to get our retrieval results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc \
--output irst_test/regression_test_sum.irst_topics.trec \
--alpha 0.3 \
---hits 1000 \
---wp-stat irst_test/bert_wp_term_freq.msmarco-doc.20220411.pickle
+--hits 1000
```

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc \
--output irst_test/regression_test_max.irst_topics.trec \
--alpha 0.3 \
--hits 1000 \
---max-sim \
---wp-stat irst_test/bert_wp_term_freq.msmarco-doc.20220411.pickle
+--max-sim
```
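The `--output` run files are in standard six-column TREC format (`qid Q0 docid rank score tag`). A quick sanity check, shown here on a hypothetical two-line sample rather than the real `irst_test/` output:

```python
# Sanity-check that a run file has six whitespace-separated columns
# per line (qid Q0 docid rank score tag). The sample below is a
# hypothetical stand-in for regression_test_sum.irst_topics.trec.
sample = "2 Q0 D1555982 1 21.5 irst\n2 Q0 D301595 2 20.1 irst\n"
with open("sample_run.trec", "w") as f:
    f.write(sample)

with open("sample_run.trec") as f:
    bad = [line for line in f if len(line.split()) != 6]

print("format ok" if not bad else "malformed lines: %d" % len(bad))
```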


@@ -180,45 +149,31 @@ We utilized the MaxP technique during the ranking, that is scoring documents bas
### Performing End-to-End Retrieval Using Already Trained Model


-Download trained IBM models. Please note that we did not have time to train a new IBM model on MS MARCO doc data, we used the trained MS MARCO passage IBM Model1 instead.
-
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
-tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
-```
-
-Download term freq statistics for wp collection:
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle -P irst_test/
-```
+Please note that we did not have time to train a new IBM model on MS MARCO doc data, so we used the trained MS MARCO passage IBM Model1 instead.

Next we can run our script to get our retrieval results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc-segmented \
--output irst_test/regression_test_sum.irst_topics.trec \
--alpha 0.3 \
--segments \
---hits 10000 \
---wp-stat irst_test/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle
+--hits 10000
```

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc-segmented \
--output irst_test/regression_test_max.irst_topics.trec \
--alpha 0.3 \
--hits 10000 \
--segments \
---max-sim \
---wp-stat irst_test/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle
+--max-sim
```
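The MaxP ranking mentioned above scores a document by the best of its segments. A minimal sketch, assuming segment ids follow the `<docid>#<n>` convention used by the msmarco-v1-doc-segmented index:

```python
from collections import defaultdict

def maxp_aggregate(segment_scores):
    # MaxP: a document's score is the maximum score over its segments.
    # Assumes segment ids of the form "<docid>#<n>", the convention
    # used by the msmarco-v1-doc-segmented index.
    doc_scores = defaultdict(lambda: float("-inf"))
    for seg_id, score in segment_scores.items():
        docid, _, _ = seg_id.partition("#")
        doc_scores[docid] = max(doc_scores[docid], score)
    return dict(doc_scores)

print(maxp_aggregate({"D1#0": 1.2, "D1#1": 3.4, "D2#0": 2.0}))
```

This is also why `--hits 10000` is used here: retrieving more segments leaves enough distinct documents after aggregation.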

