Commit: Refactor IRST script (castorini#1130)
* change bm25 to batch search and include auto ibm model downloading

* update unittest and documentation
yuki617 authored May 3, 2022
1 parent 5707776 commit 1facc72
Showing 4 changed files with 114 additions and 177 deletions.
59 changes: 7 additions & 52 deletions docs/experiments-msmarco-irst.md
@@ -19,42 +19,25 @@ As an alternative, we also make available pre-built indexes (in which case the i
The IBM model we used in this experiment is referenced in the Boytsov et al. [paper](https://arxiv.org/pdf/2102.06815.pdf).
Note that there is a separate guide for training the IBM Model on [FlexNeuART](https://github.com/oaqa/FlexNeuART/tree/master/demo).

-Download trained IBM model:
-```bash
-mkdir irst_test/
-
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
-tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
-```
-
-Download term freq statistics for wp collection:
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-passage.20220411.pickle -P irst_test/
-```
-
-Next we can run our script to get our end-to-end results.
+We can run our script to get our end-to-end results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--topics topics \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-v1-passage \
--output irst_test/regression_test_sum.irst_topics.trec \
--alpha 0.1 \
--wp-stat irst_test/bert_wp_term_freq.msmarco-passage.20220411.pickle
```
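The `--alpha` flag weights the translation-model evidence against the exact-match evidence when scores are combined. As a rough illustration only — the linear-interpolation form below is an assumption, and the exact IRST scoring formula is the one implemented in `pyserini.search.lucene.irst`:

```python
def combine_scores(translation_score, exact_match_score, alpha=0.1):
    """Illustrative linear interpolation controlled by an alpha weight.

    Assumed form, not the exact IRST formula; see the
    pyserini.search.lucene.irst implementation for the real scoring.
    """
    return alpha * translation_score + (1 - alpha) * exact_match_score

print(combine_scores(0.8, 0.5, alpha=0.1))
```

With `alpha=0.1`, as in the IRST (Sum) command above, the exact-match score dominates; the doc-level runs below use `alpha=0.3`.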

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--topics topics \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-v1-passage \
--output irst_test/regression_test_max.irst_topics.trec \
--alpha 0.3 \
--max-sim \
--wp-stat irst_test/bert_wp_term_freq.msmarco-passage.20220411.pickle
```
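The `--wp-stat` argument points at a pickled term-frequency file. A minimal sketch for inspecting such a file, assuming (this is not confirmed by the guide) that the pickle stores a plain mapping from wordpiece tokens to collection frequencies; the tiny stats file written here is a hypothetical stand-in for the real download:

```python
import pickle
import tempfile

def load_wp_stats(path):
    # Load a --wp-stat pickle; assumed (not confirmed by this guide)
    # to be a dict mapping wordpiece tokens to collection frequencies.
    with open(path, "rb") as f:
        return pickle.load(f)

# Demo with a tiny hypothetical stats file standing in for
# bert_wp_term_freq.msmarco-passage.20220411.pickle.
with tempfile.NamedTemporaryFile(suffix=".pickle", delete=False) as tmp:
    pickle.dump({"##ing": 12345, "search": 6789}, tmp)

stats = load_wp_stats(tmp.name)
print(len(stats), stats["search"])
```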

For different topic sets, the `--topics` and `--irst_topics` arguments differ; since Pyserini has all these topics available, we can pass in
@@ -97,43 +80,29 @@ python -m pyserini.eval.trec_eval -c -M 10 -m ndcg_cut.10 -m map -m recip_rank m
Now, we perform the experiment on full documents.
### Performing End-to-End Retrieval Using Already Trained Model

-Download trained IBM models. Please note that we did not have time to train a new IBM model on MS MARCO doc data, we used the trained MS MARCO passage IBM Model1 instead.
-
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
-tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
-```
-
-Download term freq statistics for wp collection:
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-doc.20220411.pickle -P irst_test/
-```
+Please note that we did not have time to train a new IBM model on MS MARCO doc data, so we used the trained MS MARCO passage IBM Model1 instead.

Next we can run our script to get our retrieval results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc \
--output irst_test/regression_test_sum.irst_topics.trec \
--alpha 0.3 \
---hits 1000 \
---wp-stat irst_test/bert_wp_term_freq.msmarco-doc.20220411.pickle
+--hits 1000
```

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc \
--output irst_test/regression_test_max.irst_topics.trec \
--alpha 0.3 \
--hits 1000 \
---max-sim \
---wp-stat irst_test/bert_wp_term_freq.msmarco-doc.20220411.pickle
+--max-sim
```
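The `--output` run files are in standard six-column TREC format (`qid Q0 docid rank score tag`). A quick sanity check, shown here on a hypothetical two-line sample rather than the real `irst_test/` output:

```python
# Sanity-check that a run file has six whitespace-separated columns
# per line (qid Q0 docid rank score tag). The sample below is a
# hypothetical stand-in for regression_test_sum.irst_topics.trec.
sample = "2 Q0 D1555982 1 21.5 irst\n2 Q0 D301595 2 20.1 irst\n"
with open("sample_run.trec", "w") as f:
    f.write(sample)

with open("sample_run.trec") as f:
    bad = [line for line in f if len(line.split()) != 6]

print("format ok" if not bad else "malformed lines: %d" % len(bad))
```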


@@ -180,45 +149,31 @@ We utilized the MaxP technique during the ranking, that is scoring documents bas
### Performing End-to-End Retrieval Using Already Trained Model


-Download trained IBM models. Please note that we did not have time to train a new IBM model on MS MARCO doc data, we used the trained MS MARCO passage IBM Model1 instead.
-
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
-tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
-```
-
-Download term freq statistics for wp collection:
-```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle -P irst_test/
-```
+Please note that we did not have time to train a new IBM model on MS MARCO doc data, so we used the trained MS MARCO passage IBM Model1 instead.

Next we can run our script to get our retrieval results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc-segmented \
--output irst_test/regression_test_sum.irst_topics.trec \
--alpha 0.3 \
--segments \
---hits 10000 \
---wp-stat irst_test/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle
+--hits 10000
```

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc-segmented \
--output irst_test/regression_test_max.irst_topics.trec \
--alpha 0.3 \
--hits 10000 \
--segments \
---max-sim \
---wp-stat irst_test/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle
+--max-sim
```
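The MaxP ranking mentioned above scores a document by the best of its segments. A minimal sketch, assuming segment ids follow the `<docid>#<n>` convention used by the msmarco-v1-doc-segmented index:

```python
from collections import defaultdict

def maxp_aggregate(segment_scores):
    # MaxP: a document's score is the maximum score over its segments.
    # Assumes segment ids of the form "<docid>#<n>", the convention
    # used by the msmarco-v1-doc-segmented index.
    doc_scores = defaultdict(lambda: float("-inf"))
    for seg_id, score in segment_scores.items():
        docid, _, _ = seg_id.partition("#")
        doc_scores[docid] = max(doc_scores[docid], score)
    return dict(doc_scores)

print(maxp_aggregate({"D1#0": 1.2, "D1#1": 3.4, "D2#0": 2.0}))
```

This is also why `--hits 10000` is used here: retrieving more segments leaves enough distinct documents after aggregation.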

