Add support for readable lowercase topics #2393

Merged
merged 4 commits into from
Feb 26, 2024

lintool
Member

@lintool lintool commented Feb 25, 2024

For a command like:

java -cp anserini-0.24.1-fatjar.jar io.anserini.search.SearchCollection \
  -index msmarco-v1-passage \
  -topics MSMARCO_PASSAGE_DEV_SUBSET \
  -output run.msmarco-passage.bm25.txt \
  -threads 16 \
  -bm25

With this change, the topics can be referred to as msmarco-passage-dev instead of by the full enum name MSMARCO_PASSAGE_DEV_SUBSET.
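The idea of a readable alias can be sketched as a small lookup. This is only a shell-side illustration, not Anserini's implementation (the PR makes the change in Java, in SearchCollection); the function name `resolve_topics` is hypothetical, and the only mapping shown is the one stated above.

```shell
#!/bin/sh
# Hypothetical helper: maps a readable lowercase topics name to its enum name.
# Only the pair below is taken from the PR description; a real table would be longer.
resolve_topics() {
  case "$1" in
    msmarco-passage-dev) echo "MSMARCO_PASSAGE_DEV_SUBSET" ;;
    *) echo "$1" ;;  # unknown names (including full enum names) pass through unchanged
  esac
}

resolve_topics msmarco-passage-dev
```

Passing unknown names through unchanged keeps existing invocations that use the full enum name working.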

codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 98.03922%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 66.40%. Comparing base (eacd135) to head (97cfa9e).

Files                                                  Patch %  Lines
...main/java/io/anserini/search/SearchCollection.java  60.00%   1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2393      +/-   ##
============================================
+ Coverage     66.12%   66.40%   +0.27%     
- Complexity     1405     1408       +3     
============================================
  Files           207      207              
  Lines         11724    11820      +96     
  Branches       1473     1476       +3     
============================================
+ Hits           7753     7849      +96     
+ Misses         3461     3460       -1     
- Partials        510      511       +1     


@lintool
Member Author

lintool commented Feb 26, 2024

Runs with ONNX:

target/appassembler/bin/SearchCollection       -index msmarco-v1-passage                            -topics msmarco-v1-passage-dev                                       -output runs/run.msmarco-passage.bm25.txt                          -threads 16 -bm25
target/appassembler/bin/SearchCollection       -index msmarco-v1-passage-splade-pp-ed               -topics msmarco-v1-passage-dev -encoder SpladePlusPlusEnsembleDistil -output runs/run.msmarco-passage.splade-pp-ed.onnx.txt             -threads 16 -impact -pretokenized
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil             -topics msmarco-v1-passage-dev -encoder CosDprDistil                 -output runs/run.msmarco-passage.cos-dpr-distil.onnx.txt           -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil-quantized   -topics msmarco-v1-passage-dev -encoder CosDprDistil                 -output runs/run.msmarco-passage.cos-dpr-distil-quantized.onnx.txt -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5           -topics msmarco-v1-passage-dev -encoder BgeBaseEn15                  -output runs/run.msmarco-passage.bge.onnx.txt                      -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5-quantized -topics msmarco-v1-passage-dev -encoder BgeBaseEn15                  -output runs/run.msmarco-passage.bge-quantized.onnx.txt            -threads 16 -efSearch 1000

target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25.txt
target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.splade-pp-ed.onnx.txt
target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.cos-dpr-distil.onnx.txt
target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.cos-dpr-distil-quantized.onnx.txt
target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bge.onnx.txt
target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bge-quantized.onnx.txt

target/appassembler/bin/SearchCollection       -index msmarco-v1-passage                            -topics dl19-passage                                       -output runs/run.dl19.bm25.txt                          -threads 16 -bm25
target/appassembler/bin/SearchCollection       -index msmarco-v1-passage-splade-pp-ed               -topics dl19-passage -encoder SpladePlusPlusEnsembleDistil -output runs/run.dl19.splade-pp-ed.onnx.txt             -threads 16 -impact -pretokenized
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil             -topics dl19-passage -encoder CosDprDistil                 -output runs/run.dl19.cos-dpr-distil.onnx.txt           -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil-quantized   -topics dl19-passage -encoder CosDprDistil                 -output runs/run.dl19.cos-dpr-distil-quantized.onnx.txt -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5           -topics dl19-passage -encoder BgeBaseEn15                  -output runs/run.dl19.bge.onnx.txt                      -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5-quantized -topics dl19-passage -encoder BgeBaseEn15                  -output runs/run.dl19.bge-quantized.onnx.txt            -threads 16 -efSearch 1000

target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.dl19.bm25.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.dl19.splade-pp-ed.onnx.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.dl19.cos-dpr-distil.onnx.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.dl19.cos-dpr-distil-quantized.onnx.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.dl19.bge.onnx.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl19-passage.txt runs/run.dl19.bge-quantized.onnx.txt

target/appassembler/bin/SearchCollection       -index msmarco-v1-passage                            -topics dl20-passage                                       -output runs/run.dl20.bm25.txt                          -threads 16 -bm25
target/appassembler/bin/SearchCollection       -index msmarco-v1-passage-splade-pp-ed               -topics dl20-passage -encoder SpladePlusPlusEnsembleDistil -output runs/run.dl20.splade-pp-ed.onnx.txt             -threads 16 -impact -pretokenized
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil             -topics dl20-passage -encoder CosDprDistil                 -output runs/run.dl20.cos-dpr-distil.onnx.txt           -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil-quantized   -topics dl20-passage -encoder CosDprDistil                 -output runs/run.dl20.cos-dpr-distil-quantized.onnx.txt -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5           -topics dl20-passage -encoder BgeBaseEn15                  -output runs/run.dl20.bge.onnx.txt                      -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5-quantized -topics dl20-passage -encoder BgeBaseEn15                  -output runs/run.dl20.bge-quantized.onnx.txt            -threads 16 -efSearch 1000

target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl20-passage.txt runs/run.dl20.bm25.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl20-passage.txt runs/run.dl20.splade-pp-ed.onnx.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl20-passage.txt runs/run.dl20.cos-dpr-distil.onnx.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl20-passage.txt runs/run.dl20.cos-dpr-distil-quantized.onnx.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl20-passage.txt runs/run.dl20.bge.onnx.txt
target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.dl20-passage.txt runs/run.dl20.bge-quantized.onnx.txt
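The search commands above follow a regular pattern (three topic sets by six index configurations), so they could be generated rather than typed out. The following is a minimal dry-run sketch that only prints the 18 search commands; the function name `print_onnx_runs` and the pipe-separated table layout are illustrative assumptions, not part of Anserini.

```shell
#!/bin/sh
# Dry run: prints the search commands from this comment instead of running them.
# Row format (illustrative): binary|index|encoder (empty for BM25)|run tag|extra flags
configs='SearchCollection|msmarco-v1-passage||bm25|-bm25
SearchCollection|msmarco-v1-passage-splade-pp-ed|SpladePlusPlusEnsembleDistil|splade-pp-ed.onnx|-impact -pretokenized
SearchHnswDenseVectors|msmarco-v1-passage-cos-dpr-distil|CosDprDistil|cos-dpr-distil.onnx|-efSearch 1000
SearchHnswDenseVectors|msmarco-v1-passage-cos-dpr-distil-quantized|CosDprDistil|cos-dpr-distil-quantized.onnx|-efSearch 1000
SearchHnswDenseVectors|msmarco-v1-passage-bge-base-en-v1.5|BgeBaseEn15|bge.onnx|-efSearch 1000
SearchHnswDenseVectors|msmarco-v1-passage-bge-base-en-v1.5-quantized|BgeBaseEn15|bge-quantized.onnx|-efSearch 1000'

print_onnx_runs() {
  # Each pair is "<topics name>:<run file prefix>".
  for pair in msmarco-v1-passage-dev:msmarco-passage dl19-passage:dl19 dl20-passage:dl20; do
    topics=${pair%%:*}
    prefix=${pair##*:}
    # Split each row on "|"; the -encoder flag is emitted only when the field is non-empty.
    while IFS='|' read -r bin index encoder tag flags; do
      echo "target/appassembler/bin/$bin -index $index -topics $topics${encoder:+ -encoder $encoder} -output runs/run.$prefix.$tag.txt -threads 16 $flags"
    done <<EOF
$configs
EOF
  done
}

print_onnx_runs
```

Piping the output to `sh` would execute the commands; printing first makes it easy to check them against the list above.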

Results:

                    dev    DL19     DL20
                 MRR@10  NDCG@10  NDCG@10
BM25             0.1840   0.5058   0.4796
SPLADE++ ED      0.3828   0.7308   0.7197
cos-DPR          0.3887   0.7250   0.7025
cos-DPR (int8)   0.3899   0.7247   0.6996
BGE              0.3575   0.7016   0.6768
BGE (int8)       0.3575   0.7017   0.6767

@lintool
Member Author

lintool commented Feb 26, 2024

Runs with pre-encoded queries:

target/appassembler/bin/SearchCollection       -index msmarco-v1-passage                            -topics msmarco-v1-passage-dev                  -output runs/run.msmarco-passage.bm25.txt                     -threads 16 -bm25
target/appassembler/bin/SearchCollection       -index msmarco-v1-passage-splade-pp-ed               -topics msmarco-v1-passage-dev-splade-pp-ed     -output runs/run.msmarco-passage.splade-pp-ed.txt             -threads 16 -impact -pretokenized
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil             -topics msmarco-v1-passage-dev-cos-dpr-distil   -output runs/run.msmarco-passage.cos-dpr-distil.txt           -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil-quantized   -topics msmarco-v1-passage-dev-cos-dpr-distil   -output runs/run.msmarco-passage.cos-dpr-distil-quantized.txt -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5           -topics msmarco-v1-passage-dev-bge-base-en-v1.5 -output runs/run.msmarco-passage.bge.txt                      -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5-quantized -topics msmarco-v1-passage-dev-bge-base-en-v1.5 -output runs/run.msmarco-passage.bge-quantized.txt            -threads 16 -efSearch 1000

target/appassembler/bin/SearchCollection       -index msmarco-v1-passage                            -topics dl19-passage                  -output runs/run.dl19.bm25.txt                     -threads 16 -bm25
target/appassembler/bin/SearchCollection       -index msmarco-v1-passage-splade-pp-ed               -topics dl19-passage-splade-pp-ed     -output runs/run.dl19.splade-pp-ed.txt             -threads 16 -impact -pretokenized
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil             -topics dl19-passage-cos-dpr-distil   -output runs/run.dl19.cos-dpr-distil.txt           -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil-quantized   -topics dl19-passage-cos-dpr-distil   -output runs/run.dl19.cos-dpr-distil-quantized.txt -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5           -topics dl19-passage-bge-base-en-v1.5 -output runs/run.dl19.bge.txt                      -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5-quantized -topics dl19-passage-bge-base-en-v1.5 -output runs/run.dl19.bge-quantized.txt            -threads 16 -efSearch 1000

target/appassembler/bin/SearchCollection       -index msmarco-v1-passage                            -topics dl20-passage                  -output runs/run.dl20.bm25.txt                     -threads 16 -bm25
target/appassembler/bin/SearchCollection       -index msmarco-v1-passage-splade-pp-ed               -topics dl20-passage-splade-pp-ed     -output runs/run.dl20.splade-pp-ed.txt             -threads 16 -impact -pretokenized
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil             -topics dl20-passage-cos-dpr-distil   -output runs/run.dl20.cos-dpr-distil.txt           -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-cos-dpr-distil-quantized   -topics dl20-passage-cos-dpr-distil   -output runs/run.dl20.cos-dpr-distil-quantized.txt -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5           -topics dl20-passage-bge-base-en-v1.5 -output runs/run.dl20.bge.txt                      -threads 16 -efSearch 1000
target/appassembler/bin/SearchHnswDenseVectors -index msmarco-v1-passage-bge-base-en-v1.5-quantized -topics dl20-passage-bge-base-en-v1.5 -output runs/run.dl20.bge-quantized.txt            -threads 16 -efSearch 1000
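This comment does not list evaluation commands. Assuming the same trec_eval invocations as in the ONNX comment (recip_rank with -M 10 on the dev subset, ndcg_cut.10 on DL19 and DL20), applied to the run files written above, they could be generated with a sketch like the following. The function name `print_eval_cmds` is hypothetical, and the script only prints the commands.

```shell
#!/bin/sh
# Dry run: prints trec_eval commands for the pre-encoded runs.
# Assumption: same metrics and qrels as in the ONNX comment earlier in this thread.
tags="bm25 splade-pp-ed cos-dpr-distil cos-dpr-distil-quantized bge bge-quantized"

print_eval_cmds() {
  # MS MARCO dev subset: MRR@10.
  for tag in $tags; do
    echo "target/appassembler/bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.${tag}.txt"
  done
  # TREC DL19 and DL20: NDCG@10.
  for year in dl19 dl20; do
    for tag in $tags; do
      echo "target/appassembler/bin/trec_eval -m ndcg_cut.10 -c tools/topics-and-qrels/qrels.${year}-passage.txt runs/run.${year}.${tag}.txt"
    done
  done
}

print_eval_cmds
```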

Results:

                    dev    DL19     DL20
                 MRR@10  NDCG@10  NDCG@10
BM25             0.1840   0.5058   0.4796
SPLADE++ ED      0.3830   0.7317   0.7198
cos-DPR          0.3887   0.7250   0.7025
cos-DPR (int8)   0.3897   0.7240   0.7004
BGE              0.3574   0.7065   0.6780
BGE (int8)       0.3572   0.7016   0.6738
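To see how closely the pre-encoded runs track the ONNX runs reported earlier in this thread, the two results tables can be subtracted cell by cell. The sketch below hardcodes the numbers from both comments (ONNX columns first, then pre-encoded); the row labels are rewritten without spaces for easy parsing, and the function name `compare_runs` is illustrative.

```shell
#!/bin/sh
# Columns: label, ONNX dev/DL19/DL20, pre-encoded dev/DL19/DL20 (values copied from the two comments).
results='BM25 0.1840 0.5058 0.4796 0.1840 0.5058 0.4796
SPLADE++_ED 0.3828 0.7308 0.7197 0.3830 0.7317 0.7198
cos-DPR 0.3887 0.7250 0.7025 0.3887 0.7250 0.7025
cos-DPR_int8 0.3899 0.7247 0.6996 0.3897 0.7240 0.7004
BGE 0.3575 0.7016 0.6768 0.3574 0.7065 0.6780
BGE_int8 0.3575 0.7017 0.6767 0.3572 0.7016 0.6738'

compare_runs() {
  printf '%s\n' "$results" | awk '
    BEGIN { max = 0; col[2] = "dev MRR@10"; col[3] = "DL19 NDCG@10"; col[4] = "DL20 NDCG@10" }
    {
      # Track the largest absolute difference (pre-encoded minus ONNX) across all cells.
      for (i = 2; i <= 4; i++) {
        d = $(i + 3) - $i
        a = (d < 0) ? -d : d
        if (a > max) { max = a; who = $1 " " col[i] }
      }
      printf "%-13s %+.4f %+.4f %+.4f\n", $1, $5 - $2, $6 - $3, $7 - $4
    }
    END { printf "largest gap: %.4f (%s)\n", max, who }
  '
}

compare_runs
```

With these numbers, BM25 and cos-DPR are identical across the two setups, and the largest gap is 0.0049 NDCG@10 (BGE on DL19).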

@lintool lintool merged commit 643af14 into master Feb 26, 2024
3 checks passed
@lintool lintool deleted the readable-topics branch February 26, 2024 09:39
2 participants