Change sw and te tokenizers for Mr.TyDi (#1727)
Note that we do not have language-specific tokenizers for sw and te, so we fall back to whitespace
tokenization, which behaves exactly the same as -pretokenized. Using -language xx is conceptually cleaner, but doesn't change results.
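
The equivalence described above can be illustrated with a minimal Python sketch. This is not Anserini's code: the function names and the set of languages with dedicated tokenizers are hypothetical, chosen only to show why a language without its own tokenizer yields the same tokens as pretokenized input.

```python
from typing import List, Optional

# Hypothetical set, for this sketch only: languages assumed to have a
# dedicated tokenizer. "sw" and "te" are deliberately absent, per the
# commit message.
LANGUAGES_WITH_TOKENIZER = {"xx-example"}


def whitespace_tokenize(text: str) -> List[str]:
    """Split on runs of whitespace; no stemming or stopword removal."""
    return text.split()


def tokenize(text: str, language: Optional[str] = None,
             pretokenized: bool = False) -> List[str]:
    """Pick a tokenizer; fall back to whitespace when none is available."""
    if pretokenized or language not in LANGUAGES_WITH_TOKENIZER:
        return whitespace_tokenize(text)
    # A real system would dispatch to a language-specific analyzer here.
    raise NotImplementedError("language-specific tokenizers are out of scope")


if __name__ == "__main__":
    query = "habari ya  asubuhi"
    by_language = tokenize(query, language="sw")
    by_flag = tokenize(query, pretokenized=True)
    print(by_language)             # ['habari', 'ya', 'asubuhi']
    print(by_language == by_flag)  # True
```

Under these assumptions both code paths reach the same whitespace splitter, which is consistent with the regression results staying unchanged.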
lintool authored Jan 10, 2022
1 parent ed88453 commit cee7dfc
Showing 4 changed files with 12 additions and 12 deletions.
8 changes: 4 additions & 4 deletions docs/regressions-mrtydi-v1.1-sw.md
@@ -15,7 +15,7 @@ target/appassembler/bin/IndexCollection \
-input /path/to/mrtydi-v1.1-sw \
-index indexes/lucene-index.mrtydi-v1.1-swahili/ \
-generator DefaultLuceneDocumentGenerator \
- -threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \
+ -threads 1 -storePositions -storeDocvectors -storeRaw -language sw \
>& logs/log.mrtydi-v1.1-sw &
```

@@ -31,17 +31,17 @@ target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.mrtydi-v1.1-swahili/ \
-topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-sw.train.txt.gz -topicreader TsvInt \
-output runs/run.mrtydi-v1.1-sw.bm25.topics.mrtydi-v1.1-sw.train.txt.gz \
- -bm25 -hits 100 -pretokenized &
+ -bm25 -hits 100 -language sw &
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.mrtydi-v1.1-swahili/ \
-topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-sw.dev.txt.gz -topicreader TsvInt \
-output runs/run.mrtydi-v1.1-sw.bm25.topics.mrtydi-v1.1-sw.dev.txt.gz \
- -bm25 -hits 100 -pretokenized &
+ -bm25 -hits 100 -language sw &
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.mrtydi-v1.1-swahili/ \
-topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-sw.test.txt.gz -topicreader TsvInt \
-output runs/run.mrtydi-v1.1-sw.bm25.topics.mrtydi-v1.1-sw.test.txt.gz \
- -bm25 -hits 100 -pretokenized &
+ -bm25 -hits 100 -language sw &
```

Evaluation can be performed using `trec_eval`:
8 changes: 4 additions & 4 deletions docs/regressions-mrtydi-v1.1-te.md
@@ -15,7 +15,7 @@ target/appassembler/bin/IndexCollection \
-input /path/to/mrtydi-v1.1-te \
-index indexes/lucene-index.mrtydi-v1.1-telugu/ \
-generator DefaultLuceneDocumentGenerator \
- -threads 1 -storePositions -storeDocvectors -storeRaw -pretokenized \
+ -threads 1 -storePositions -storeDocvectors -storeRaw -language te \
>& logs/log.mrtydi-v1.1-te &
```

@@ -31,17 +31,17 @@ target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.mrtydi-v1.1-telugu/ \
-topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-te.train.txt.gz -topicreader TsvInt \
-output runs/run.mrtydi-v1.1-te.bm25.topics.mrtydi-v1.1-te.train.txt.gz \
- -bm25 -hits 100 -pretokenized &
+ -bm25 -hits 100 -language te &
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.mrtydi-v1.1-telugu/ \
-topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-te.dev.txt.gz -topicreader TsvInt \
-output runs/run.mrtydi-v1.1-te.bm25.topics.mrtydi-v1.1-te.dev.txt.gz \
- -bm25 -hits 100 -pretokenized &
+ -bm25 -hits 100 -language te &
target/appassembler/bin/SearchCollection \
-index indexes/lucene-index.mrtydi-v1.1-telugu/ \
-topics src/main/resources/topics-and-qrels/topics.mrtydi-v1.1-te.test.txt.gz -topicreader TsvInt \
-output runs/run.mrtydi-v1.1-te.bm25.topics.mrtydi-v1.1-te.test.txt.gz \
- -bm25 -hits 100 -pretokenized &
+ -bm25 -hits 100 -language te &
```

Evaluation can be performed using `trec_eval`:
4 changes: 2 additions & 2 deletions src/main/resources/regression/mrtydi-v1.1-sw.yaml
@@ -6,7 +6,7 @@ index_path: indexes/lucene-index.mrtydi-v1.1-swahili/
collection_class: MrTyDiCollection
generator_class: DefaultLuceneDocumentGenerator
index_threads: 1
- index_options: -storePositions -storeDocvectors -storeRaw -pretokenized
+ index_options: -storePositions -storeDocvectors -storeRaw -language sw
index_stats:
documents: 136689
documents (non-empty): 136689
@@ -48,7 +48,7 @@ topics:
models:
- name: bm25
display: BM25
- params: -bm25 -hits 100 -pretokenized
+ params: -bm25 -hits 100 -language sw
results:
MRR@100:
- 0.2610
4 changes: 2 additions & 2 deletions src/main/resources/regression/mrtydi-v1.1-te.yaml
@@ -6,7 +6,7 @@ index_path: indexes/lucene-index.mrtydi-v1.1-telugu/
collection_class: MrTyDiCollection
generator_class: DefaultLuceneDocumentGenerator
index_threads: 1
- index_options: -storePositions -storeDocvectors -storeRaw -pretokenized
+ index_options: -storePositions -storeDocvectors -storeRaw -language te
index_stats:
documents: 548224
documents (non-empty): 548224
@@ -48,7 +48,7 @@ topics:
models:
- name: bm25
display: BM25
- params: -bm25 -hits 100 -pretokenized
+ params: -bm25 -hits 100 -language te
results:
MRR@100:
- 0.2847