A parallel sentence preprocessing toolkit
The codebase uses python-fire
to have a flexible, pipelined CLI interface.
The module parasite.pipeline
implements CLI over all the basic concepts of the codebase.
We recommend using AlignedBiText from_files
for working with a single bi-text document or AlignedBiText batch_from_files
to work with multiple files.
Here is effect of using different components as part of preprocessing, filtering and monotonic alignments pipeline.
All the numbers represent the BLEU score on the WMT20 MEDLINE (local test) set for different data preprocessing configurations (and the exact same architecture and learning parameters).
Model | en → ru | ru → en |
---|---|---|
baseline configuration | 30.7 | 31.3 |
+ greedy alignments | 30.1 | 31.8 |
+ detect subsection names | 30.7 | 32.3 |
+ remove titles | 31.3 | 32.5 |
+ optimize total similarity | 30.4 | 32.2 |
+ normalize distance matrix | 30.8 | 32.1 |
+ penalize source/target ratio | 31.2 | 31.5 |
+ one-to-many (K=3) | 32.2 | 32.3 |
Here are training graphs on Aim for en → ru, averaged across different runs. The best configuration shows significantly better results. Play with experiments here:
To replicate our best submission (run 2) (WMT20 Biomedical Translation Task winner models for en-ru
language pair) preprocessing, please run:
python -m parasite.pipeline \
AlignedBiText batch_from_files /datasets/wmt20.biomed.ru-en.medline_train/raw_files/*_en.txt \
--suffix=".txt" --src-lang="en" --tgt-lang="ru" \
- apply segmenter reset \
- apply segmenter scispacy --only-src \
- apply segmenter razdel --only-tgt \
- apply segmenter remove-title --only-tgt --blacklist='Резюме' \
- apply segmenter keyword --only-src --path='examples/medline_keywords/eng_few.txt' \
- apply segmenter keyword --only-tgt --path='examples/medline_keywords/rus_few.txt' \
- apply segmenter remove-title --only-src \
- apply encoder pretrained-transformer "xlm-roberta-large" \
--normalize=2 --force-lowercase --normalize-length=avg --fp16 \
- apply aligner greedy-one2one --distance=euclidean \
--progress \
- split --mapping-path="examples/wmt20.biomed.ru-en.medline_train.yerevann.splits.txt" \
- to_files --output-dir="/datasets/wmt20.biomed.ru-en.medline_train/preprocessed_files"
In order to replicate our overall best preprocessing (not submitted, described in the paper), you can run:
python -m parasite.pipeline \
AlignedBiText batch_from_files /datasets/wmt20.biomed.ru-en.medline_train/raw_files/*_en.txt \
--suffix=".txt" --src-lang="en" --tgt-lang="ru" \
- apply segmenter reset \
- apply segmenter syntok \
- apply segmenter remove-title --only-tgt --blacklist='Резюме' \
- apply segmenter keyword --only-src --path='examples/medline_keywords/eng.txt' \
- apply segmenter keyword --only-tgt --path='examples/medline_keywords/rus.txt' \
- apply segmenter remove-title --only-src \
- apply encoder pretrained-transformer "xlm-roberta-large" \
--encode-windows=3 --normalize-length=avg --fp16 \
- apply aligner dynamic \
--max-k=3 --penalty-ratio=2 --distance=euclidean --normalize=True \
--progress \
- split --mapping-path="examples/wmt20.biomed.ru-en.medline_train.yerevann.splits.txt" \
- to_files --output-dir="/datasets/wmt20.biomed.ru-en.medline_train/preprocessed_files"
In order to cite our work, please consider the following BibTeX:
@inproceedings{hambardzumyan-etal-2020-yerevanns,
title = "{Y}ereva{NN}{'}s Systems for {WMT}20 Biomedical Translation Task: The Effect of Fixing Misaligned Sentence Pairs",
author = "Hambardzumyan, Karen and
Tamoyan, Hovhannes and
Khachatrian, Hrant",
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.wmt-1.88",
pages = "820--825",
}