Subword Neural Machine Translation
==================================

This repository contains preprocessing scripts to segment text into subword
units. The primary purpose is to facilitate the reproduction of our experiments
on Neural Machine Translation with subword units (see below for reference).

INSTALLATION
------------

Clone or copy this repository and follow the usage instructions below.

For an installable package, see https://github.com/rsennrich/subword-nmt/tree/package


USAGE INSTRUCTIONS
------------------

Check the individual files for usage instructions.

To segment words with byte pair encoding, invoke these commands:

    ./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
    ./apply_bpe.py -c {codes_file} < {test_file}
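
At its core, the learning step is small: repeatedly count adjacent symbol pairs in the vocabulary and merge the most frequent pair. Below is a minimal Python sketch adapted from the toy implementation in the paper cited under PUBLICATIONS; `learn_bpe.py` adds I/O handling and further refinements on top of this idea.

    import re, collections

    def get_stats(vocab):
        """Count the frequency of each adjacent symbol pair."""
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq
        return pairs

    def merge_vocab(pair, v_in):
        """Replace every occurrence of `pair` with the merged symbol."""
        v_out = {}
        bigram = re.escape(' '.join(pair))
        p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        for word in v_in:
            v_out[p.sub(''.join(pair), word)] = v_in[word]
        return v_out

    # Toy vocabulary: words split into characters, with an end-of-word marker.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
    for i in range(10):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print(best)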

To segment rare words into character n-grams, do the following:

    ./get_vocab.py < {train_file} > {vocab_file}
    ./segment-char-ngrams.py --vocab {vocab_file} -n {order} --shortlist {size} < {test_file}
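
The vocabulary file is a plain frequency list with one `word count` pair per line. As a rough sketch (the script itself may differ in details), `get_vocab.py` computes something like the following:

    import sys
    from collections import Counter

    # Count whitespace-separated tokens on stdin and print one
    # "word frequency" pair per line, most frequent first.
    counts = Counter(tok for line in sys.stdin for tok in line.split())
    for word, freq in counts.most_common():
        print(word, freq)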

The original segmentation can be restored with a simple replacement:

    sed -r 's/(@@ )|(@@ ?$)//g'
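
If GNU `sed` is unavailable, the same restoration can be done with a few lines of Python; this simply strips the `@@` continuation markers that the segmentation scripts insert:

    import re, sys

    # Undo subword segmentation by removing the '@@ ' continuation markers.
    for line in sys.stdin:
        sys.stdout.write(re.sub(r'(@@ )|(@@ ?$)', '', line))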


BEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT
--------------------------------------------------

We found that for languages that share an alphabet, learning BPE on the
concatenation of the (two or more) involved languages increases the consistency
of segmentation, and reduces the problem of inserting/deleting characters when
copying/transliterating names.

However, this introduces undesirable edge cases in that a word may be segmented
in a way that has only been observed in the other language, and is thus unknown
at test time. To prevent this, `apply_bpe.py` accepts a `--vocabulary` and a
`--vocabulary-threshold` option so that the script will only produce symbols
which also appear in the vocabulary (with at least some frequency).
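
Conceptually, the filter is simple: a subword may only be produced if it occurs in the given vocabulary with at least the threshold frequency; anything else is segmented further into smaller units. Here is a minimal sketch of loading such a vocabulary file, assuming the `word count` format written by `get_vocab.py` (illustrative, not the script's actual internals):

    def read_vocabulary(path, threshold):
        """Keep only subwords whose training frequency meets the threshold."""
        vocab = set()
        with open(path) as f:
            for line in f:
                word, freq = line.split()
                if int(freq) >= threshold:
                    vocab.add(word)
        return vocab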

To use this functionality, we recommend the following recipe (assuming L1 and L2
are the two languages):

Learn byte pair encoding on the concatenation of the training text, and get the
resulting vocabulary for each language:

    cat {train_file}.L1 {train_file}.L2 | ./learn_bpe.py -s {num_operations} -o {codes_file}
    ./apply_bpe.py -c {codes_file} < {train_file}.L1 | ./get_vocab.py > {vocab_file}.L1
    ./apply_bpe.py -c {codes_file} < {train_file}.L2 | ./get_vocab.py > {vocab_file}.L2

More conveniently, you can do the same with this command:

    ./learn_joint_bpe_and_vocab.py --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2

Re-apply byte pair encoding with the vocabulary filter:

    ./apply_bpe.py -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1
    ./apply_bpe.py -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2

As a last step, extract the vocabulary to be used by the neural network. Example with Nematus:

    nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2

(You may want to take the union of all vocabularies to support multilingual systems.)

For test/dev data, re-use the same options for consistency:

    ./apply_bpe.py -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1


PUBLICATIONS
------------

The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016):
    Neural Machine Translation of Rare Words with Subword Units
    Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

ACKNOWLEDGMENTS
---------------

This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).