TWB neural machine translation pipeline
This repository contains the scripts needed to train and evaluate a neural machine translation system from scratch. It serves both as a template and as a record of experiments for various language pairs within TWB's Gamayun initiative.
The master branch is kept as a template for training a toy Tigrinya-to-English model. Make sure it works before building your own pipeline.
Development for each language pair is kept in separate branches:
- English -> Tigrinya

The pipeline relies on the following tools and libraries:
- OpenNMT-py
- Moses
- subword-nmt
- mt-tools
- pandas
The three main directories used in the pipeline are as follows:
- scripts contains the scripts for preparing the data, training models and running evaluation
- corpora (created within scripts) is where datasets containing parallel sentences are downloaded and then processed
- onmt (created within scripts) contains the models, logs, OpenNMT-specific datasets, inference results and evaluation scores
Data processing steps are made visible through the use of suffixes: each dataset gains a new suffix after a given pipeline call. The naming convention during processing is as follows:
/corpora/<corpus-name>/<dataset>.<set>.<process1>.<process2>...<processX>.<lang-code>
For example:
/corpora/OPUS/Tatoeba.train.tok.low.en
contains the English sentences in the training portion of the Tatoeba set downloaded from OPUS, tokenized and then lowercased.
Each script consists of three sections:
- INITIALIZATIONS: Sets the paths for later use. This does not need to be edited if the directory structure is kept as it is.
- PROCEDURES: Contains the functions needed for that pipeline step.
- CALLS: Where the pipeline's procedures are called. Each call consists of a set of parameters being specified, followed by a procedure call.
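As a rough sketch, a script following this layout might look like the example below; the variable and procedure names are illustrative, not taken from the repository:

```bash
#!/usr/bin/env bash

### INITIALIZATIONS ###
CORPORADIR=corpora            # paths reused by the procedures below (illustrative)
ONMTDIR=onmt

### PROCEDURES ###
process_corpus() {            # hypothetical procedure for this pipeline step
    echo "Processing $CORPORADIR/$CORPUS/$C"
}

### CALLS ###
CORPUS="OPUS"                 # parameters for this call...
C="Tatoeba"
process_corpus                # ...followed by the procedure call
```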
The scripts to run are kept under the scripts directory with a number prefix. The scripts do not take any command-line parameters; instead, they are edited directly. To run a particular script, type:
bash #-<script-name>.sh
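For instance, the multilingual training script mentioned below would be run as:

```bash
bash 7a-train-multilingual.sh
```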
The download step fetches parallel datasets. The download function is set to work with OPUS links by default. Parameters to specify for each call:
- LINK: URL to corpus
- CORPUS: Subdirectory where dataset is kept under corpora
This will place the parallel files, named <corpus-name>.<lang1> and <corpus-name>.<lang2>, in a directory named after the corpus under corpora. If you skip this step, make sure you follow a similar file path and naming pattern.
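A sketch of a call for this step, assuming a CALLS block of variable assignments followed by a procedure call; the URL and procedure name below are illustrative, not taken from the repository:

```bash
LINK="https://object.pouta.csc.fi/OPUS-Tatoeba/<version>/moses/en-ti.txt.zip"  # illustrative OPUS-style link
CORPUS="OPUS"                  # files end up under corpora/OPUS
download_corpus                # hypothetical procedure name
```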
The normalization step calls various text normalization scripts. For each call, specify:
- CORPUS: Subdirectory where dataset is kept
- C: Corpus name
- LANGS: Language extensions of the parallel files (lang1, lang2)
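A hypothetical call for the Tatoeba example used above (the procedure name and the ti language code for Tigrinya are assumptions):

```bash
CORPUS="OPUS"
C="Tatoeba"
LANGS="en ti"          # language extensions of the parallel files
normalize_corpus       # hypothetical procedure name
```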
The cleaning step removes empty and noisy samples, allocates a development set and excludes test samples from a given dataset. The actual processing is done in data_prep.py. For each call, specify:
- CORPUS: Subdirectory where dataset is kept
- C: Corpus name
- SRC: Source language code
- TGT: Target language code
- SUFFIX: Complete process suffix that the dataset to be processed has
- EXCLUDESET: Set of sentences that need to be taken out from the train/dev portion
- EXCLUDEFROM: Side of the exclude set, as src or tgt
The size of the development set is specified in data_prep.py.
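A sketch of one call with placeholder values; the procedure name, suffix and exclude-set path are assumptions:

```bash
CORPUS="OPUS"
C="Tatoeba"
SRC="en"
TGT="ti"
SUFFIX=""                           # process suffix the dataset carries at this point (assumed none yet)
EXCLUDESET="corpora/test/heldout"   # hypothetical path to sentences that must not appear in train/dev
EXCLUDEFROM="src"                   # exclude matches on the source side
clean_corpus                        # hypothetical procedure name
```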
The tokenization step uses the Moses tokenizer, or any other tokenization script, for punctuation tokenization. Set MOSESDIR to where the Moses decoder is kept. If a special tokenizer is to be used, it is specified with SRCTOKENIZER or TGTTOKENIZER.
For each dataset to be tokenized, specify the following:
- CORPUS: Subdirectory where dataset is kept
- C: Corpus name
- SETS: Subsets (train/test/dev); if this is used, the tokenize_set procedure needs to be called
- SUFFIX: Complete process suffix that the dataset to be processed has
- SRC: Source language code
- TGT: Target language code
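For example, tokenizing all subsets of the Tatoeba data could be specified along these lines (the suffix value is an assumption about the preceding steps):

```bash
CORPUS="OPUS"
C="Tatoeba"
SETS="train dev test"
SUFFIX=""              # process suffix the dataset carries before tokenization (assumed none here)
SRC="en"
TGT="ti"
tokenize_set           # per-set tokenization procedure mentioned above
```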
The BPE training step learns byte-pair-encoding (BPE) tokens from a given set and stores them under onmt/bpe, using the subword-nmt library. This should be done on one big set, with the parameters:
- CORPUS: Subdirectory where dataset is kept
- C: Corpus name
- SRC: Source language code for BPE training
- TGT: Target language code for BPE training
- BPEID: ID of the BPE model
- OPS: Number of BPE operations
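A hypothetical parameter block for BPE training; the ID, number of operations and procedure name are placeholders:

```bash
CORPUS="OPUS"
C="Tatoeba"
SRC="en"
TGT="ti"
BPEID="enti"           # illustrative BPE model ID
OPS=10000              # illustrative number of merge operations
learn_bpe              # hypothetical procedure name
```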
The BPE application step applies BPE tokenization to the datasets. This needs to be done for all train/dev/test sets.
Parameters related to the BPE model:
- BPEID: ID of the BPE model with the number of operations
- BPESRC: Source language code used for BPE training
- BPETGT: Target language code used for BPE training
For datasets to be bpe-ized:
- CORPUS: Subdirectory where dataset is kept
- C: Corpus name
- SUFFIX: Complete process suffix that the dataset to be processed has
- SETS: Subsets (train/test/dev); if this is used, the apply_bpe_to_sets procedure needs to be called
- SRC: Source language code
- TGT: Target language code
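Continuing the example, applying the BPE model to all subsets might be specified as follows (the IDs are placeholders; the suffix follows the naming example above):

```bash
BPEID="enti-10000"     # illustrative ID including the number of operations
BPESRC="en"
BPETGT="ti"

CORPUS="OPUS"
C="Tatoeba"
SUFFIX="tok.low"       # suffix after tokenization and lowercasing, as in the naming example
SETS="train dev test"
SRC="en"
TGT="ti"
apply_bpe_to_sets      # per-set application procedure mentioned above
```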
This converts the prepared training datasets into the format that OpenNMT takes as input. For each dataset, both a training and a development set need to be specified.
- CORPUSTRAIN: Path to training set without language suffix
- CORPUSDEV: Path to development set without language suffix
- SRCTRAIN: Source language code for training set
- TGTTRAIN: Target language code for training set
- SRCDEV: Source language code for development set
- TGTDEV: Target language code for development set
- DATASET: Name for the training dataset
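A sketch of the parameters for this step, reusing the running example; the paths, dataset name and procedure name are placeholders based on the naming convention above:

```bash
CORPUSTRAIN="corpora/OPUS/Tatoeba.train.tok.low.bpe"   # illustrative processed path, no language suffix
CORPUSDEV="corpora/OPUS/Tatoeba.dev.tok.low.bpe"
SRCTRAIN="en"
TGTTRAIN="ti"
SRCDEV="en"
TGTDEV="ti"
DATASET="tatoeba-enti"         # illustrative name for the preprocessed dataset
preprocess_dataset             # hypothetical procedure name
```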
There are three training scripts: 7a-train-multilingual.sh, 7b-train-unilingual.sh and 7c-train-indomain.sh, used for three-stage training in low-resource settings. For each training script, set:
- MODELPREFIX: A label for the model
- MODELID: An ID for the model
- DATASET: Preprocessed training dataset name
- BPEID: ID of the BPE model with the number of operations
Scripts 7b and 7c continue training from the best-scoring model of the previous step. These additional parameters need to be set:
- BASEMODELTYPE: Type of the base model (multilingual, unilingual or indomain)
- BASEMODELID: ID of the base model
Training parameters are specified in the do_train procedure, which calls OpenNMT-py's training script.
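As an illustration, a first-stage (7a) training call might set the following; the values are placeholders:

```bash
MODELPREFIX="enti"             # illustrative label
MODELID="001"                  # illustrative ID
DATASET="tatoeba-enti"         # name given during OpenNMT preprocessing
BPEID="enti-10000"             # BPE model ID with the number of operations
do_train                       # procedure that wraps OpenNMT-py training
```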
Once training is complete, the best-scoring model from each step can be found under: onmt/models/<model-prefix>-<model-type>-<model-id>/<model-prefix>-<model-type>-<model-id>_best.pt
This step translates the test sets using the trained models. For each call, specify the following parameters:
- MODELPREFIX: Label of the model to use for inference
- MODELTYPE: Type of the model to use for inference
- MODELID: ID of the model to use for inference
- BPEID: ID of the BPE model with the number of operations
- CORPUSTEST: Path to the test set without the language suffix
- SRC: Source language code
- TGT: Target language code
Inference results (direct and un-BPE'd) are saved under the path onmt/test.
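Putting it together, a hypothetical inference call could look like this (the values are placeholders and the procedure name is an assumption):

```bash
MODELPREFIX="enti"
MODELTYPE="multilingual"
MODELID="001"
BPEID="enti-10000"
CORPUSTEST="corpora/OPUS/Tatoeba.test.tok.low.bpe"   # illustrative processed test path, no language suffix
SRC="en"
TGT="ti"
translate_testset              # hypothetical procedure name
```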
This final step calculates various MT evaluation metrics on the inferred translations. The evaluation scripts are accessed from mt-tools.
For each call, specify the following:
- MODELPREFIX: Label of the model to use for inference
- MODELTYPE: Type of the model to use for inference
- MODELID: ID of the model to use for inference
- BPEID: ID of the BPE model used
- CORPUSTEST: Path to the test set without the language suffix
- SRC: Source language code
- TGT: Target language code
Evaluation results are printed and also stored under the path onmt/eval.