This time the topic of our new DeepHack hackathon is machine translation (MT). At the hackathon we will tackle the task of semi-supervised neural MT: we'll try to improve a neural MT model with monolingual data. However, our qualification task is traditional supervised MT; it serves to familiarise participants with the MT field.
For this task we have decided to pick a well-studied language pair (English-German) and put no restrictions on the data used or the model architecture.
| # | Score | Description | Image |
|---|---|---|---|
| 1 | 0.17709 | OpenNMT-py with the demo.rnn__acc_47.64_ppl_20.48_e13-fixed.pt model | aoboturov/deephack |
| 2 | 0.23514 | OpenNMT with the onmt_baseline_wmt15-all.en-de_epoch13_7.19_release.t7 model | aoboturov/deephack-v2 |
| 3 | 0.21108 | OpenNMT with the standard_full_model_epoch4_4.34_release.t7 model | aoboturov/deephack-v3 |
| 4 | 0.24765 | OpenNMT with the standard_full_model_epoch9_3.65_release.t7 model | aoboturov/deephack-v4 |
| 5 | 0.25349 | OpenNMT with the standard_full_model_epoch13_3.51_release.t7 model | aoboturov/deephack-v5 |
| 6 | 0.13219 | Fairseq-py with the wmt14.en-de.fconv-py.tar.bz2 model | aoboturov/deephack-v6 |
| 7 | TLE | Tensorflow NMT with the ende_gnmt_model_8_layer model | aoboturov/deephack-v7 |
| 8 | 0.07289 | Seq2Seq small model with computational budget 325000 | aoboturov/deephack-v8 |
| 9 | 0.07624 | Seq2Seq small model with computational budget 667001 | aoboturov/deephack-v9 |
| 10 | TLE | Nematus pretrained single model | kwakinalabs/abjskllld-v1 |
| 11 | TLE | T2T default single model | kwakinalabs/cfbfgdfds-v1 |
| 12 | 0.18967 | Sockeye default model trained with custom data | kwakinalabs/fkskskkas-v1 |
| 13 | 0.27924 | MarianNMT default single model | kwakinalabs/slkkajjss-v1 |
The task is the translation of IT-domain texts from English into German, in line with the WMT'16 IT translation task: http://www.statmt.org/wmt16/it-translation-task.html
There is no restriction on training data for the task. You can use the data provided for the WMT'16 IT translation task or any other datasets. The test data consists of previously unseen texts in the IT domain.
We will evaluate submissions with the BLEU score.
BLEU measures how many words and n-grams (sequences of n consecutive words) overlap between a given translation and a reference translation. The most commonly used version is BLEU-4, which considers words, bigrams, trigrams and 4-grams. It also applies a brevity penalty to translations that are too short.
For local validation you can use the reference implementation of BLEU from MOSES or its Python version from Google.
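For example, with the Moses reference implementation (multi-bleu.perl from the mosesdecoder repository; the file names here are illustrative, with reference.de the reference translation and output.txt your system output):
perl mosesdecoder/scripts/generic/multi-bleu.perl reference.de < output.txt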
Your solution should be a Docker image which mounts two folders, input and output, and is run with the following command:
docker run -v /path/to/input_data:/data -v /path/to/output:/output -t {image} {entry_point}
The submission must be a zip archive containing a metadata.json file. metadata.json should contain the following fields (an example follows the list):
- entry_point - command to run Docker container with
- image - Docker repo name
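For example (the values are illustrative: the image matches the one built below, and ./run.sh is an assumed entry point, so use whatever command starts translation inside your container):
{
  "entry_point": "./run.sh",
  "image": "aoboturov/deephack"
}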
The input folder contains a file input.txt with the test sentences in the source language (the sentences that the model should translate). The model should write the translated sentences to output.txt in the output folder.
We provide a sample solution. The sample submission is included in this repository as sample_submission.zip.
Read Tips to Reduce Docker Image Sizes.
docker build -t aoboturov/deephack .
docker login
docker push aoboturov/deephack
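Before submitting, the image can be tested locally against the input/output contract described above (the paths and the entry point are illustrative):
mkdir -p /tmp/input /tmp/output
echo "Select the text you want to translate ." > /tmp/input/input.txt
docker run -v /tmp/input:/data -v /tmp/output:/output -t aoboturov/deephack ./run.sh
cat /tmp/output/output.txt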
Built from the standard OpenNMT-py with a demo model. The model has to be fixed because the serialization format has changed.
zip aoboturov-openmnt-demo-model.zip metadata.json run.sh
Data preprocessing takes about 40 mins for the small vocabulary size. The 50k vocabulary runs in about the same time.
cut -f 3 gazetteerDE.tsv | sed 's/^"*//;s/"*$//' > gazetteer.en
cut -f 4 gazetteerDE.tsv | sed 's/^"*//;s/"*$//' > gazetteer.de
cat train_src/* > train.en
cat train_tgt/* > train.de
cat train.en | grep "^$" | wc -l
cat train.de | grep "^$" | wc -l
python3 ../data_cleaner.py train.en train.de
python3 preprocess.py -train_src data/train.en -train_tgt data/train.de -valid_src wmt16-it-task-references/Batch3a_en.txt -valid_tgt wmt16-it-task-references/Batch3a_de.txt -save_data data/out -src_vocab_size 1000 -tgt_vocab_size 1000 -max_shard_size 67108864
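The data_cleaner.py script called above is not included here; presumably it drops sentence pairs where either side is empty (hence the empty-line counts above). A rough shell equivalent of that behaviour, assuming sentences contain no tab characters:
paste train.en train.de | awk -F'\t' '$1 != "" && $2 != ""' > train.paired.tsv
cut -f 1 train.paired.tsv > train.en
cut -f 2 train.paired.tsv > train.de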
Here's a tutorial
th tools/tokenize.lua < data/train.en > data/train.en.tok
th tools/tokenize.lua < data/train.de > data/train.de.tok
th tools/tokenize.lua < wmt16-it-task-references/Batch3a_en.txt > data/valid.en.tok
th tools/tokenize.lua < wmt16-it-task-references/Batch3a_de.txt > data/valid.de.tok
th preprocess.lua -train_src data/train.en.tok -train_tgt data/train.de.tok -valid_src data/valid.en.tok -valid_tgt data/valid.de.tok -save_data data/standard_full
nohup th train.lua -data data/standard_full-train.t7 -save_model data/standard_full_model -gpuid 1 &
nvidia-smi
top
tail -f nohup.out
Once trained, the model has to be released (this allows it to be run on the CPU):
th tools/release_model.lua -model data/standard_full_model_epoch4_4.34.t7 -gpuid 1
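A released model can then be used for translation on the CPU, for example (the input and output file names are illustrative; the input must be tokenized first):
th translate.lua -model data/standard_full_model_epoch13_3.51_release.t7 -src data/input.en.tok -output data/output.de.tok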
To resume stopped training, one could use:
nohup th train.lua -data data/standard_full-train.t7 -save_model data/standard_full_model -gpuid 1 -train_from data/standard_full_model_epoch10_3.58.t7 -continue &
Training Sequence to Sequence with Attention model...
Loading data from 'data/standard_full-train.t7'...
- vocabulary size: source = 50004; target = 50004
- additional features: source = 0; target = 0
- maximum sequence length: source = 50; target = 51
- number of training sentences: 4341009
- number of batches: 67852
- source sequence lengths: equal
- maximum size: 64
- average size: 63.98
- capacity: 100.00%
Building model...
- Encoder:
- word embeddings size: 500
- type: unidirectional RNN
- structure: cell = LSTM; layers = 2; rnn_size = 500; dropout = 0.3 (naive)
- Decoder:
- word embeddings size: 500
- attention: global (general)
- structure: cell = LSTM; layers = 2; rnn_size = 500; dropout = 0.3 (naive)
- Bridge: copy
Initializing parameters...
- number of parameters: 84822004
Preparing memory optimization...
- sharing 69% of output/gradInput tensors memory between clones
Start training from epoch 1 to 13...
Epoch | Validation perplexity | File name |
---|---|---|
1 | 5.52 | data/standard_full_model_epoch1_5.52.t7 |
2 | 4.73 | data/standard_full_model_epoch2_4.73.t7 |
3 | 4.54 | data/standard_full_model_epoch3_4.54.t7 |
4 | 4.34 | data/standard_full_model_epoch4_4.34.t7 |
5 | 4.46 | data/standard_full_model_epoch5_4.46.t7 |
6 | 3.96 | data/standard_full_model_epoch6_3.96.t7 |
7 | 3.83 | data/standard_full_model_epoch7_3.83.t7 |
8 | 3.65 | data/standard_full_model_epoch8_3.65.t7 |
9 | 3.65 | data/standard_full_model_epoch9_3.65.t7 |
10 | 3.58 | data/standard_full_model_epoch10_3.58.t7 |
11 | 3.54 | data/standard_full_model_epoch11_3.54.t7 |
12 | 3.55 | data/standard_full_model_epoch12_3.55.t7 |
13 | 3.51 | data/standard_full_model_epoch13_3.51.t7 |
Several epoch checkpoints can be averaged into a single model with:
th tools/average_models.lua -models data/standard_full_model_epoch14_3.51.t7 data/standard_full_model_epoch15_3.51.t7 data/standard_full_model_epoch16_3.50.t7 data/standard_full_model_epoch17_3.49.t7 -force -gpuid 1
No default model was provided, so we had to train the model ourselves.
Preprocessing is done with the custom script seq2seq-preprocessing/babel_en_de.sh. Once it has finished, the model can be trained with the seq2seq-preprocessing/babel_training.sh script.
Training progress can be monitored with the usual tools and in TensorBoard:
nohup ./babel_en_de.sh &
nvidia-smi
top
tail -f nohup.out
tensorboard --logdir $MODEL_DIR
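babel_training.sh is a custom wrapper and is not shown here; presumably it calls the google/seq2seq training entry point roughly as in the toolkit's NMT tutorial, with the computational budget passed as the number of training steps. Everything below is an assumption: the config names, the paths, and the elided data and vocabulary flags.
python -m bin.train \
  --config_paths="example_configs/nmt_small.yml,example_configs/train_seq2seq.yml" \
  --train_steps 325000 \
  --output_dir $MODEL_DIR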
Computational Budget | Validation Score |
---|---|
325000 | 0.07289 |
447001 | 0.07518 |
667001 | 0.07624 |
The checkpoint for budget 447001 is model.ckpt-447001.meta.
Model training takes much more than two weeks, so there is a very strong incentive to use a pre-trained model.
The original Facebook AI Research Sequence-to-Sequence Toolkit would likely OOM because of memory restrictions of the Lua VM.
The default pre-trained model has some problems.
To debug the checkpoint one could use:
python3 -m tensorflow.python.tools.inspect_checkpoint --all_tensor_names --file_name checkpoint.ckpt
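For the Tensorflow NMT entry (ende_gnmt_model_8_layer), inference goes through the tensorflow/nmt CLI roughly as below. The hparams file name, the paths and the vocabulary prefix are assumptions based on the nmt README, and the input has to be tokenized and BPE-encoded with the model's vocabulary first:
python -m nmt.nmt \
  --src=en --tgt=de \
  --ckpt=ende_gnmt_model_8_layer/translate.ckpt \
  --hparams_path=nmt/standard_hparams/wmt16_gnmt_8_layer.json \
  --vocab_prefix=wmt16/vocab.bpe.32000 \
  --out_dir=/tmp/ende_gnmt \
  --inference_input_file=input.bpe.en \
  --inference_output_file=output.bpe.de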
One can use a script to run inference for a single model. All the pre-trained systems can be obtained from the University of Edinburgh's WMT16 systems.
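For the Nematus pretrained single model, inference is run with the translate.py script shipped with Nematus, along the lines of the Edinburgh WMT16 sample scripts (the file names are illustrative, and the input again has to be preprocessed with the model's BPE):
python nematus/translate.py -m model.npz -i input.bpe.en -o output.bpe.de -k 12 -n -p 1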
The T2T (tensor2tensor) default model is based on the Google Brain team's notebook.
gsutil cp gs://tensor2tensor-data/vocab.ende.32768 .
gsutil -q cp -R gs://tensor2tensor-checkpoints/ ckpt
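Decoding with the downloaded checkpoint then uses t2t-decoder, roughly as in the notebook. The problem, model and hparams names below are the usual WMT En-De defaults, the checkpoint directory name is an assumption, and older tensor2tensor releases spell the flag --problems:
t2t-decoder \
  --data_dir=. \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --output_dir=ckpt/transformer_ende_test \
  --decode_from_file=input.en \
  --decode_to_file=output.de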
The Sockeye model is trained by running the following commands:
pip3 install sockeye --no-deps numpy mxnet-cu80==1.0.0 pyyaml typing
python3 -m learn_joint_bpe_and_vocab --input train.clean.en train.clean.de \
-s 30000 \
-o bpe.codes \
--write-vocabulary bpe.vocab.en bpe.vocab.de
python3 -m apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 < train.clean.en > train.clean.BPE.en
python3 -m apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 < train.clean.de > train.clean.BPE.de
python3 -m apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 < Batch3a_en.txt > Batch3a.BPE.en
python3 -m apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 < Batch3a_de.txt > Batch3a.BPE.de
python -m sockeye.train -s train.clean.BPE.en \
-t train.clean.BPE.de \
-vs Batch3a.BPE.en \
-vt Batch3a.BPE.de \
--num-embed 256 \
--rnn-num-hidden 512 \
--rnn-attention-type dot \
--max-seq-len 60 \
--decode-and-evaluate 500 \
--device-ids 0 \
-o wmt_model
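Translation with the trained model then goes through sockeye.translate: apply the same BPE to the tokenized input, translate, and strip the BPE markers from the output (the input and output file names are illustrative):
python3 -m apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 < input.tok.en > input.BPE.en
python -m sockeye.translate -m wmt_model < input.BPE.en > output.BPE.de
sed -r 's/@@( |$)//g' < output.BPE.de > output.de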
Example source sentences in the IT domain:
In the Bullets And Numbering dialog box , select Bullets from the List Type menu .
Acrobat 4.0 and 5.0 require that sounds be embedded and movies be linked .
A new bitmap object based on the pixel selection is created in the current layer , and the selected pixels are removed from the original bitmap object .
To finish the measurement , right-click and select Complete Measurement .
Select the text , frame , or graphic you want to be the source of the hyperlink .