Initial pipeline #1

Merged: 72 commits, Jun 17, 2021
Changes from 1 commit

Commits
c6a96dc
Copy students scripts. Fix install, download and clean steps
eu9ene May 7, 2021
d8fce65
Teacher scripts
eu9ene May 8, 2021
7b75925
Fix install scripts
eu9ene May 12, 2021
6403742
Fix install reproducibility
eu9ene May 12, 2021
81f41ab
Datasets downloading automation
eu9ene May 13, 2021
9e1e32b
Cleaning fixes
eu9ene May 13, 2021
0a08add
Fix run
eu9ene May 13, 2021
83a3f07
Fix cleaning
eu9ene May 13, 2021
bc996e8
Fix python path
eu9ene May 13, 2021
047f602
Fix clean scripts
eu9ene May 14, 2021
6684801
Fix naming of clean scripts
eu9ene May 14, 2021
c05b0ab
Refactor folders structure
eu9ene May 17, 2021
dc3a809
Remove unused submodules
eu9ene May 17, 2021
fdefe78
Move submodules to a separate dir
eu9ene May 17, 2021
52b7ab9
Refactor model training scripts
eu9ene May 18, 2021
96b6f83
Fix relative paths
eu9ene May 18, 2021
760b8e2
Fix paths in install scripts
eu9ene May 21, 2021
8381fb8
Downloading and training fixes
eu9ene May 21, 2021
625ba9d
Add mono downloading, fix training paths
eu9ene May 21, 2021
4c273fa
Add back translations
eu9ene May 26, 2021
c61f8f7
Fix translations scripts
eu9ene May 27, 2021
8c59a0d
Refactorings
eu9ene May 28, 2021
9895024
Add snakepit runner
eu9ene May 28, 2021
e8ed869
Fix translation
eu9ene Jun 1, 2021
5193478
Add ce filter
eu9ene Jun 1, 2021
8dae039
Ce-filter fixes
eu9ene Jun 2, 2021
492d7c0
Use pigz
eu9ene Jun 2, 2021
218ded5
Add more docs
eu9ene Jun 2, 2021
e5cd54d
Fix alignment
eu9ene Jun 2, 2021
63b5a88
Fix alignment
eu9ene Jun 2, 2021
ea3291e
Fix ce-filter
eu9ene Jun 2, 2021
9823dcd
Use more memory for sorting
eu9ene Jun 2, 2021
8eefb27
Use local tmp dir for alignments
eu9ene Jun 2, 2021
9fc031d
Fix alignment scripts
eu9ene Jun 3, 2021
d213e54
Fix shortlist scripts
eu9ene Jun 4, 2021
9a5c252
Copy teacher vocab
eu9ene Jun 4, 2021
a1e5595
Fix tensorboard
eu9ene Jun 4, 2021
4f0b1aa
Add quantization
eu9ene Jun 4, 2021
1175d2c
Refactorings
eu9ene Jun 4, 2021
fba3257
Move finetuning to a separate folder
eu9ene Jun 7, 2021
6176e25
Refactor training scripts
eu9ene Jun 7, 2021
1a2c910
Fix quantization
eu9ene Jun 7, 2021
926e59e
Fix evaluation of quantized model
eu9ene Jun 7, 2021
d30b97c
Add export
eu9ene Jun 8, 2021
c4b0cc1
Fix export
eu9ene Jun 8, 2021
7ffed00
Fix evaluation
eu9ene Jun 8, 2021
c2fa1e8
Format code
eu9ene Jun 9, 2021
77b2c07
Pin python packages
eu9ene Jun 9, 2021
3a4902b
Add more logging
eu9ene Jun 9, 2021
125f389
Refactor directories
eu9ene Jun 9, 2021
eb8a37e
Fix config
eu9ene Jun 10, 2021
72ba220
Add more docs
eu9ene Jun 10, 2021
82f057f
Add datasets docs
eu9ene Jun 10, 2021
ed49f3c
Fix corpus downloading edge case
eu9ene Jun 10, 2021
e80b5e0
Change paracrawl prefix
eu9ene Jun 10, 2021
e464620
Fix corpus downloading
eu9ene Jun 10, 2021
4207b6c
Fix the main script
eu9ene Jun 10, 2021
2d29cd3
Fix GPUS arg
eu9ene Jun 10, 2021
741a476
Add more docs
eu9ene Jun 10, 2021
9ce07ce
Fix ce filtering
eu9ene Jun 11, 2021
c06c023
Add condition to evaluation
eu9ene Jun 11, 2021
342c6f8
Use augmented dataset for the teacher
eu9ene Jun 11, 2021
7ee1d2a
Fix corpus translation
eu9ene Jun 11, 2021
232de97
Add mode documentation
eu9ene Jun 11, 2021
83e0274
Add architecture section
eu9ene Jun 11, 2021
717d1f2
Ignore pipefail for mono corpus shuffling
eu9ene Jun 12, 2021
e1d27a6
Unquote positional arguments
eu9ene Jun 15, 2021
bb32b18
Add ability to skip corpus augmentation with back-translations
eu9ene Jun 15, 2021
4203e87
Add instructions how to run tensorboard
eu9ene Jun 15, 2021
9efb3a5
Add more checks to cleaning scripts
eu9ene Jun 15, 2021
db997ee
Extract variable for teacher corpus
eu9ene Jun 15, 2021
68ba306
Add experiment name to directories structure
eu9ene Jun 16, 2021
Add back translations
eu9ene committed May 26, 2021
commit 4c273fac4146ad3324dc0c0fd08c42bf4546f70c
3 changes: 3 additions & 0 deletions .gitignore
@@ -129,3 +129,6 @@ dmypy.json
.pyre/

.idea
.data
.models
.bin
9 changes: 5 additions & 4 deletions config.sh
@@ -10,15 +10,16 @@ TRG=en
# parallel corpus
TRAIN_DATASETS="opus_OPUS-ParaCrawl/v7.1"
DEVTEST_DATASETS="mtdata_newstest2019_ruen mtdata_newstest2017_ruen mtdata_newstest2015_ruen mtdata_newstest2014_ruen"
# sacrebleu
TEST_DATASETS="wmt20 wmt18 wmt16 wmt13"
# mono for source language (ex. paracrawl_paracrawl8 commoncrawl_wmt16)
# monolingual datasets (ex. paracrawl_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
MONO_DATASETS_SRC="news-crawl_news.2020"
MONO_DATASETS_TRG="paracrawl_paracrawl8"
MONO_DATASETS_TRG="news-crawl_news.2020"
MONO_MAX_SENTENCES_SRC=100000000
MONO_MAX_SENTENCES_TRG=10000000
MONO_MAX_SENTENCES_TRG=20000000


# marian --devices parameter for GPUs to use, for example 0 1 2 3
GPUS=$(seq -s " " 0 $(( $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-1 )))
# for 12 GB GPU
WORKSPACE=8000
WORKSPACE=9000
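The `GPUS=` line in config.sh builds the space-separated device list that marian's `--devices` flag expects by counting the lines `nvidia-smi` prints (one per GPU). A sketch of the same expansion with the GPU count stubbed out so it runs without a GPU:

```shell
# Stub: pretend nvidia-smi reported 4 GPU names, one per line.
gpu_count=4

# seq -s " " 0 N-1 yields a space-separated device list such as "0 1 2 3",
# which is the format marian accepts for --devices.
GPUS=$(seq -s " " 0 $((gpu_count - 1)))
echo "$GPUS"   # → 0 1 2 3
```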
5 changes: 3 additions & 2 deletions pipeline/data/download-corpus.sh
@@ -26,8 +26,9 @@ if [ ! -e ${trg_corpus} ]; then

for dataset in $datasets; do
echo "Downloading dataset ${dataset}"
name=${dataset#_*}
bash ./importers/corpus/${dataset%_*}.sh $SRC $TRG $dir $name
name=${dataset#*_}
type=${dataset%_*}
bash ${WORKDIR}/pipeline/data/importers/corpus/${type}.sh $SRC $TRG $dir $name
done

cat ${dir}/train-parts/*."${SRC}" | pigz > "$src_corpus"
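This hunk fixes the parameter expansion that splits a dataset identifier into importer type and dataset name: the old `${dataset#_*}` pattern never matched (the identifier does not start with `_`), so `name` kept the whole string, while `${dataset#*_}` strips through the first underscore. A sketch with a made-up identifier:

```shell
dataset="news-crawl_news.2020"   # assumed format: <importer-type>_<name>

name=${dataset#*_}   # drop the shortest prefix ending at the first "_"
type=${dataset%_*}   # drop the shortest suffix starting at the last "_"

echo "$type"   # → news-crawl
echo "$name"   # → news.2020
```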
13 changes: 7 additions & 6 deletions pipeline/data/download-mono.sh
@@ -24,19 +24,20 @@ if [ ! -e ${file_name} ]; then

for dataset in $datasets; do
echo "Downloading dataset ${dataset}"
name=${dataset#*_}
source_path=$dir/$dataset.original.$lang
source_prefix=$dir/$dataset.original.$lang
gz_path=$dir/$dataset.$lang.gz
name=${dataset#*_}
type=${dataset%_*}

name=${dataset#_*}
bash ./importers/mono/${dataset%_*}.sh $lang $dir $name
test -s $source_prefix.gz || \
bash ${WORKDIR}/pipeline/data/importers/mono/${type}.sh $lang $source_prefix $name

test -s $gz_path || \
zcat $source_path.gz | shuf -n $(bc -l <<< "${max_sent}+${max_sent}*${coef}") | \
zcat $source_prefix.gz | shuf -n $(bc -l <<< "${max_sent}+${max_sent}*${coef}") | \
perl -ne 'print if(split(/\s/, $_) < 100)' | \
head -n "$max_sent" | pigz > $gz_path

rm $source_path.*
rm $source_prefix*
done

zcat ${dir}/*.$lang.gz | pigz > $file_name
12 changes: 5 additions & 7 deletions pipeline/data/importers/mono/commoncrawl.sh
@@ -2,18 +2,16 @@
# Downloads monolingual data from commoncrawl
#
# Usage:
# bash commoncrawl.sh lang dir dataset
# bash commoncrawl.sh lang output_prefix dataset
#

set -x
set -euo pipefail

lang=$1
dir=$2
output_prefix=$2
dataset=$3

source_path=$dir/$dataset.original.$lang

test -s $source_path.xz || \
wget -O $source_path.xz http://web-language-models.s3-website-us-east-1.amazonaws.com/${name}/deduped/${lang}.xz
xzcat $source_path.xz | pigz > $source_path.gz
test -s ${output_prefix}.gz || \
rm ${output_prefix}.gz && wget -O ${output_prefix}.xz http://web-language-models.s3-website-us-east-1.amazonaws.com/${dataset}/deduped/${lang}.xz
xzcat $output_prefix.xz | pigz > $output_prefix.gz
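A note on the `test -s … || rm … && wget …` guard introduced here: without grouping, the shell parses it as `(test -s … || rm …) && wget …`, so the download still runs when the file already exists. A sketch of the intended idempotent grouping, with a hypothetical stand-in for the real wget call:

```shell
output_prefix=./data
rm -f "${output_prefix}.gz"

# Stand-in for the real download (hypothetical helper).
download() { printf 'payload\n' > "${output_prefix}.gz"; }

# Braces scope both the cleanup and the fetch to the cache-miss branch,
# so an existing non-empty file is left untouched.
test -s "${output_prefix}.gz" || { rm -f "${output_prefix}.gz"; download; }

cat "${output_prefix}.gz"   # → payload
```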
9 changes: 4 additions & 5 deletions pipeline/data/importers/mono/news-crawl.sh
@@ -2,18 +2,17 @@
# Downloads monolingual data from OPUS
#
# Usage:
# bash opus.sh lang dir dataset
# bash opus.sh lang output_prefix dataset
#


set -x
set -euo pipefail

lang=$1
dir=$2
output_prefix=$2
dataset=$3

source_path=$dir/$dataset.original.$lang

test -s $source_path.gz || \
wget -O $source_path.gz http://data.statmt.org/news-crawl/${lang}/${name}.${lang}.shuffled.deduped.gz
test -s $output_prefix.gz || \
wget -O $output_prefix.gz http://data.statmt.org/news-crawl/${lang}/${dataset}.${lang}.shuffled.deduped.gz
6 changes: 3 additions & 3 deletions pipeline/data/importers/mono/paracrawl.sh
@@ -9,14 +9,14 @@ set -x
set -euo pipefail

lang=$1
dir=$2
output_prefix=$2
dataset=$3


if [[ $lang == "en" ]]
then
source_path=$dir/$dataset.original.$lang
test -s $source_path.gz || wget -nc -O $source_path.gz https://neural.mt/data/$dataset-mono/en-000.gz
test -s $output_prefix.gz || \
wget -O $output_prefix.gz https://neural.mt/data/${dataset}-mono/en-000.gz
else
echo "Only English language is supported at this time for paracrawl"
exit 1
4 changes: 2 additions & 2 deletions pipeline/train/configs/training/s2s.train.yml
@@ -1,4 +1,4 @@
# https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin
## https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin
after-epochs: 10
beam-size: 12
cost-type: ce-mean-words
@@ -11,4 +11,4 @@ mini-batch-fit: True
normalize: 1
save-freq: 10000
valid-freq: 10000
valid-mini-batch: 64
valid-mini-batch: 64
@@ -1,4 +1,5 @@
# https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin
after-epochs: 8
beam-size: 12
clip-norm: 5
cost-type: ce-mean-words
7 changes: 5 additions & 2 deletions pipeline/train/eval.sh
@@ -11,12 +11,15 @@ set -euo pipefail

test -v GPUS
test -v MARIAN
tests -v WORKSPACE
test -v WORKSPACE

model_dir=$1
src="${2:-$SRC}"
trg="${3:-$TRG}"
test_datasets=${{@:4}:-$TEST_DATASETS}
datasets=${@:4}
test_datasets=${datasets:-$TEST_DATASETS}

mo



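The eval.sh fix replaces the invalid nested expansion `${{@:4}:-$TEST_DATASETS}` with two steps, since bash cannot apply a `:-` default directly to a positional-parameter slice. A sketch with hypothetical arguments:

```shell
set -- models/ru-en ru en wmt20 wmt18   # hypothetical eval.sh arguments

datasets=${@:4}                    # positional parameters from the 4th onward,
                                   # joined with spaces when assigned to a scalar
test_datasets=${datasets:-wmt13}   # the default applies only when none were given
echo "$test_datasets"              # → wmt20 wmt18
```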
5 changes: 3 additions & 2 deletions pipeline/train/tensorboard/tb_log_parser.py
@@ -196,9 +196,10 @@ def update_all_avg(self):


@click.command()
@click.option('--dir')
@click.option('--prefix',
default='model')
def run(prefix):
default='')
def run(dir, prefix):
monitors = {}

while True:
5 changes: 2 additions & 3 deletions pipeline/train/tensorboard/tesnsorboard.sh
@@ -1,6 +1,5 @@

conda activate bergamot-training-env

python ../marian-tensorboard/tb_log_parser.py --prefix=

tensorboard --logdir=./ --host=0.0.0.0
python ./tb_log_parser.py --prefix= & \
tensorboard --logdir=./ --host=0.0.0.0 && fg
2 changes: 1 addition & 1 deletion pipeline/train/train-teacher-ensemble.sh
@@ -20,7 +20,7 @@ do
${WORKDIR}/pipeline/train/configs/training/teacher.transformer-ens.train.yml \
$SRC \
$TRG \
${DATA_DIR}/clean/corpus \
${DATA_DIR}/augmented/corpus \
${DATA_DIR}/original/devset \
${MODELS_DIR}/$SRC-$TRG/teacher-ens$i
done
11 changes: 11 additions & 0 deletions pipeline/translate/decoder.yml
@@ -0,0 +1,11 @@
normalize: 1.0
word-penalty: 0
mini-batch: 16
mini-batch-words: 2000
maxi-batch: 1000
maxi-batch-sort: src
workspace: 8000
max-length: 200
max-length-crop: true
beam-size: 8
quiet-translation: True
42 changes: 26 additions & 16 deletions pipeline/translate/translate-mono.sh
@@ -1,35 +1,45 @@
#!/bin/bash

# Usage: ./translate-mono.sh -d 4 5 6 7
# Usage: ./translate-mono.sh mono_path model_dir output_path

set -e
set -x
set -euo pipefail

# Adjust these variables if needed.
MARIAN=../../marian-dev/build
CORPUSSRC=mono.en.gz
CONFIG=teacher.yml
DIR=mono
OUTPUT=$DIR.translated.gz

mkdir -p $DIR
test -v GPUS
test -v MARIAN


mono_path=$1
model_dir=$2
output_path=$3

config=${model_dir}/model.npz.best-ce-mean-words.npz.decoder.yml
decoder_config=${WORKDIR}/pipeline/translate/decoder.yml
tmp_dir=$(dirname $output_path)/tmp
mkdir -p $tmp_dir


# Split the corpus into smaller chunks.
test -s $DIR/file.00 || pigz -dc $CORPUSSRC | split -d -l 2000000 - $DIR/file.
test -s $tmp_dir/file.00 || pigz -dc $mono_path | split -d -l 2000000 - $tmp_dir/file.

# Translate source sentences with Marian.
# This can be parallelized across several GPU machines.
for prefix in `ls $DIR/file.?? | shuf`; do
for prefix in `ls ${tmp_dir}/file.?? | shuf`; do
echo "# $prefix"
test -e $prefix.out || $MARIAN/marian-decoder -c $CONFIG -i $prefix -o $prefix.out --log $prefix.log -b 4 $@
test -e $prefix.out || \
$MARIAN/marian-decoder -c $config $decoder_config -i $prefix -o $prefix.out --log $prefix.log \
-d $GPUS -w $WORKSPACE
done

# Collect translations.
cat $DIR/file.??.out | pigz > $OUTPUT
cat $tmp_dir/file.??.out | pigz > $output_path

# Source and artificial target files must have the same number of sentences,
# otherwise collect the data manually.
echo "# sentences $CORPUSSRC vs $OUTPUT"
pigz -dc $CORPUSSRC | wc -l
pigz -dc $OUTPUT | wc -l
echo "# sentences $mono_path vs $output_path"
pigz -dc $mono_path | wc -l
pigz -dc $output_path | wc -l

rm -rf $tmp_dir
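translate-mono.sh now follows a split, map, merge pattern: shard the corpus into numbered chunks, decode each chunk (the `test -e` guard lets an interrupted run resume), and concatenate the outputs in numeric order. A toy version with `rev` standing in for the decoder:

```shell
tmp_dir=./tmp_chunks; mkdir -p "$tmp_dir"
seq 10 > input.txt

# Shard into numbered 4-line chunks: file.00, file.01, file.02.
split -d -l 4 input.txt "$tmp_dir/file."

# "Translate" each chunk; the test -e guard skips chunks already done.
for prefix in "$tmp_dir"/file.??; do
  test -e "$prefix.out" || rev < "$prefix" > "$prefix.out"
done

# Numbered suffixes keep the merged output aligned with the input order.
cat "$tmp_dir"/file.??.out > output.txt
wc -l < output.txt   # → 10
```

The final line-count check mirrors the script: the source and synthetic target must have the same number of sentences or the pair is unusable.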

29 changes: 20 additions & 9 deletions run.sh
@@ -19,6 +19,9 @@ set -euo pipefail
#│ │ ├ corpus.en.gz
#│ │ ├ mono.ru.gz
#│ │ ├ mono.en.gz
#│ ├ translated
#│ │ ├ mono.ru.gz
#│ │ ├ mono.en.gz
#│ ├ augmented
#│ │ ├ corpus.ru.gz
#│ │ ├ corpus.en.gz
@@ -33,6 +36,7 @@ set -euo pipefail




set -a
. ./config.sh
set +a
@@ -47,26 +51,33 @@ original=${DATA_DIR}/original
. ./pipeline/data/download-corpus.sh ${original}/corpus $TRAIN_DATASETS
. ./pipeline/data/download-corpus.sh ${original}/devset $DEVTEST_DATASETS
if [[ ${MONO_DATASETS_SRC} ]]; then
. ./pipeline/data/download-mono.sh ${SRC} $MONO_MAX_SENTENCES_SRC ${original}/mono $MONO_DATASETS_SRC
. ./pipeline/data/download-mono.sh ${SRC} $MONO_MAX_SENTENCES_SRC ${original}/mono $MONO_DATASETS_SRC
fi
if [[ ${MONO_DATASETS_TRG} ]]; then
. ./pipeline/data/download-mono.sh ${TRG} $MONO_MAX_SENTENCES_TRG ${original}/mono $MONO_DATASETS_TRG
. ./pipeline/data/download-mono.sh ${TRG} $MONO_MAX_SENTENCES_TRG ${original}/mono $MONO_DATASETS_TRG
fi

clean=${DATA_DIR}/clean
. ./pipeline/clean/clean-corpus.sh ${original}/corpus ${clean}/corpus
if [[ -e ${DATA_DIR}/original/mono.${SRC}.gz ]]; then
. ./pipeline/clean/clean-mono.sh ${SRC} ${original}/mono ${clean}/mono
. ./pipeline/clean/clean-mono.sh ${SRC} ${original}/mono ${clean}/mono
fi
if [[ -e ${DATA_DIR}/original/mono.${TRG}.gz ]]; then
. ./pipeline/clean/clean-mono.sh ${TRG} ${original}/mono ${clean}/mono
if [[ -e ${original}/mono.${TRG}.gz ]]; then
. ./pipeline/clean/clean-mono.sh ${TRG} ${original}/mono ${clean}/mono
fi

. ./pipeline/train/train-s2s.sh $TRG $SRC
. ./pipeline/train/eval.sh ${MODELS_DIR}/teacher-ens $TRG $SRC
. ./pipeline/train/eval.sh ${MODELS_DIR}/$TRG-$SRC/s2s $TRG $SRC


# TODO: backtranslate and augment corpus
. ./pipeline/translate/translate-mono.sh ${clean}/mono.$TRG.gz ${MODELS_DIR}/$TRG-$SRC/s2s ${DATA_DIR}/translated/mono.$SRC.gz

augmented=${DATA_DIR}/augmented
mkdir -p $augmented
test -s $augmented/corpus.$SRC.gz || cat ${DATA_DIR}/translated/mono.$SRC.gz ${DATA_DIR}/clean/corpus.$SRC.gz > $augmented/corpus.$SRC.gz
test -s $augmented/corpus.$TRG.gz || cat ${clean}/mono.$TRG.gz ${DATA_DIR}/clean/corpus.$TRG.gz > $augmented/corpus.$TRG.gz
pigz -dc $augmented/corpus.$SRC.gz | wc -l
pigz -dc $augmented/corpus.$TRG.gz | wc -l

. ./pipeline/train/train-teacher-ens.sh
. ./pipeline/train/eval.sh ${MODELS_DIR}/teacher-ens
. ./pipeline/train/train-teacher.sh
. ./pipeline/train/eval.sh ${MODELS_DIR}/$SRC-$TRG/teacher
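The augmentation step at the end of run.sh builds the teacher corpus by prepending the synthetic pairs (back-translated mono on the source side, original mono on the target side) to the clean corpus; the two `wc -l` calls are there because both sides must stay line-aligned. A miniature of the same concatenation with made-up data:

```shell
# Hypothetical stand-ins for the real corpora: two synthetic pairs, three clean.
printf 'bt1\nbt2\n'   > translated.src   # back-translations of mono.trg
printf 'm1\nm2\n'     > mono.trg
printf 'c1\nc2\nc3\n' > clean.src
printf 't1\nt2\nt3\n' > clean.trg

# Augmented corpus = synthetic pairs followed by the original pairs;
# line i of each side must still form one sentence pair.
cat translated.src clean.src > augmented.src
cat mono.trg clean.trg > augmented.trg

wc -l < augmented.src   # → 5
wc -l < augmented.trg   # → 5
```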