Quality improvements #29

Merged
merged 191 commits into from
Dec 6, 2021
Commits
ba985dc
Shuffle corpus after merge
eu9ene Aug 13, 2021
1cb7955
Automatically exclude WMT news from opus data
eu9ene Aug 13, 2021
ca83f89
Add flores importer
eu9ene Aug 13, 2021
2bbc857
Add custom importers
eu9ene Aug 13, 2021
9ba4da8
Fix condition in corpus translation, replace ls grep
eu9ene Aug 13, 2021
92205ce
Add experiments tracking and caching interface
eu9ene Aug 13, 2021
3780866
Add ability to use pretrained backward model
eu9ene Aug 13, 2021
e2e23eb
Simple snakemake pipeline
eu9ene Aug 14, 2021
139c12c
Fix data downloading
eu9ene Aug 14, 2021
63d17ed
Use tmp directories in sort
eu9ene Aug 16, 2021
2f492fe
Fix experiment saving
eu9ene Aug 16, 2021
09658a0
Exclude UN from mtdata
eu9ene Aug 16, 2021
9e7bd4b
Fix corpus merging
eu9ene Aug 16, 2021
90c8013
Fix downloading evaluation datasets
eu9ene Aug 16, 2021
b345dd9
Fix run.sh
eu9ene Aug 16, 2021
ced12d6
Use best bleu models
eu9ene Aug 16, 2021
31bc4be
Merge branch 'improvements' into snakemake
eu9ene Aug 16, 2021
80ce666
Snakemake environment
eu9ene Aug 16, 2021
5cb7029
Fix conda activation
eu9ene Aug 16, 2021
3ed0cce
Fix conda env
eu9ene Aug 17, 2021
9aa940c
Remove dry run
eu9ene Aug 17, 2021
bee95fd
Add number of threads
eu9ene Aug 17, 2021
aeaeaec
Add monitoring
eu9ene Aug 17, 2021
01cd9da
Add threads, fix teacher
eu9ene Aug 17, 2021
85b231a
Remove python env activation
eu9ene Aug 17, 2021
b532e42
Add tmp id for corpus downloading
eu9ene Aug 17, 2021
5358239
Fix teacher
eu9ene Aug 17, 2021
a083ae4
Fix output
eu9ene Aug 17, 2021
e3f1cea
Use rules outputs
eu9ene Aug 17, 2021
ebe1fd4
Remove envs config
eu9ene Aug 18, 2021
30b69b6
Add snakepit profile
eu9ene Aug 18, 2021
411feed
Fix makefile
eu9ene Aug 18, 2021
06ac1db
Parallelize teacher training
eu9ene Aug 18, 2021
5466c06
Fix training script
eu9ene Aug 18, 2021
382b3b1
Modularize workflow
eu9ene Aug 19, 2021
6345cff
Fix makefile
eu9ene Aug 19, 2021
fb25b43
Fix makefile
eu9ene Aug 19, 2021
40c5187
Fix submit
eu9ene Aug 19, 2021
3a61340
Fix setup
eu9ene Aug 20, 2021
b82de81
Parallelize translation
eu9ene Aug 27, 2021
3d91d3e
Fix split
eu9ene Aug 27, 2021
c8a5df1
Fix part naming
eu9ene Aug 27, 2021
1d784b3
Remove logs from output
eu9ene Aug 30, 2021
0319e3c
Fix logging
eu9ene Aug 30, 2021
4445e83
Fix mtdata importer
eu9ene Aug 30, 2021
51593cc
Use 8 GPUs by default
eu9ene Aug 30, 2021
8da028f
Change default train dataset
eu9ene Aug 30, 2021
9d5f7f8
Add mono translation, limit gpus
eu9ene Aug 30, 2021
e980f25
Fix translation
eu9ene Aug 30, 2021
004f592
Fix gpu limit
eu9ene Aug 30, 2021
8246dc0
Fix extract best
eu9ene Aug 30, 2021
004ae15
Add more steps
eu9ene Aug 31, 2021
80dfecb
Fix array arguments
eu9ene Aug 31, 2021
452d542
Remove workdirs
eu9ene Aug 31, 2021
8d466a7
Add the rest of the steps, refactoring
eu9ene Sep 1, 2021
9f745eb
Refactor configuration
eu9ene Sep 1, 2021
800745e
Add kenlm
eu9ene Sep 1, 2021
2aab393
Add logging to stdout
eu9ene Sep 1, 2021
9cffd7a
Use working directory
eu9ene Sep 1, 2021
7f74aad
Save experiment, fix mono output
eu9ene Sep 1, 2021
2344ef5
Fix mono, inputs
eu9ene Sep 2, 2021
5a26d74
Fix arrays usage
eu9ene Sep 2, 2021
187f606
Fix kenlm output
eu9ene Sep 2, 2021
5f54a08
Use separate envs for bicleaner and bicleaner-ai
eu9ene Sep 3, 2021
ac5c39f
Fix envs
eu9ene Sep 3, 2021
253edcb
Move ce-filer, fix kenlm
eu9ene Sep 3, 2021
8b64702
Fix experiment saving
eu9ene Sep 3, 2021
9700ce9
Fix backward
eu9ene Sep 3, 2021
262ae54
Fix model output dirs
eu9ene Sep 3, 2021
94d794a
Add shuffling to merge
eu9ene Sep 3, 2021
a1950e6
Fix evaluation workflow
eu9ene Sep 3, 2021
488dec6
Fix collect corpus
eu9ene Sep 3, 2021
f6a7b3e
Fix eval downloading
eu9ene Sep 3, 2021
a8daadb
Fix ce filter
eu9ene Sep 3, 2021
e845bb4
Fix quantization
eu9ene Sep 3, 2021
6fe5b4a
Fix comment
eu9ene Sep 3, 2021
81e5b60
Use shell prefix
eu9ene Sep 3, 2021
1fdb67f
Fix report metadata
eu9ene Sep 6, 2021
29e53f2
Implement reporting
eu9ene Sep 6, 2021
1884518
Add file server runner
eu9ene Sep 6, 2021
74a5953
Fix makefile
eu9ene Sep 6, 2021
048d867
Fix caption path
eu9ene Sep 6, 2021
2159fa3
Refactor configuration
eu9ene Sep 7, 2021
bb764de
Fix config loader
eu9ene Sep 7, 2021
1c85ecc
Fix makefile
eu9ene Sep 7, 2021
877b65b
Dynamic sharding
eu9ene Sep 7, 2021
8e0da72
Fix extract best
eu9ene Sep 8, 2021
109a6ba
Fix submit script
eu9ene Sep 10, 2021
ab414c5
Add containerization
eu9ene Sep 10, 2021
1a36338
Add slurm profile
eu9ene Oct 5, 2021
04f9ac6
Add job scripts
eu9ene Oct 5, 2021
54a1db9
Compile marian on gpu
eu9ene Oct 5, 2021
9006584
Use singularity
eu9ene Oct 5, 2021
18f0ae1
Fix slurm running
eu9ene Oct 6, 2021
2273a8b
Make slurm scripts executable
eu9ene Oct 6, 2021
755164c
Fix slurm submit
eu9ene Oct 6, 2021
2566667
Fix job name
eu9ene Oct 6, 2021
14006c7
Remove setup step
eu9ene Oct 6, 2021
6bb041a
Fix snakefile
eu9ene Oct 6, 2021
021514c
Move dirs to config
eu9ene Oct 6, 2021
a2a131e
Use different partitions for cpu and gpu
eu9ene Oct 6, 2021
1c1a837
Refactor configuration
eu9ene Oct 7, 2021
4b17618
Fix config step
eu9ene Oct 7, 2021
fdb6444
Fix config step
eu9ene Oct 7, 2021
7ac152b
Add cmake to bicleaner env
eu9ene Oct 8, 2021
99c28dd
Mount cuda dir only for gpu jobs
eu9ene Oct 8, 2021
cb94a33
Fix environments
eu9ene Oct 13, 2021
9b52973
Use cluster modules
eu9ene Oct 13, 2021
dfe06fa
Fix jobscript
eu9ene Oct 15, 2021
eb3dfee
Add conditional dependencies installation
eu9ene Oct 15, 2021
6d257d1
Move install-deps to config
eu9ene Oct 15, 2021
a0031fd
Refactor configuration
eu9ene Oct 15, 2021
d2f5e57
Fix configuration
eu9ene Oct 15, 2021
c1ef7e2
Fix experiment saving
eu9ene Oct 15, 2021
c001808
Replace json to yaml
eu9ene Oct 15, 2021
6495ec7
Fix install deps configuration
eu9ene Oct 15, 2021
6203fc6
Update datasets
eu9ene Oct 16, 2021
5590641
Fix reporting
eu9ene Oct 16, 2021
ffceac3
Fix train datasets
eu9ene Oct 18, 2021
178252c
Add tmp dir to clean mono
eu9ene Oct 18, 2021
fb5c96b
Add non container slurm
eu9ene Oct 19, 2021
6d0e3ee
Fix vocab training
eu9ene Oct 19, 2021
78adf65
Fix vocab training
eu9ene Oct 19, 2021
8a0a1a2
Lower threads for non cpu intensive tasks
eu9ene Oct 19, 2021
4254234
Add account to slurm submit
eu9ene Oct 19, 2021
4dea796
Fix slurm submit
eu9ene Oct 19, 2021
5f0b4b5
Fix marian threads
eu9ene Oct 19, 2021
cc28d6e
Reduce sample size for vocab
eu9ene Oct 20, 2021
e9304f2
Silent installation of snakemake
eu9ene Oct 20, 2021
a5cf84a
Limit number of epochs for testing
eu9ene Oct 20, 2021
44b32f7
Make node memory configurable
eu9ene Oct 20, 2021
f67871e
Log slurm command
eu9ene Oct 21, 2021
a613296
Change logging level
eu9ene Oct 21, 2021
9c0aa7c
Update teacher hyper params
eu9ene Oct 22, 2021
1462346
Merge branch 'main' into quality
eu9ene Oct 28, 2021
994dbdd
Remove punctuation normalization
eu9ene Oct 30, 2021
e9b98c6
Fix teacher hyper params
eu9ene Nov 1, 2021
67301ee
Continue training teacher on parallel corpus
eu9ene Nov 1, 2021
30c7165
Add more languages
eu9ene Nov 1, 2021
abd644c
Fix ce filter
eu9ene Nov 2, 2021
f61aecd
Add ensemble evaluation
eu9ene Nov 3, 2021
16c3126
Add translation verification
eu9ene Nov 4, 2021
944032c
Add teacher recipe source
eu9ene Nov 5, 2021
c01dd7f
Add dataset specific fixes
eu9ene Nov 5, 2021
f73d862
Modify pipeline for per dataset cleaning
eu9ene Nov 5, 2021
4852740
Add custom bicleaner thresholds
eu9ene Nov 17, 2021
4c85846
Fix dataset specific modifications
eu9ene Nov 18, 2021
8688f6e
Refactor corpus downloading
eu9ene Nov 18, 2021
eb45619
Make evaluation work with compressed corpus
eu9ene Nov 18, 2021
59a362b
Add DAG schema
eu9ene Nov 18, 2021
e6561a1
Fix mtdata importer
eu9ene Nov 19, 2021
1c7884f
Extract bilcleaner pack downloading
eu9ene Nov 19, 2021
4e93854
Fix mono cleaning
eu9ene Nov 19, 2021
176853b
Fix bicleaner
eu9ene Nov 19, 2021
217de12
Fix expands
eu9ene Nov 19, 2021
2e7fb06
Fix tmp dir
eu9ene Nov 19, 2021
2cbec79
Fix merge corpus
eu9ene Nov 19, 2021
3f26532
Fix merge corpus
eu9ene Nov 19, 2021
49ebb88
Use relative paths in cleaning scripts
eu9ene Nov 19, 2021
bb9a333
Use relative paths
eu9ene Nov 19, 2021
915e327
Fix evaluation directory
eu9ene Nov 19, 2021
2fd287f
Fix evaluation
eu9ene Nov 19, 2021
499700a
Add between workflow caching of data
eu9ene Nov 19, 2021
fb80641
Configure test training args per model
eu9ene Nov 20, 2021
619a81f
Move download params
eu9ene Nov 22, 2021
3860feb
Disable caching
eu9ene Nov 22, 2021
039d36d
Add chrf metric to evaluation
eu9ene Nov 23, 2021
c5bc221
Use threads for corpus fixes
eu9ene Nov 23, 2021
0a532ed
Disable cache
eu9ene Nov 23, 2021
0913734
Remove temporary files
eu9ene Nov 23, 2021
ef0f5a9
Remove per dataset deduplication
eu9ene Nov 23, 2021
9319355
Fix bicleaner
eu9ene Nov 24, 2021
f950e3e
Update snakemake
eu9ene Nov 24, 2021
c5dd51e
Fix training args
eu9ene Nov 24, 2021
6f938b5
Limit training of teacher on augmented dataset
eu9ene Nov 24, 2021
cbb1e43
Remove unused code
eu9ene Nov 24, 2021
238b781
Update docs
eu9ene Nov 25, 2021
d07fd59
Fix evaluation report
eu9ene Nov 29, 2021
9354127
Remove unnecessary decompression
eu9ene Nov 30, 2021
cbaa1aa
Use marian score normalization
eu9ene Nov 30, 2021
ccb765f
Refactor model configuration
eu9ene Nov 30, 2021
e396cce
Add testing target
eu9ene Nov 30, 2021
b67e1ba
Add support of running snakemake target
eu9ene Nov 30, 2021
3549d95
Fix snakemake target
eu9ene Nov 30, 2021
43a81e1
Speed up test training
eu9ene Nov 30, 2021
eddef5a
Fix student training
eu9ene Dec 2, 2021
f33acaa
Fix GPU out of memory issue with ensembles
eu9ene Dec 2, 2021
4b3ab03
Fix quantization resources
eu9ene Dec 2, 2021
2b6332f
Fix dry run
eu9ene Dec 2, 2021
18b39ae
Update default stopping settings
eu9ene Dec 2, 2021
21bad0f
Update default stopping criteria
eu9ene Dec 2, 2021
Add experiments tracking and caching interface
eu9ene committed Aug 13, 2021
commit 92205ce590a16204f36770ef6f34e747558d5fab
6 changes: 4 additions & 2 deletions config.sh
@@ -11,8 +11,10 @@ set -a

 WORKDIR=$(pwd)
 CUDA_DIR=/usr/local/cuda-11.2
-DATA_DIR=${DATA_DIR:-${WORKDIR}/data}
-MODELS_DIR=${MODELS_DIR:-${WORKDIR}/models}
+DATA_ROOT_DIR=${DATA_ROOT_DIR:-${WORKDIR}}
+DATA_DIR=${DATA_ROOT_DIR}/data
+MODELS_DIR=${DATA_ROOT_DIR}/models
+EXPERIMENTS_DIR=${DATA_ROOT_DIR}/experiments
 MARIAN=${MARIAN:-${WORKDIR}/3rd_party/marian-dev/build}
 CLEAN_TOOLS=${WORKDIR}/pipeline/clean/tools
 BIN=${WORKDIR}/bin
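The hunk above derives every directory from a single `DATA_ROOT_DIR`, which itself falls back to `WORKDIR`. The following is a minimal sketch of how that `${VAR:-default}` fallback behaves, using the variable names from the hunk; `resolve_dirs` and the example paths are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the new assignments in config.sh for illustration.
resolve_dirs() (
  WORKDIR=$1
  # Fall back to WORKDIR when DATA_ROOT_DIR is unset or empty.
  DATA_ROOT_DIR=${DATA_ROOT_DIR:-${WORKDIR}}
  echo "${DATA_ROOT_DIR}/data ${DATA_ROOT_DIR}/models ${DATA_ROOT_DIR}/experiments"
)

# Without an override, everything lives under the working directory.
default_dirs=$(unset DATA_ROOT_DIR; resolve_dirs /repo)
# With an override, all three directories move together.
custom_dirs=$(DATA_ROOT_DIR=/mnt/storage resolve_dirs /repo)
echo "${default_dirs}"
echo "${custom_dirs}"
```

One consequence visible in the hunk: `DATA_DIR` and `MODELS_DIR` are now assigned unconditionally, so they can no longer be overridden individually from the environment.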
1 change: 1 addition & 0 deletions pipeline/data/download-corpus.sh
@@ -15,6 +15,7 @@ test -v SRC
 test -v TRG

 prefix=$1
+cache=$2

 src_corpus="${prefix}.${SRC}.gz"
 trg_corpus="${prefix}.${TRG}.gz"
2 changes: 1 addition & 1 deletion pipeline/data/download-eval.sh
@@ -15,7 +15,7 @@ test -v WORKDIR
 test -v TEST_DATASETS

 dir=$1
-
+cache=$2

 for dataset in "${@:2}"; do
   name="${dataset//[^A-Za-z0-9_- ]/_}"
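With `cache` inserted as the second positional argument, dataset names start at the third one, while the loop shown in this hunk still iterates over `"${@:2}"`. Whether that is adjusted elsewhere in the script is not visible here. The following is a minimal sketch of one way to separate fixed arguments from a variable-length list, assuming the call shape `DIR CACHE DATASET...`; `parse_args` and the dataset names are hypothetical, and the `tr` call is only similar in spirit to the script's own name sanitizing.

```shell
#!/bin/sh
# Hypothetical sketch: assumes the call shape `script DIR CACHE DATASET...`.
parse_args() {
  dir=$1
  cache=$2
  # Drop the two fixed arguments so "$@" holds only dataset names.
  shift 2
  datasets=""
  for dataset in "$@"; do
    # Replace characters outside [A-Za-z0-9_-] with underscores.
    name=$(printf '%s' "${dataset}" | tr -c 'A-Za-z0-9_-' '_')
    datasets="${datasets}${datasets:+ }${name}"
  done
}

# Placeholder dataset names, not real corpus identifiers.
parse_args /data/eval /data/cache set_a set/b
echo "dir=${dir} cache=${cache} datasets=${datasets}"
```

Using `shift` keeps the loop independent of how many fixed arguments precede the list, so adding another one later does not require touching the loop.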
1 change: 1 addition & 0 deletions pipeline/data/download-mono.sh
@@ -14,6 +14,7 @@ echo "###### Downloading monolingual data"
 lang=$1
 max_sent=$2
 prefix=$3
+cache=$4

 file_name="${prefix}.${lang}.gz"
 dir=$(dirname "${prefix}")/mono
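The three download scripts now accept a `cache` argument, but the hunks shown do not include the code that uses it. The following is a hypothetical sketch of what a download-through-cache step could look like, reusing the `test -s ... ||` idiom that `flores.sh` already applies to its archive; `fetch_cached` and `fake_download` are illustrative names, not functions from the pipeline.

```shell
#!/bin/sh
# Hypothetical sketch: `fetch_cached` and `fake_download` are illustrative names only.
fake_download() {
  # Stand-in for wget or an importer script; writes a one-line file.
  echo "downloaded" > "$1"
}

fetch_cached() {
  cache=$1
  name=$2
  target=$3
  mkdir -p "${cache}"
  # Reuse the cached copy when it exists and is non-empty.
  test -s "${cache}/${name}" || fake_download "${cache}/${name}"
  cp "${cache}/${name}" "${target}"
}

work=$(mktemp -d)
fetch_cached "${work}/cache" news.2019.en "${work}/first"
# Mark the cached file so the second call can be seen to reuse it.
echo "from-cache" > "${work}/cache/news.2019.en"
fetch_cached "${work}/cache" news.2019.en "${work}/second"
first=$(cat "${work}/first")
second=$(cat "${work}/second")
echo "${first} ${second}"
rm -rf "${work}"
```

The first call populates the cache; the second finds a non-empty file and skips the download.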
8 changes: 3 additions & 5 deletions pipeline/data/importers/corpus/flores.sh
@@ -23,12 +23,10 @@ mkdir -p "${tmp}"
 test -s "${tmp}/flores101_dataset.tar.gz" ||
   wget -O "${tmp}/flores101_dataset.tar.gz" "https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz"

-tar -xzf "${tmp}/flores101_dataset.tar.gz"
+tar -xzf "${tmp}/flores101_dataset.tar.gz" -C "${tmp}" --no-same-owner

 source "${WORKDIR}/pipeline/setup/activate-python.sh"

-trg_flores=$(python -c "from mtdata.iso import iso3_code; print(iso3_code('${trg}', fail_error=True))")
-
 flores_code() {
   code=$1

@@ -46,8 +44,8 @@ flores_code() {
 src_flores=$(flores_code "${src}")
 trg_flores=$(flores_code "${trg}")

-pigz -c "${tmp}/flores101_dataset/${dataset}/${src_flores}.${dataset}" >"${dir}/flores.${src}"
-pigz -c "${tmp}/flores101_dataset/${dataset}/${trg_flores}.${dataset}" >"${dir}/flores.${trg}"
+cp "${tmp}/flores101_dataset/${dataset}/${src_flores}.${dataset}" "${dir}/flores.${src}"
+cp "${tmp}/flores101_dataset/${dataset}/${trg_flores}.${dataset}" "${dir}/flores.${trg}"

 rm -rf "${tmp}"
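The `tar` change above adds `-C "${tmp}"` so the archive is unpacked into the scratch directory that the later `cp` calls read from. The following is a minimal, self-contained sketch of that extract, copy, and clean-up sequence; the archive contents and file names are made up and stand in for the real FLORES download.

```shell
#!/bin/sh
# Build a tiny archive to stand in for the real download (hypothetical file names).
work=$(mktemp -d)
mkdir -p "${work}/src/dataset/dev"
echo "hello" > "${work}/src/dataset/dev/eng.dev"
tar -czf "${work}/dataset.tar.gz" -C "${work}/src" dataset

# Extract into a scratch directory instead of the current working directory.
tmp="${work}/tmp"
mkdir -p "${tmp}"
tar -xzf "${work}/dataset.tar.gz" -C "${tmp}" --no-same-owner

# Copy the plain-text file out, then remove the scratch directory.
out="${work}/flores.en"
cp "${tmp}/dataset/dev/eng.dev" "${out}"
rm -rf "${tmp}"
extracted=$(cat "${out}")
echo "${extracted}"
```

Without `-C`, `tar` extracts relative to the current directory, so paths under `${tmp}` would only exist if the script happened to run from there.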
55 changes: 39 additions & 16 deletions run.sh
@@ -13,9 +13,15 @@ set -euo pipefail
 # Directories structure
 #
 #├ data
-#│ ├ cache TODO
-#│ │ └ opus_wmt20.ru.gz
-#│ │ └ sacrebleu_wmt20.en.gz
+#│ ├ cache
+#│ │ ├ corpus
+#│ │ │ └ opus
+#│ │ │ ├ ada83_v1.en.gz
+#│ │ │ └ ada83_v1.ru.gz
+#│ │ └ mono
+#│ │ └ news-crawl
+#│ │ ├ news.2019.ru.gz
+#│ │ └ news.2019.en.gz
 #│ └ ru-en
 #│ └ test
 #│ ├ original
@@ -62,18 +68,21 @@ set -euo pipefail
 #│ │ ├ speed
 #│ │ └ exported
 #│ ├ en-ru
-#│ │ └ test
-#│ │ └ s2s
+#│ └ test
+#│ └ s2s
+#│
+#├ experiments
+#│ └ ru-en
+#│ └ test
+#│ └ config.sh

 echo "###### read config "
 source ./config.sh

-echo "###### setup"
-bash ./pipeline/setup/install-all.sh
-
 echo "###### set common variables"
 # data
-data_dir="${DATA_DIR}/${SRC}-${TRG}/${EXPERIMENT}"
+data_dir="${DATA_ROOT_DIR}/data/${SRC}-${TRG}/${EXPERIMENT}"
+cache_dir="${DATA_ROOT_DIR}/cache"
 original="${data_dir}/original"
 evaluation="${data_dir}/evaluation"
 clean="${data_dir}/clean"
Expand All @@ -84,22 +93,36 @@ merged="${data_dir}/merged"
filtered="${data_dir}/filtered"
align_dir="${data_dir}/alignment"
# models
models_dir="${MODELS_DIR}/${SRC}-${TRG}/${EXPERIMENT}"
models_dir="${DATA_ROOT_DIR}/models/${SRC}-${TRG}/${EXPERIMENT}"
student_dir="${models_dir}/student"
student_finetuned_dir="${models_dir}/student-finetuned"
teacher_dir="${models_dir}/teacher"
s2s="${MODELS_DIR}/${TRG}-${SRC}/${EXPERIMENT}/s2s"
s2s="${DATA_ROOT_DIR}/models/${TRG}-${SRC}/${EXPERIMENT}/s2s"
speed="${models_dir}/speed"
exported="${models_dir}/exported"

echo "###### save experiment "
experiment_dir="${EXPERIMENTS_DIR}/${SRC}-${TRG}/${EXPERIMENT}"
mkdir -p "${experiment_dir}"
cp ./config.sh "${experiment_dir}/config.sh"
cp -r ./pipeline/translate/configs "${experiment_dir}/"

echo "###### setup"
bash ./pipeline/setup/install-all.sh

echo "###### download data"
bash ./pipeline/data/download-corpus.sh "${original}/corpus" ${TRAIN_DATASETS}
bash ./pipeline/data/download-corpus.sh "${original}/devset" ${DEVTEST_DATASETS}
bash ./pipeline/data/download-eval.sh "${evaluation}" ${TEST_DATASETS}
# shellcheck disable=SC2086
bash ./pipeline/data/download-corpus.sh "${original}/corpus" "${cache_dir}" ${TRAIN_DATASETS}
# shellcheck disable=SC2086
bash ./pipeline/data/download-corpus.sh "${original}/devset" "${cache_dir}" ${DEVTEST_DATASETS}
# shellcheck disable=SC2086
bash ./pipeline/data/download-eval.sh "${evaluation}" "${cache_dir}" ${TEST_DATASETS}
# shellcheck disable=SC2086
test -n "${MONO_DATASETS_SRC}" &&
bash ./pipeline/data/download-mono.sh "${SRC}" "${MONO_MAX_SENTENCES_SRC}" "${original}/mono" ${MONO_DATASETS_SRC}
bash ./pipeline/data/download-mono.sh "${SRC}" "${MONO_MAX_SENTENCES_SRC}" "${original}/mono" "${cache_dir}" ${MONO_DATASETS_SRC}
# shellcheck disable=SC2086
test -n "${MONO_DATASETS_TRG}" &&
bash ./pipeline/data/download-mono.sh "${TRG}" "${MONO_MAX_SENTENCES_TRG}" "${original}/mono" ${MONO_DATASETS_TRG}
bash ./pipeline/data/download-mono.sh "${TRG}" "${MONO_MAX_SENTENCES_TRG}" "${original}/mono" "${cache_dir}" ${MONO_DATASETS_TRG}

echo "###### clean data"
bash ./pipeline/clean/clean-corpus.sh "${original}/corpus" "${clean}/corpus"
Expand Down
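The `# shellcheck disable=SC2086` comments silence ShellCheck's warning about unquoted expansions, which here appear to be deliberate: each dataset list is passed unquoted so the shell splits it into separate arguments. The following is a minimal sketch of the difference, assuming the lists are space-separated strings; `count_datasets` and the dataset names are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical stand-in for a download script: reports how many datasets it received.
count_datasets() {
  prefix=$1
  cache=$2
  shift 2
  echo "$#"
}

# Assumed format: dataset names separated by spaces in a single variable.
TRAIN_DATASETS="dataset_a dataset_b"

# Unquoted: the shell splits the value into one argument per dataset.
# shellcheck disable=SC2086
unquoted=$(count_datasets /data/corpus /data/cache ${TRAIN_DATASETS})
# Quoted: the whole list arrives as a single argument.
quoted=$(count_datasets /data/corpus /data/cache "${TRAIN_DATASETS}")
echo "unquoted=${unquoted} quoted=${quoted}"
```

Relying on word splitting works as long as no dataset name contains whitespace or glob characters; a bash array would be the usual alternative if that assumption stopped holding.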