Quality improvements #29
eu9ene commented on Nov 25, 2021 (edited)
- Update teacher hyperparameters (provided by Kenneth). fixes Try other teacher hyperparameters #18
- Use chrf metric for the best model (https://discourse.translatelocally.com/t/marian-configuration-to-use/24)
- Update SacreBLEU and add chrF metric to evaluation. fixes Add chrF evaluation metric #28
- Add evaluation of each model and the whole ensemble of teachers. fixes Evaluate ensemble of teachers #27
- Early stop based on ce-mean-words instead of bleu-detok. Q: not sure about this one, is it OK to stop based on ce-mean-words but then use the best chrF model?
- Continue teacher training on parallel data only (train on augmented data for 5 epochs first). fixes Fine tune teacher on parallel data only #23 Q: how many epochs should we use? What about low-resource languages, should we fine-tune on a limited corpus?
- Do cleaning per dataset
- Add per-dataset fixes from https://github.com/ZJaume/clean/tree/master/fixes. fixes Add dataset specific fixes #15
- Use bicleaner per dataset with customizable thresholds (see the filtering sketch after this list). fixes Add dataset specific bicleaner thresholds #22 Q: I added some arbitrary thresholds based on Kenneth's comments; how should we tune them?
- Remove punctuation normalization based on Ulrich's input. fixes Remove punctuation normalization #25
- Add alphabets for more languages in the cleaning scripts. fixes Add more languages to corpus cleaning script #19
- Replace absolute paths with relative ones. fixes Replace absolute paths with relative ones #9
- Add Snakemake cross-workflow caching. Caching works, but there appears to be a bug in Snakemake: it doesn't recognize symlinks after caching, so caching is disabled for now. fixes Cache datasets across experiments and language pairs #6
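For the per-dataset bicleaner thresholds above, a minimal filtering sketch could look like the following. The dataset name and the 0.5 value are placeholders, and it assumes bicleaner wrote a TSV with the score in the third column; this is an illustration, not the pipeline's actual settings.

dataset="opus_ParaCrawl"   # hypothetical dataset name
threshold=0.5              # per-dataset value, still to be tuned
# keep only sentence pairs whose bicleaner score reaches the threshold
pigz -dc "${dataset}.scored.gz" \
  | awk -F'\t' -v t="${threshold}" '$3 >= t' \
  | pigz >"${dataset}.filtered.gz"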
Re Q5: the model optimises towards CE, as opposed to towards BLEU. Stopping training based on the CE stalling criterion means you stop when the model starts to diverge. BLEU/chrF shouldn't be used as a stopping criterion because they don't correspond to the current state of training.

Re 1: the transformer-big configuration should not be used as a hard-and-fast rule. In general, if you have fewer than 5M sentence pairs you are unlikely to see any benefit from transformer-big; use transformer-base instead.
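As a hedged sketch of how that split could look in the Marian flags (paths and the patience value are placeholders, and it assumes a Marian version that offers chrf as a validation metric): early stopping follows the first metric listed in --valid-metrics, while --keep-best retains the best checkpoint per metric, so the best-chrF model can still be picked afterwards.

# sketch only: stop on ce-mean-words, keep best checkpoints per metric
marian \
  --model teacher/model.npz \
  --train-sets corpus.src.gz corpus.trg.gz \
  --valid-sets dev.src dev.trg \
  --valid-metrics ce-mean-words chrf bleu-detok \
  --early-stopping 20 \
  --keep-best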
This is a very valuable insight. I haven't started with low-resource languages yet; quite a few things must be different there. I'll add a transformer-base configuration and create another main config for low-resource languages as an example.
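A small shell sketch of that switch, assuming a Marian build that provides the --task transformer-base / transformer-big aliases; the corpus file name is a placeholder and the 5M cut-off follows the comment above.

# pick the teacher preset by parallel corpus size (rule of thumb from above)
pairs=$(pigz -dc "corpus.${SRC}.gz" | wc -l)
if [ "${pairs}" -lt 5000000 ]; then
  marian_task="transformer-base"
else
  marian_task="transformer-big"
fi
echo "Training teacher with --task ${marian_task}"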
just to be clear that is
pipeline/bicleaner/bicleaner.sh (outdated)
pigz -dc "${output_prefix}.best.gz" | cut -f1 | pigz >"${output_prefix}.${SRC}.gz"
pigz -dc "${output_prefix}.best.gz" | cut -f2 | pigz >"${output_prefix}.${TRG}.gz"
This can be accomplished in half the time by using tee, since the file is decompressed only once and the stream is duplicated to both cut pipelines:
pigz -dc "${output_prefix}.best.gz" \
| tee >(cut -f1 | pigz >"${output_prefix}.${SRC}.gz") \
| cut -f2 | pigz >"${output_prefix}.${TRG}.gz"
fixed
I added comments in
This is looking great. Thanks @eu9ene!