Quality improvements #29

Merged
merged 191 commits from the quality branch into mozilla:main on Dec 6, 2021

Conversation

eu9ene
Collaborator

@eu9ene eu9ene commented Nov 25, 2021

  1. Update teacher hyperparameters (provided by Kenneth). fixes Try other teacher hyperparameters #18
  2. Use chrf metric for the best model (https://discourse.translatelocally.com/t/marian-configuration-to-use/24)
  3. Update SacreBLEU and add chrF metric to evaluation. fixes Add chrF evaluation metric #28
  4. Add evaluation of each model and the whole ensemble of teachers. fixes Evaluate ensemble of teachers #27
  5. Early stopping based on ce-mean-words instead of bleu-detok. Q: Not sure about this one, is it OK to stop based on ce-mean-words but then use the best chrF model?
  6. Continue teacher training on parallel data only (train on augmented data for 5 epochs first). fixes Fine tune teacher on parallel data only #23 Q: How many epochs should we use? What about low-resource languages, should we fine-tune on a limited corpus?
  7. Do cleaning per dataset
  8. Add per-dataset fixes from https://github.com/ZJaume/clean/tree/master/fixes. fixes Add dataset specific fixes  #15
  9. Use bicleaner per dataset with customizable thresholds (see the config sketch after this list). fixes Add dataset specific bicleaner thresholds #22 Q: I added some rough initial thresholds based on Kenneth's comments; how should we tune them?
  10. Remove punctuation normalization based on Ulrich's input. fixes Remove punctuation normalization #25
  11. Add alphabets for more languages in the cleaning scripts. fixes Add more languages to corpus cleaning script #19
  12. Replace absolute paths with relative ones. fixes Replace absolute paths with relative ones #9
  13. Add Snakemake cross-workflow caching. Caching works, but apparently there is a bug in Snakemake: it doesn't recognize symlinks after caching. Disabled for now. fixes Cache datasets across experiments and language pairs #6
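
For item 9, a minimal sketch of what per-dataset bicleaner thresholds might look like in an experiment config. The key names, dataset identifiers, and threshold values below are illustrative assumptions, not the exact schema added in this PR:

bicleaner:
  # sentence pairs scoring below the threshold are dropped
  default-threshold: 0.5
  # per-dataset overrides; noisier crawled corpora get stricter thresholds
  dataset-thresholds:
    opus_ParaCrawl_v8: 0.6
    opus_OpenSubtitles_v2018: 0.4

Thresholds like these would still need tuning per language pair, which is the open question in item 9.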

@XapaJIaMnu
Contributor

Re Q5: The model optimises towards CE, as opposed to BLEU. Stopping training based on the CE stalling criterion means you stop training when the model starts to diverge. BLEU/chrF shouldn't be used as a stopping criterion because they don't correspond to the current state of training.
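
To make the combination concrete, here is a rough sketch of the relevant Marian training options, assuming standard Marian semantics: the first entry in valid-metrics drives early stopping, while keep-best also saves a best checkpoint per metric (e.g. model.npz.best-chrf.npz), which the pipeline can then pick up. The exact values are placeholders:

# early stopping is driven by the first validation metric (ce-mean-words)
valid-metrics:
  - ce-mean-words
  - chrf
  - bleu-detok
# stop after this many validations without improvement
early-stopping: 20
# keep a best checkpoint per metric, e.g. model.npz.best-chrf.npz
keep-best: true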

Re 1). The transformer-big configuration should not be used as a hard and fast rule. In general if you have less than 5M sentence pairs you are unlikely to see any benefit from transformer-big. Use transformer base instead.

Outdated review threads on configs/config.test.yml and configs/config.prod.yml (resolved).
@eu9ene
Collaborator Author

eu9ene commented Nov 25, 2021

Re 1). The transformer-big configuration should not be used as a hard and fast rule. In general if you have less than 5M sentence pairs you are unlikely to see any benefit from transformer-big. Use transformer base instead.

This is a very valuable insight. I haven't started with low-resource languages yet; quite a few things must be different there. I'll add a transformer-base configuration and create another main config for low-resource languages as an example.

@XapaJIaMnu
Contributor

Re 1). The transformer-big configuration should not be used as a hard and fast rule. In general if you have less than 5M sentence pairs you are unlikely to see any benefit from transformer-big. Use transformer base instead.

This is a very valuable insight. I haven't started with low-resource languages yet; quite a few things must be different there. I'll add a transformer-base configuration and create another main config for low-resource languages as an example.

Just to be clear, that is task: transformer-base: https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L75
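
For reference, a minimal sketch of how that alias could be set in a Marian-style model config such as pipeline/train/configs/model/teacher.yml (the surrounding file contents are assumed, not shown in this PR):

# Marian alias that expands to the transformer-base preset
# (defined in src/common/aliases.cpp in marian-dev)
task: transformer-base
# with roughly 5M+ sentence pairs, transformer-big may be worth it:
# task: transformer-big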

Comment on lines 43 to 44
pigz -dc "${output_prefix}.best.gz" | cut -f1 | pigz >"${output_prefix}.${SRC}.gz"
pigz -dc "${output_prefix}.best.gz" | cut -f2 | pigz >"${output_prefix}.${TRG}.gz"

This can be accomplished in half the time by using tee:

# decompress once and split source/target columns in a single pass
pigz -dc "${output_prefix}.best.gz" \
| tee >(cut -f1 | pigz >"${output_prefix}.${SRC}.gz") \
| cut -f2 | pigz >"${output_prefix}.${TRG}.gz"

Collaborator Author


fixed

@eu9ene
Collaborator Author

eu9ene commented Dec 2, 2021

Re 1). The transformer-big configuration should not be used as a hard and fast rule. In general if you have less than 5M sentence pairs you are unlikely to see any benefit from transformer-big. Use transformer base instead.

This is a very valuable insight. I haven't started with low-resource languages yet; quite a few things must be different there. I'll add a transformer-base configuration and create another main config for low-resource languages as an example.

just to be clear that is task: transformer-base https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L75

I added comments in pipeline/train/configs/model/teacher.yml. So far it's a fairly manual process to reconfigure everything for another model. Once all this is stabilized and we have trained a bunch of models, I can think of better ways to provide ready-to-use recipes for low-resource languages.

@eu9ene eu9ene marked this pull request as ready for review December 2, 2021 03:24
@eu9ene eu9ene requested a review from andrenatal December 6, 2021 20:10
@andrenatal
Contributor

This is looking great. Thanks @eu9ene!

@eu9ene eu9ene merged commit 3b3f33b into mozilla:main Dec 6, 2021
@eu9ene eu9ene deleted the quality branch December 6, 2021 23:03