Quality improvements #29
eu9ene commented on Nov 25, 2021 (edited)
- Update teacher hyperparameters (provided by Kenneth). fixes Try other teacher hyperparameters #18
- Use chrf metric for the best model (https://discourse.translatelocally.com/t/marian-configuration-to-use/24)
- Update SacreBLEU and add chrF metric to evaluation. fixes Add chrF evaluation metric #28
- Add evaluation of each model and the whole ensemble of teachers. fixes Evaluate ensemble of teachers #27
- Early stop based on ce-mean-words instead of bleu-detok. Q: not sure about this one, is it OK to stop based on ce-mean-words but then use the best chrF model?
- Continue teacher training on parallel data only (train on augmented data for 5 epochs first). fixes Fine tune teacher on parallel data only #23 Q: how many epochs should we use? What about low-resource languages, should we fine-tune on a limited corpus?
- Do cleaning per dataset
- Add per-dataset fixes from https://github.com/ZJaume/clean/tree/master/fixes. fixes Add dataset specific fixes #15
- Use bicleaner per dataset with customizable thresholds (see the filtering sketch after this list). fixes Add dataset specific bicleaner thresholds #22 Q: I added some arbitrary thresholds based on Kenneth's comments; how should we tune them?
- Remove punctuation normalization based on Ulrich's input. fixes Remove punctuation normalization #25
- Add alphabets for more languages in the cleaning scripts. fixes Add more languages to corpus cleaning script #19
- Replace absolute paths with relative ones. fixes Replace absolute paths with relative ones #9
- Add Snakemake cross-workflow caching. Caching works, but there appears to be a bug in Snakemake: it doesn't recognize symlinks after caching, so caching is disabled for now. fixes Cache datasets across experiments and language pairs #6
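For the per-dataset bicleaner thresholds above, a minimal filtering sketch could look like the following. The dataset name and the 0.5 value are placeholders, and it assumes bicleaner wrote a TSV with the score in the third column; this is an illustration, not the pipeline's actual settings.

dataset="opus_ParaCrawl"   # hypothetical dataset name
threshold=0.5              # per-dataset value, still to be tuned
# keep only sentence pairs whose bicleaner score reaches the threshold
pigz -dc "${dataset}.scored.gz" \
  | awk -F'\t' -v t="${threshold}" '$3 >= t' \
  | pigz >"${dataset}.filtered.gz"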
Re Q5: the model optimises towards CE, as opposed to towards BLEU. Stopping training based on the CE stalling criterion means you stop when the model starts to diverge. BLEU/chrF shouldn't be used as a stopping criterion because they don't correspond to the current state of training.

Re 1: the transformer-big configuration should not be used as a hard-and-fast rule. In general, if you have fewer than 5M sentence pairs you are unlikely to see any benefit from transformer-big; use transformer-base instead.
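As a hedged sketch of how that split could look in the Marian flags (paths and the patience value are placeholders, and it assumes a Marian version that offers chrf as a validation metric): early stopping follows the first metric listed in --valid-metrics, while --keep-best retains the best checkpoint per metric, so the best-chrF model can still be picked afterwards.

# sketch only: stop on ce-mean-words, keep best checkpoints per metric
marian \
  --model teacher/model.npz \
  --train-sets corpus.src.gz corpus.trg.gz \
  --valid-sets dev.src dev.trg \
  --valid-metrics ce-mean-words chrf bleu-detok \
  --early-stopping 20 \
  --keep-best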
This is a very valuable insight. I haven't started with low-resource languages yet; quite a few things must be different there. I'll add a transformer-base configuration and create another main config for low-resource languages as an example.
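A small shell sketch of that switch, assuming a Marian build that provides the --task transformer-base / transformer-big aliases; the corpus file name is a placeholder and the 5M cut-off follows the comment above.

# pick the teacher preset by parallel corpus size (rule of thumb from above)
pairs=$(pigz -dc "corpus.${SRC}.gz" | wc -l)
if [ "${pairs}" -lt 5000000 ]; then
  marian_task="transformer-base"
else
  marian_task="transformer-big"
fi
echo "Training teacher with --task ${marian_task}"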
just to be clear that is
pipeline/bicleaner/bicleaner.sh (outdated)
pigz -dc "${output_prefix}.best.gz" | cut -f1 | pigz >"${output_prefix}.${SRC}.gz"
pigz -dc "${output_prefix}.best.gz" | cut -f2 | pigz >"${output_prefix}.${TRG}.gz"
This can be accomplished in half the time by using tee, since the file is decompressed only once and the stream is duplicated to both cut pipelines:
pigz -dc "${output_prefix}.best.gz" \
| tee >(cut -f1 | pigz >"${output_prefix}.${SRC}.gz") \
| cut -f2 | pigz >"${output_prefix}.${TRG}.gz"
fixed
I added comments in
This is looking great. Thanks @eu9ene!