I think the `teacher-big` task configuration used here and here can be optimized.
Regarding the speed of training (see the config sketch after this list):
- `beam-size` set to 4 should be enough for transformer-big models.
- `valid-mini-batch` set to 8 is a bit low; it could be set to 32 or 64.
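For reference, here is a minimal sketch of how those speed-related settings could look in the Marian training YAML. The key names are the standard Marian options; the surrounding file layout in the pipeline is not shown, and the values are simply the ones proposed above:

```yaml
# Speed-related settings for the teacher-big task (illustrative sketch only)
beam-size: 4          # 4 is usually enough when validating transformer-big teachers
valid-mini-batch: 32  # raise from 8; 64 should also be fine
```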
Regarding quality (a config sketch follows this list):
- `max-length` is set to 100, which is pretty low in my opinion (I typically use 400), especially if we are including sentences from certain EU corpora from OPUS that have very long lines, and segments from HPLT in backtranslation (remember that HPLT does not have its sentences split; they are in the corpus just as they appear on the original website). The `valid-max-length` is set to 300, which is OK, but `max-length` at 100 is causing all the training sentences over 100 tokens to be omitted, so the model is not learning from them (unless I'm missing a third configuration file in the pipeline).
- I've always used `swish` with no issues, but maybe there's no difference in using `relu`.
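For illustration, a hedged sketch of the quality-related options discussed above, again using the standard Marian key names. `max-length: 400` is my personal preference rather than a value taken from the existing config, and the activation line is only there to show where the `swish`/`relu` choice lives:

```yaml
# Quality-related settings (illustrative sketch only)
max-length: 400                    # raise from 100 so long OPUS/HPLT segments are not dropped
valid-max-length: 300              # current value; fine as-is
transformer-ffn-activation: swish  # relu may be equivalent; I have not seen a difference
# max-length-crop: true            # alternative: crop over-long sentences instead of dropping them
```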