Investigate using larger student models #174

Closed
Tracked by #216
marco-c opened this issue Aug 30, 2023 · 7 comments
Assignees: eu9ene
Labels: experiment (A training experiment with hypothesis and results), quality (Improving robustness and translation quality)

Comments

marco-c (Collaborator) commented Aug 30, 2023

Current models are around 20 MB, which is very small for Desktop. We could try larger models and see what kind of quality improvements we can get.

marco-c added the quality (Improving robustness and translation quality) label on Aug 30, 2023
marco-c added the experiment (A training experiment with hypothesis and results) label on Oct 10, 2024
eu9ene self-assigned this on Oct 10, 2024
eu9ene (Collaborator) commented Oct 10, 2024

I will try training the student with the Large or Base configurations from https://aclanthology.org/D19-5632.pdf. Our current configuration is Tiny.

[Screenshot, 2024-10-10: model configurations from the paper]

eu9ene (Collaborator) commented Oct 10, 2024

I looked at the recommended configurations in https://github.com/browsermt/students/tree/master/train-student/models and at what the HPLT folks are training at https://github.com/hplt-project/bitextor-mt-models/tree/main. It seems the recommended approach for training a larger model is to go with the Base configuration. The difference I see in the code is basically these two model parameters:

base

    dim-emb: 512
    transformer-dim-ffn: 2048

tiny

    dim-emb: 256
    transformer-dim-ffn: 1536
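
For concreteness, here is a minimal Marian-style YAML sketch of the two variants side by side. Only `dim-emb` and `transformer-dim-ffn` come from the comparison above; the surrounding keys (encoder/decoder depth, SSRU decoder) are assumptions based on typical browsermt student recipes and are not confirmed in this issue.

```yaml
# Sketch only: shared student settings assumed from typical browsermt recipes.
enc-depth: 6                        # assumption, not confirmed in this issue
dec-depth: 2                        # assumption, not confirmed in this issue
transformer-decoder-autoreg: rnn    # assumption: SSRU decoder used by speed-optimized students
dec-cell: ssru                      # assumption

# "tiny" (current configuration)
dim-emb: 256
transformer-dim-ffn: 1536

# "base" (proposed larger configuration)
# dim-emb: 512
# transformer-dim-ffn: 2048
```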

eu9ene (Collaborator) commented Oct 10, 2024

It can give a bump of ~2 BLEU points at roughly half the speed; see for example https://github.com/hplt-project/bitextor-mt-models/tree/main/swa-eng.

I launched training for the en-ru student in the Base configuration:
https://firefox-ci-tc.services.mozilla.com/tasks/P8eqvTKvRp640O8mcMZtWw/runs/0
https://wandb.ai/moz-translations/en-ru/runs/mjnbhst5?nw=nwuserepavlov

I also reduced early-stopping from 20 to 10. See #864
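
For reference, a sketch of how that change maps onto the Marian setting (the exact training config file isn't shown in this issue):

```yaml
# Stop training after this many consecutive validation checks without improvement.
early-stopping: 10   # reduced from 20 (see #864)
```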

eu9ene (Collaborator) commented Oct 11, 2024

It looks pretty good so far! The gap is significantly smaller on the validation set now.

[Screenshot, 2024-10-11: validation metrics during training]

eu9ene (Collaborator) commented Oct 15, 2024

The student in the base config has finished training and the metrics look a lot better. The base configuration improved COMET by +2.9 (quantized: 84.88 → 87.78 on flores-test):

| Config | flores-test COMET | Early stopping | Training time | Size |
| --- | --- | --- | --- | --- |
| tiny quantized | 84.88 | | | 16 MB |
| tiny student | 85.62 | 20 | 15 days | 80 MB |
| base quantized | 87.78 | | | 41 MB |
| base student | 87.90 | 10 | 4 days | 176 MB |
| teacher ensemble | 89.17 | 30 | 9 days | 801 MB x 2 |
| google | 90.23 | | | |
[Screenshot, 2024-10-15: training charts; yellow student = tiny, purple student = base]

gregtatum (Member) commented

For clarity, since I was confused for a bit: tiny and base here don't quite match up with the paper, as the paper's base model also increases the decoder depth. So the terms have slightly different meanings in this issue.

eu9ene closed this as completed on Oct 18, 2024
gregtatum (Member) commented

(I updated the summary comment for some clarity on the result)
