Investigate using larger student models #174

Closed
Tracked by #216
marco-c opened this issue Aug 30, 2023 · 7 comments
Assignees: eu9ene
Labels: experiment (A training experiment with hypothesis and results), quality (Improving robustness and translation quality)

Comments

marco-c (Collaborator) commented Aug 30, 2023

Current models are around 20 MB, which is very small for Desktop. We could try larger models and see what kind of quality improvements we can get.

marco-c added the quality (Improving robustness and translation quality) label on Aug 30, 2023
marco-c added the experiment (A training experiment with hypothesis and results) label on Oct 10, 2024
eu9ene self-assigned this on Oct 10, 2024
eu9ene (Collaborator) commented Oct 10, 2024

I will try training the student with the Large or Base configurations from https://aclanthology.org/D19-5632.pdf. Our current configuration is Tiny.

[Screenshot, 2024-10-10: model configurations from the paper]

eu9ene (Collaborator) commented Oct 10, 2024

I looked at the recommended configurations in https://github.com/browsermt/students/tree/master/train-student/models and at what the HPLT folks are training at https://github.com/hplt-project/bitextor-mt-models/tree/main. It seems the recommended approach for training a larger model is to go with the Base configuration. The difference I see in the code is basically these two model parameters:

base

    dim-emb: 512
    transformer-dim-ffn: 2048

tiny

    dim-emb: 256
    transformer-dim-ffn: 1536
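
For concreteness, here is a minimal Marian-style YAML sketch of the two variants side by side. Only `dim-emb` and `transformer-dim-ffn` come from the comparison above; the surrounding keys (encoder/decoder depth, SSRU decoder) are assumptions based on typical browsermt student recipes and are not confirmed in this issue.

```yaml
# Sketch only: shared student settings assumed from typical browsermt recipes.
enc-depth: 6                        # assumption, not confirmed in this issue
dec-depth: 2                        # assumption, not confirmed in this issue
transformer-decoder-autoreg: rnn    # assumption: SSRU decoder used by speed-optimized students
dec-cell: ssru                      # assumption

# "tiny" (current configuration)
dim-emb: 256
transformer-dim-ffn: 1536

# "base" (proposed larger configuration)
# dim-emb: 512
# transformer-dim-ffn: 2048
```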

eu9ene (Collaborator) commented Oct 10, 2024

It can give a bump of ~2 BLEU points at roughly half the speed; see for example https://github.com/hplt-project/bitextor-mt-models/tree/main/swa-eng.

I launched training for the en-ru student in the Base configuration:
https://firefox-ci-tc.services.mozilla.com/tasks/P8eqvTKvRp640O8mcMZtWw/runs/0
https://wandb.ai/moz-translations/en-ru/runs/mjnbhst5?nw=nwuserepavlov

I also reduced early-stopping from 20 to 10. See #864
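
For reference, a sketch of how that change maps onto the Marian setting (the exact training config file isn't shown in this issue):

```yaml
# Stop training after this many consecutive validation checks without improvement.
early-stopping: 10   # reduced from 20 (see #864)
```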

eu9ene (Collaborator) commented Oct 11, 2024

It looks pretty good so far! The gap is significantly smaller on the validation set now.

[Screenshot, 2024-10-11: validation metrics during training]

eu9ene (Collaborator) commented Oct 15, 2024

The student in the base config has finished training and the metrics look a lot better. The base configuration improved COMET by +2.9 (quantized: 84.88 → 87.78 on flores-test):

| Config | flores-test COMET | Early stopping | Training time | Size |
| --- | --- | --- | --- | --- |
| tiny quantized | 84.88 | | | 16 MB |
| tiny student | 85.62 | 20 | 15 days | 80 MB |
| base quantized | 87.78 | | | 41 MB |
| base student | 87.90 | 10 | 4 days | 176 MB |
| teacher ensemble | 89.17 | 30 | 9 days | 801 MB x 2 |
| google | 90.23 | | | |
[Screenshot, 2024-10-15: training charts; yellow student = tiny, purple student = base]

gregtatum (Member) commented

For clarity, since I was confused for a bit: tiny and base here don't quite match up with the paper, as the paper's base model also increases the decoder depth. So the terms have slightly different meanings in this issue.

eu9ene closed this as completed on Oct 18, 2024
gregtatum (Member) commented

(I updated the summary comment for some clarity on the result)
