Hi,
I trained language models for 17 languages on Wikipedia dumps + OPUS, which can be integrated into flair :) This is the result of ~2 months of work.
Language models
Training data are (a) a recent Wikipedia dump and (b) corpora from OPUS. Training was done for one epoch over the full training corpus.
Language (Code) | Tokens (training) | Forward ppl | Backward ppl |
---|---|---|---|
Arabic (ar) | 736,512,400 | 3.39 | 3.45 |
Bulgarian (bg) | 111,336,781 | 2.46 | 2.47 |
Czech (cs) | 442,892,103 | 2.89 | 2.90 |
Danish (da) | 325,816,384 | 2.62 | 2.68 |
Basque (eu) | 36,424,055 | 2.64 | 2.31 |
Persian (fa) | 146,619,206 | 3.68 | 3.66 |
Finnish (fi) | 427,194,262 | 2.63 | 2.65 |
Hebrew (he) | 502,949,245 | 3.84 | 3.87 |
Hindi (hi) | 28,936,996 | 2.87 | 2.86 |
Croatian (hr) | 625,084,958 | 3.13 | 3.20 |
Indonesian (id) | 174,467,241 | 2.80 | 2.74 |
Italian (it) | 1,549,430,560 | 2.62 | 2.63 |
Dutch (nl) | 1,275,949,108 | 2.43 | 2.55 |
Norwegian (no) | 156,076,225 | 3.01 | 3.01 |
Polish (pl) | 1,428,604,528 | 2.95 | 2.84 |
Slovenian (sl) | 419,744,423 | 2.88 | 2.91 |
Swedish (sv) | 671,922,632 | 6.82 | 2.25 |
Download links:
wget https://schweter.eu/cloud/flair-lms/lm-ar-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-ar-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-bg-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-bg-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-cs-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-cs-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-da-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-da-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-eu-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-eu-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fa-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fa-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fi-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fi-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-he-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-he-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hi-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hi-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hr-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hr-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-id-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-id-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-it-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-it-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-nl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-nl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-no-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-no-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-pl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-pl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-opus-large-backward-v0.1.pt
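Once downloaded, a checkpoint can be loaded into flair as embeddings. Here is a minimal sketch, assuming the Danish models from above sit in the working directory and a flair version where `FlairEmbeddings` accepts a local checkpoint path (older versions used `CharLMEmbeddings` instead):

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

# load the downloaded forward and backward checkpoints
# (paths are assumptions -- adjust to wherever you saved the files)
embeddings = StackedEmbeddings([
    FlairEmbeddings("lm-da-opus-large-forward-v0.1.pt"),
    FlairEmbeddings("lm-da-opus-large-backward-v0.1.pt"),
])

# embed an example sentence and inspect the per-token vectors
sentence = Sentence("Dette er en test .")
embeddings.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)
```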
Hyperparameters:
Parameter | Value |
---|---|
hidden_size | 2048 |
nlayers | 1 |
sequence_length | 250 |
mini_batch_size | 100 |
Instead of using `common_chars`, all characters from the training corpus are used as the vocabulary for language model training (see the training sketch below).
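For reference, this is roughly how such a model is trained with flair's `LanguageModelTrainer`, using the hyperparameters above. A minimal sketch, assuming the corpus is laid out as flair expects (`corpus/train/` with split files, `corpus/valid.txt`, `corpus/test.txt`); the file path used to build the dictionary is an assumption:

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# build the vocabulary from all characters in the training corpus
# instead of loading flair's pre-built common_chars dictionary
char_dictionary = Dictionary()
with open("corpus/train/train_split_1", encoding="utf-8") as f:
    for line in f:
        for char in line:
            char_dictionary.add_item(char)

is_forward_lm = True  # set to False (and retrain) for the backward LM

# character-level corpus reader over the train/valid/test layout
corpus = TextCorpus("corpus", char_dictionary, is_forward_lm, character_level=True)

# hyperparameters from the table above
language_model = LanguageModel(char_dictionary, is_forward_lm,
                               hidden_size=2048, nlayers=1)

# one epoch over the full training corpus, as described above
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train("resources/lm-forward",
              sequence_length=250, mini_batch_size=100, max_epochs=1)
```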
PoS Tagging on Universal Dependencies (v1.2)
To test the new language models on a downstream task, results for PoS tagging on Universal Dependencies (v1.2) are reported, with comparisons to previously published results (a training sketch follows the hyperparameter table below).
Language (Code) | Yu et al. (2017) | Plank et al. (2016) | Yasunaga et al. (2017) | Flair | Δ (vs. best prior) |
---|---|---|---|---|---|
Arabic (ar) | 99.00 | 98.91 | n.a. | 98.86 | -0.14 |
Bulgarian (bg) | 98.20 | 98.23 | 98.53 | 99.18 | 0.65 |
Czech (cs) | 98.79 | 98.24 | 98.81 | 99.14 | 0.33 |
Danish (da) | 95.92 | 96.35 | 96.74 | 98.48 | 1.74🔥 |
Basque (eu) | 94.94 | 95.51 | 94.71 | 97.30 | 1.79🔥 |
Persian (fa) | 97.12 | 97.60 | 97.51 | 98.15 | 0.55 |
Finnish (fi) | 95.31 | 95.85 | 95.40 | 98.11 | 2.26🔥 |
Hebrew (he) | 96.04 | 96.96 | 97.43 | 97.67 | 0.24 |
Hindi (hi) | 96.96 | 97.10 | 97.21 | 97.85 | 0.64 |
Croatian (hr) | 95.05 | 96.82 | 96.32 | 97.43 | 0.61 |
Indonesian (id) | 93.44 | 93.41 | 94.03 | 93.85 | -0.18 |
Dutch (nl) | 93.11 | 93.82 | 93.09 | 94.03 | 0.21 |
Norwegian (no) | 97.65 | 98.06 | 98.08 | 98.73 | 0.65 |
Polish (pl) | 96.83 | 97.63 | 97.57 | 98.81 | 1.18🔥 |
Slovenian (sl) | 97.16 | 96.97 | 98.11 | 99.02 | 0.91 |
Swedish (sv) | 96.28 | 96.69 | 96.70 | 98.54 | 1.84🔥 |
Hyperparameters:
Parameter | Value |
---|---|
hidden_size | 512 |
learning_rate | 0.1 |
mini_batch_size | 8 |
max_epochs | 500 |
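A tagger along these lines can be trained with flair's `ModelTrainer`. A minimal sketch for Danish, assuming the downloaded checkpoints are in the working directory; note that `flair.datasets.UD_DANISH` fetches a current UD release rather than v1.2, and `make_tag_dictionary` may be named differently depending on your flair version, so this is illustrative rather than an exact reproduction:

```python
from flair.datasets import UD_DANISH
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load the UD corpus for the target language (Danish as an example)
corpus = UD_DANISH()
tag_type = "upos"
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# stack the forward and backward language models trained above
embeddings = StackedEmbeddings([
    FlairEmbeddings("lm-da-opus-large-forward-v0.1.pt"),
    FlairEmbeddings("lm-da-opus-large-backward-v0.1.pt"),
])

# hyperparameters from the table above
tagger = SequenceTagger(
    hidden_size=512,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
)

trainer = ModelTrainer(tagger, corpus)
trainer.train("resources/taggers/upos-da",
              learning_rate=0.1,
              mini_batch_size=8,
              max_epochs=500)
```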
Results on Universal Dependencies show a new state of the art for all languages except Arabic and Indonesian.