Multilingual Language Models #614

Closed
@stefan-it

Hi,

I trained language models for 17 languages on Wikipedia dumps and OPUS corpora; they can be integrated into flair :) This is the result of about two months of work.

Language models

The training data consist of (a) a recent Wikipedia dump and (b) corpora from OPUS. Each model was trained for one epoch over its full training corpus.

| Language (Code) | Training tokens | Forward perplexity | Backward perplexity |
| --- | ---: | ---: | ---: |
| Arabic (ar) | 736,512,400 | 3.39 | 3.45 |
| Bulgarian (bg) | 111,336,781 | 2.46 | 2.47 |
| Czech (cs) | 442,892,103 | 2.89 | 2.90 |
| Danish (da) | 325,816,384 | 2.62 | 2.68 |
| Basque (eu) | 36,424,055 | 2.64 | 2.31 |
| Persian (fa) | 146,619,206 | 3.68 | 3.66 |
| Finnish (fi) | 427,194,262 | 2.63 | 2.65 |
| Hebrew (he) | 502,949,245 | 3.84 | 3.87 |
| Hindi (hi) | 28,936,996 | 2.87 | 2.86 |
| Croatian (hr) | 625,084,958 | 3.13 | 3.20 |
| Indonesian (id) | 174,467,241 | 2.80 | 2.74 |
| Italian (it) | 1,549,430,560 | 2.62 | 2.63 |
| Dutch (nl) | 1,275,949,108 | 2.43 | 2.55 |
| Norwegian (no) | 156,076,225 | 3.01 | 3.01 |
| Polish (pl) | 1,428,604,528 | 2.95 | 2.84 |
| Slovenian (sl) | 419,744,423 | 2.88 | 2.91 |
| Swedish (sv) | 671,922,632 | 6.82 | 2.25 |

Download links:

wget https://schweter.eu/cloud/flair-lms/lm-ar-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-ar-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-bg-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-bg-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-cs-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-cs-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-da-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-da-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-eu-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-eu-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fa-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fa-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fi-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-fi-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-he-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-he-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hi-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hi-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hr-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-hr-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-id-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-id-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-it-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-it-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-nl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-nl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-no-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-no-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-pl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-pl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sl-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sl-opus-large-backward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-opus-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-opus-large-backward-v0.1.pt
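
The download links above follow one naming pattern, so the full list can be generated instead of typed out. This is a minimal sketch: the base URL, language codes and file name pattern are copied from the lines above, while the helper names `model_url`, `all_model_urls` and `download_all` are made up for illustration.

```python
from pathlib import Path
from typing import List
from urllib.request import urlretrieve

# Base URL, language codes and directions are copied from the wget lines above.
BASE_URL = "https://schweter.eu/cloud/flair-lms"
LANGUAGES = ["ar", "bg", "cs", "da", "eu", "fa", "fi", "he", "hi",
             "hr", "id", "it", "nl", "no", "pl", "sl", "sv"]
DIRECTIONS = ["forward", "backward"]


def model_url(language: str, direction: str, version: str = "v0.1") -> str:
    """Build the download URL for one language model file."""
    return f"{BASE_URL}/lm-{language}-opus-large-{direction}-{version}.pt"


def all_model_urls() -> List[str]:
    """Return forward and backward URLs for every language, in table order."""
    return [model_url(lang, direction)
            for lang in LANGUAGES
            for direction in DIRECTIONS]


def download_all(target_dir: str) -> None:
    """Download every model into target_dir (hypothetical helper, not called here)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for url in all_model_urls():
        file_name = url.rsplit("/", 1)[-1]
        urlretrieve(url, str(target / file_name))
```

Calling `download_all("flair-lms")` would fetch all 34 files into a local `flair-lms` directory, provided the server still hosts them.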

Hyperparameters:

| Parameter | Value |
| --- | ---: |
| hidden_size | 2048 |
| nlayers | 1 |
| sequence_length | 250 |
| mini_batch_size | 100 |
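
For reference, here is a rough sketch of how these settings could be passed to flair's language model trainer. It is not the exact script used for the models above: the three path arguments are hypothetical, `max_epochs` is set to 1 only because training ran for one epoch, and the class names follow the flair language model training tutorial, so they may differ in other flair versions.

```python
# Values copied from the hyperparameter table; max_epochs reflects "one epoch".
LM_HYPERPARAMETERS = {
    "hidden_size": 2048,
    "nlayers": 1,
    "sequence_length": 250,
    "mini_batch_size": 100,
    "max_epochs": 1,
}


def train_language_model(corpus_dir, dictionary_path, output_dir, forward=True):
    """Train one character-level language model with flair.

    corpus_dir, dictionary_path and output_dir are hypothetical paths;
    flair must be installed for this function to run.
    """
    # Imported lazily so the settings above can be inspected without flair.
    from flair.data import Dictionary
    from flair.models import LanguageModel
    from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

    hp = LM_HYPERPARAMETERS
    dictionary = Dictionary.load_from_file(dictionary_path)
    corpus = TextCorpus(corpus_dir, dictionary, forward, character_level=True)
    model = LanguageModel(dictionary, forward,
                          hidden_size=hp["hidden_size"],
                          nlayers=hp["nlayers"])
    trainer = LanguageModelTrainer(model, corpus)
    trainer.train(output_dir,
                  sequence_length=hp["sequence_length"],
                  mini_batch_size=hp["mini_batch_size"],
                  max_epochs=hp["max_epochs"])
```

A backward model would be trained the same way with `forward=False`.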

Instead of using common_chars, all characters (from the training corpus) are used as vocabulary for language model training.
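
To illustrate the idea, the sketch below collects every character that occurs in a corpus and assigns each one an index. It uses plain Python rather than flair's `Dictionary` class, and the `<unk>` entry and the frequency ordering are assumptions of this sketch, not a description of the vocabulary files actually used.

```python
from collections import Counter
from typing import Dict, Iterable


def build_char_vocabulary(lines: Iterable[str]) -> Dict[str, int]:
    """Map every character seen in `lines` to an index, most frequent first."""
    counts = Counter()
    for line in lines:
        # Updating a Counter with a string counts its individual characters.
        counts.update(line)

    # Index 0 is reserved for unknown characters (an assumption of this sketch).
    vocabulary = {"<unk>": 0}
    for char, _ in counts.most_common():
        vocabulary[char] = len(vocabulary)
    return vocabulary


# Tiny stand-in for a training corpus.
sample_corpus = ["abba", "cab"]
vocabulary = build_char_vocabulary(sample_corpus)
```

With flair, the same characters could be added to a `Dictionary` one by one via `add_item` and saved for use in language model training.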

PoS Tagging on Universal Dependencies (v1.2)

To test the new language models on a downstream task, I report PoS tagging results on Universal Dependencies (v1.2) and compare them with results from three earlier papers.

| Language (Code) | Yu et al. (2017) | Plank et al. (2016) | Yasunaga et al. (2017) | Flair | Δ vs. best prior |
| --- | ---: | ---: | ---: | ---: | ---: |
| Arabic (ar) | 99.00 | 98.91 | n.a. | 98.86 | -0.14 |
| Bulgarian (bg) | 98.20 | 98.23 | 98.53 | 99.18 | 0.65 |
| Czech (cs) | 98.79 | 98.24 | 98.81 | 99.14 | 0.33 |
| Danish (da) | 95.92 | 96.35 | 96.74 | 98.48 | 1.74🔥 |
| Basque (eu) | 94.94 | 95.51 | 94.71 | 97.30 | 1.79🔥 |
| Persian (fa) | 97.12 | 97.60 | 97.51 | 98.15 | 0.55 |
| Finnish (fi) | 95.31 | 95.85 | 95.40 | 98.11 | 2.26🔥 |
| Hebrew (he) | 96.04 | 96.96 | 97.43 | 97.67 | 0.24 |
| Hindi (hi) | 96.96 | 97.10 | 97.21 | 97.85 | 0.64 |
| Croatian (hr) | 95.05 | 96.82 | 96.32 | 97.43 | 0.61 |
| Indonesian (id) | 93.44 | 93.41 | 94.03 | 93.85 | -0.18 |
| Dutch (nl) | 93.11 | 93.82 | 93.09 | 94.03 | 0.21 |
| Norwegian (no) | 97.65 | 98.06 | 98.08 | 98.73 | 0.65 |
| Polish (pl) | 96.83 | 97.63 | 97.57 | 98.81 | 1.18🔥 |
| Slovenian (sl) | 97.16 | 96.97 | 98.11 | 99.02 | 0.91 |
| Swedish (sv) | 96.28 | 96.69 | 96.70 | 98.54 | 1.84🔥 |

Hyperparameters:

| Parameter | Value |
| --- | ---: |
| hidden_size | 512 |
| learning_rate | 0.1 |
| mini_batch_size | 8 |
| max_epochs | 500 |
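
As a rough illustration of how these settings fit into a flair tagging run, here is a sketch under several assumptions: the data is stored in a two-column format (token, universal PoS tag) in a hypothetical folder, only the forward and backward language models are stacked as embeddings, and the flair API is the one from releases around the time of this issue (`make_tag_dictionary` in particular has changed in later versions). It is not the exact evaluation script.

```python
# Values copied from the hyperparameter table above.
TAGGER_HYPERPARAMETERS = {
    "hidden_size": 512,
    "learning_rate": 0.1,
    "mini_batch_size": 8,
    "max_epochs": 500,
}


def train_pos_tagger(data_dir, forward_lm_path, backward_lm_path, output_dir):
    """Train a PoS tagger on top of one forward/backward language model pair.

    All four arguments are hypothetical paths; flair must be installed.
    """
    # Imported lazily so the settings above can be inspected without flair.
    from flair.datasets import ColumnCorpus
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    hp = TAGGER_HYPERPARAMETERS

    # Assumed layout: token in column 0, universal PoS tag in column 1.
    corpus = ColumnCorpus(data_dir, {0: "text", 1: "upos"})
    tag_dictionary = corpus.make_tag_dictionary(tag_type="upos")

    embeddings = StackedEmbeddings([
        FlairEmbeddings(forward_lm_path),
        FlairEmbeddings(backward_lm_path),
    ])

    tagger = SequenceTagger(hidden_size=hp["hidden_size"],
                            embeddings=embeddings,
                            tag_dictionary=tag_dictionary,
                            tag_type="upos")

    trainer = ModelTrainer(tagger, corpus)
    trainer.train(output_dir,
                  learning_rate=hp["learning_rate"],
                  mini_batch_size=hp["mini_batch_size"],
                  max_epochs=hp["max_epochs"])
```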

The results on Universal Dependencies (v1.2) set a new state of the art among the compared systems for every language except Arabic and Indonesian.
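
The Δ column can be reproduced from the other columns as the Flair score minus the best of the three earlier results. The short sketch below recomputes it from the numbers in the PoS tagging table; the variable and function names are made up for illustration.

```python
# Scores copied from the PoS tagging table, in the order
# (Yu et al. 2017, Plank et al. 2016, Yasunaga et al. 2017, Flair).
# None marks the "n.a." entry.
SCORES = {
    "ar": (99.00, 98.91, None, 98.86),
    "bg": (98.20, 98.23, 98.53, 99.18),
    "cs": (98.79, 98.24, 98.81, 99.14),
    "da": (95.92, 96.35, 96.74, 98.48),
    "eu": (94.94, 95.51, 94.71, 97.30),
    "fa": (97.12, 97.60, 97.51, 98.15),
    "fi": (95.31, 95.85, 95.40, 98.11),
    "he": (96.04, 96.96, 97.43, 97.67),
    "hi": (96.96, 97.10, 97.21, 97.85),
    "hr": (95.05, 96.82, 96.32, 97.43),
    "id": (93.44, 93.41, 94.03, 93.85),
    "nl": (93.11, 93.82, 93.09, 94.03),
    "no": (97.65, 98.06, 98.08, 98.73),
    "pl": (96.83, 97.63, 97.57, 98.81),
    "sl": (97.16, 96.97, 98.11, 99.02),
    "sv": (96.28, 96.69, 96.70, 98.54),
}


def delta(row):
    """Flair score minus the best earlier score, rounded to two decimals."""
    *previous, flair = row
    best_previous = max(score for score in previous if score is not None)
    return round(flair - best_previous, 2)


deltas = {language: delta(row) for language, row in SCORES.items()}

# Languages where Flair stays below the best earlier result.
below_previous_best = [language for language, d in deltas.items() if d < 0]
```

Here `below_previous_best` contains only `ar` and `id`, matching the two exceptions named above.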
