A spaCy model for Esperanto, trained to provide high-quality linguistic annotations, including:
- Part-of-Speech tagging (POS) via `tagger`
- Lemmatization using `trainable_lemmatizer`
- Morphological feature analysis via `morphologizer`
- Tokenization tuned for Esperanto morphology
This model can be used to simplify vocabulary tracking, generate frequency lists, and support various NLP tasks in Esperanto.
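As a rough illustration of the frequency-list use case, the sketch below counts lemmas over a sequence of tokens. It is not part of the project: the helper name `lemma_frequencies` and the decision to skip `PUNCT` and `SPACE` tokens are assumptions. It only relies on each token exposing `lemma_` and `pos_`, as spaCy tokens do, so stand-in objects are used here to keep it runnable without the model.

```python
from collections import Counter
from types import SimpleNamespace


def lemma_frequencies(tokens, skip_pos=("PUNCT", "SPACE")):
    """Count lemmas, ignoring tokens whose POS tag is in skip_pos."""
    counts = Counter()
    for token in tokens:
        if token.pos_ in skip_pos:
            continue
        counts[token.lemma_] += 1
    return counts


# Stand-in tokens (hypothetical data) so the sketch runs on its own.
# With the model installed, pass a Doc instead: lemma_frequencies(nlp(text))
sample = [
    SimpleNamespace(lemma_="ĝoji", pos_="VERB"),
    SimpleNamespace(lemma_=",", pos_="PUNCT"),
    SimpleNamespace(lemma_="ĉar", pos_="SCONJ"),
    SimpleNamespace(lemma_="ĝoji", pos_="VERB"),
]

freqs = lemma_frequencies(sample)
for lemma, count in freqs.most_common():
    print(lemma, count)
```

Counting lemmas rather than surface forms means that inflected forms such as `vizaĝon` and `vizaĝoj` are tallied under the same entry.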
Male, lian vizaĝon ekfendis granda rideto, kaj li diris per voĉo tiel pepeca, ke ĝi rigardigis la preterpasantojn, “Ne ĉagreniĝu, kara sinjoro, ĉar nenio povus min ĝeni hodiaŭ! Ĝoju, ĉar Vi-Scias-Kiu finfine malaperis! Eĉ mogloj kiel vi devus festi en ĉi tiu feliĉiga, feliĉega tago!”
Tokens
| Form | Lemma | POS | Morph |
|---|---|---|---|
| Male | male | ADV | — |
| , | , | PUNCT | — |
| lian | lia | DET | Case=Acc |
| vizaĝon | vizaĝo | NOUN | Case=Acc • Number=Sing |
| ekfendis | ekfendi | VERB | Mood=Ind • Tense=Past |
| granda | granda | ADJ | Case=Nom • Number=Sing |
| rideto | rideto | NOUN | Case=Nom • Number=Sing |
| , | , | PUNCT | — |
| kaj | kaj | CCONJ | — |
| li | li | PRON | Case=Nom • Number=Sing |
| diris | diri | VERB | Mood=Ind • Tense=Past |
| per | per | ADP | — |
| voĉo | voĉo | NOUN | Case=Nom • Number=Sing |
| tiel | tiel | ADV | — |
| pepeca | pepeca | ADJ | Case=Nom • Number=Sing |
| , | , | PUNCT | — |
| ke | ke | SCONJ | — |
| ĝi | ĝi | PRON | Case=Nom • Number=Sing |
| rigardigis | rigardigi | VERB | Mood=Ind • Tense=Past |
| la | la | DET | — |
| preterpasantojn | preterpasanto | NOUN | Case=Acc • Number=Plur |
| , | , | PUNCT | — |
| “ | “ | PUNCT | — |
| Ne | ne | PART | — |
| ĉagreniĝu | ĉagreniĝi | VERB | Mood=Imp |
| , | , | PUNCT | — |
| kara | kara | ADJ | Case=Nom • Number=Sing |
| sinjoro | sinjoro | NOUN | Case=Nom • Number=Sing |
| , | , | PUNCT | — |
| ĉar | ĉar | SCONJ | — |
| nenio | nenio | PRON | Case=Nom • Number=Sing |
| povus | povi | VERB | Mood=Cnd |
| min | mi | PRON | Case=Acc • Number=Sing |
| ĝeni | ĝeni | VERB | — |
| hodiaŭ | hodiaŭ | ADV | — |
| ! | ! | PUNCT | — |
| Ĝoju | ĝoji | VERB | Mood=Imp |
| , | , | PUNCT | — |
| ĉar | ĉar | SCONJ | — |
| Vi | Vi | PROPN | — |
| - | - | PROPN | — |
| Scias | scii | PROPN | — |
| - | - | PROPN | — |
| Kiu | kiu | PROPN | — |
| finfine | finfine | ADV | — |
| malaperis | malaperi | VERB | Mood=Ind • Tense=Past |
| ! | ! | PUNCT | — |
| Eĉ | eĉ | ADV | — |
| mogloj | moglo | NOUN | Case=Nom • Number=Plur |
| kiel | kiel | SCONJ | — |
| vi | vi | PRON | Case=Nom • Number=Sing |
| devus | devi | VERB | Mood=Cnd |
| festi | festi | VERB | — |
| en | en | ADP | — |
| ĉi | ĉi | PART | — |
| tiu | tiu | DET | — |
| feliĉiga | feliĉiga | ADJ | Case=Nom • Number=Sing |
| , | , | PUNCT | — |
| feliĉega | feliĉega | ADJ | Case=Nom • Number=Sing |
| tago | tago | NOUN | Case=Nom • Number=Sing |
| ! | ! | PUNCT | — |
| ” | ” | PUNCT | — |
- Grab `xx_eo_seth-1.0.0.tar.gz` from the Releases page.
- Run `pip install xx_eo_seth-1.0.0.tar.gz`.
You can now load the model within your project:
```python
import spacy

nlp = spacy.load("xx_eo_seth")
text = "Do ni devas kunporti buterpanojn, Mumintrolo diris."
doc = nlp(text)
print(text)
for token in doc:
    print(
        token.text,  # surface form
        token.lemma_,  # lemma
        token.pos_,  # POS
        token.morph,  # morphological features
    )
```

The Esperanto spaCy model was evaluated on held-out data from the training corpus.
Accuracy on this held-out data is high for every pipeline component:
| Component | Accuracy |
|---|---|
| Part-of-Speech (POS) 🏷️ | 99.57% |
| Lemmatization 📝 | 99.36% |
| Morphology 🔍 | 99.47% |
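The source does not spell out how these figures are computed. Assuming they are token-level accuracies (the share of tokens whose predicted label matches the reference annotation), the calculation can be sketched as follows. The function name `token_accuracy` and the sample labels are made up for illustration.

```python
def token_accuracy(predicted, gold):
    """Return the fraction of positions where predicted and gold labels agree."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical labels: four of the five predictions match the reference.
gold_pos = ["ADV", "PUNCT", "DET", "NOUN", "VERB"]
pred_pos = ["ADV", "PUNCT", "DET", "NOUN", "NOUN"]

acc = token_accuracy(pred_pos, gold_pos)
print(f"{acc:.2%}")  # prints 80.00%
```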
The model was trained with ~20,000 high-confidence sentences from the works below.
Each sentence was automatically annotated using a rule-based morphological analyzer and deinflector.
Wiktextract data was used to handle irregular words and ensure only verified verbs were used during training.
| Book Name | Author | Translator(s) |
|---|---|---|
| The Magician's Hat | Tove Jansson | Sten Johansson |
| Comet in Moominvalley | Tove Jansson | Sten Johansson |
| Moominpappa at Sea | Tove Jansson | Sten Johansson |
| Moominvalley in November | Tove Jansson | Sten Johansson |
| The Exploits of Moominpappa | Tove Jansson | Sten Johansson |
| Moominsummer Madness | Tove Jansson | Sten Johansson |
| Moominvalley Midwinter | Tove Jansson | Sten Johansson |
| Harry Potter and the Sorcerer’s Stone | J. K. Rowling | George Baker, Don Harlow |
| The Wonderful Wizard of Oz | L. Frank Baum | Donald Broadribb |
| The Hobbit | J. R. R. Tolkien | William Auld, Christopher Gledhill |
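To give a feel for what rule-based annotation of Esperanto can look like, here is a toy deinflector. It is not the analyzer used to build this model, only a minimal sketch under stated assumptions: the names `VERB_ENDINGS` and `analyze` are hypothetical, `known_verbs` stands in for a verified verb list such as the one derived from Wiktextract, and features are joined with `|` as in CoNLL-U.

```python
# Regular verb endings mapped to features; longer endings are listed first.
VERB_ENDINGS = {
    "is": "Mood=Ind|Tense=Past",
    "as": "Mood=Ind|Tense=Pres",
    "os": "Mood=Ind|Tense=Fut",
    "us": "Mood=Cnd",
    "u": "Mood=Imp",
    "i": "",  # infinitive: no features, matching the token table above
}


def analyze(word, known_verbs):
    """Return (lemma, pos, feats) for a regular word, or None if no rule applies."""
    form = word.lower()

    # Nouns and adjectives: stem + o/a, optional plural -j, optional accusative -n.
    case, number, nominal = "Nom", "Sing", form
    if nominal.endswith("n"):
        case, nominal = "Acc", nominal[:-1]
    if nominal.endswith("j"):
        number, nominal = "Plur", nominal[:-1]
    if len(nominal) > 2 and nominal[-1] in "oa":
        pos = "NOUN" if nominal[-1] == "o" else "ADJ"
        return nominal, pos, f"Case={case}|Number={number}"

    # Verbs: accept an ending only if the resulting infinitive is a verified verb.
    for ending, feats in VERB_ENDINGS.items():
        if form.endswith(ending):
            lemma = form[: -len(ending)] + "i"
            if lemma in known_verbs:
                return lemma, "VERB", feats
    return None


verbs = {"ĝoji", "povi", "malaperi"}
for word in ["vizaĝon", "preterpasantojn", "Ĝoju", "povus", "hodiaŭ"]:
    print(word, analyze(word, verbs))
```

A sketch like this mislabels function words that happen to end in `-o` or `-a` (for example `lia` or `nenio`), which is why a real pipeline needs an exception list for irregular words.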
- The model rarely encountered acronyms, all-caps text, or slangy contractions, so its predictions on such input may be less reliable.
- Dependency relations (`dep_`) are not supported.
Note
This section is only necessary if you want to build the model yourself. You can read the Installation section if you just want to install it.
The project is split into 4 steps:
- Creating the corpus.
  - You should convert EPUB/MOBI/FB2 books over to a basic `.txt` file.
  - I manually split the text files by sentence using a simple replace function within my code editor, but you can also just make a script to do this.
  - Throw those text files in the `texts` folder.
  - Run `node index.ts`, and you should see `corpus.json` generated.
  - Look inside to ensure you have an array of sentences.
  - Throw the `corpus.json` file into the `step-3` folder to be used later.
- Setting up JSON files from kaikki-to-yomitan `eo-en` data to handle irregular terms and verbs.
  - Grab the `eo-en` dictionary from the kaikki-to-yomitan download page.
  - Extract it to its own folder, throw it in the `step-2` folder.
  - Run `node index.ts`, and you should see `irregular-words.json`, `plural-pronouns.json`, and `verbs.json` generated.
  - Take those three new JSON files, and throw them in the `step-3` folder.
- Creating the `.conllu` asset files (these are the files spaCy expects to contain the training data).
  - Run `pnpm i` or whatever your Node.js package manager is.
  - Run `node index.ts`, and you should see `eo_seth-train.conllu`, `eo_seth-dev.conllu`, `eo_seth-test.conllu` generated.
  - You can take a look inside those and realize they're just specially formatted text files with your training data for spaCy to work with.
  - Make an `eo_seth` folder within the `assets` folder inside the `step-4` folder.
  - Throw the `.conllu` files into the `assets/eo_seth` folder.
- Creating the model.
  - Run `python -m weasel run preprocess`.
  - Run `python -m weasel run train`.
  - Run `python -m weasel run package`.
  - Go to the `packages/xx_eo_seth-1.0.0/dist/` folder.
  - Run `pip install xx_eo_seth-1.0.0.tar.gz`.

  You can now load the model within your project:

  ```python
  import spacy

  nlp = spacy.load("xx_eo_seth")
  text = "Do ni devas kunporti buterpanojn, Mumintrolo diris."
  doc = nlp(text)
  print(text)
  for token in doc:
      print(
          token.text,  # surface form
          token.lemma_,  # lemma
          token.pos_,  # POS
          token.morph,  # morphological features
      )
  ```

  Note: The configuration for the build can be found in `project.yml`.
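For readers who have not seen a `.conllu` file before, the sketch below writes one sentence in the ten-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). It is an illustration rather than the project's generator: the function name `to_conllu` and the sample sentence are assumptions. Because the model does not cover dependencies, the sketch leaves HEAD and DEPREL as `_`; the project's own files may fill those columns differently.

```python
def to_conllu(sentence_id, text, tokens):
    """Format one sentence; tokens is a list of (form, lemma, upos, feats) tuples."""
    lines = [f"# sent_id = {sentence_id}", f"# text = {text}"]
    for index, (form, lemma, upos, feats) in enumerate(tokens, start=1):
        columns = [
            str(index),    # ID: 1-based position in the sentence
            form,          # FORM: surface form
            lemma,         # LEMMA
            upos,          # UPOS: universal POS tag
            "_",           # XPOS: unused here
            feats or "_",  # FEATS: e.g. Case=Acc|Number=Sing
            "_",           # HEAD: left unspecified in this sketch
            "_",           # DEPREL: left unspecified in this sketch
            "_",           # DEPS: unused here
            "_",           # MISC: unused here
        ]
        lines.append("\t".join(columns))
    # Sentences are separated by a blank line.
    return "\n".join(lines) + "\n\n"


block = to_conllu(
    "demo-1",
    "Ĝoju!",
    [("Ĝoju", "ĝoji", "VERB", "Mood=Imp"), ("!", "!", "PUNCT", "")],
)
print(block, end="")
```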
- spaCy: for its extensible NLP pipeline and training framework.
- wiktextract: for providing lexical data that helped validate irregular forms and verbs.
- The translators and authors of the training texts.
- Masanori Oya and Dan Zeman: for their Prago `.conllu` file, which helped me figure out the format.