seth-js/spacy-eo

A spaCy model for Esperanto, trained to provide high-quality linguistic annotations, including:

  • Part-of-Speech tagging (POS) via tagger
  • Lemmatization using trainable_lemmatizer
  • Morphological feature analysis via morphologizer
  • Tokenization tuned for Esperanto morphology

This model can be used to simplify vocabulary tracking, generate frequency lists, and support various NLP tasks in Esperanto.
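
As a rough illustration of the frequency-list use case, the sketch below counts lemmas while skipping punctuation. The function name `lemma_frequencies` and the set of skipped POS tags are illustrative choices, not part of the model. The stand-in tokens are copied from the Tokens table in the Example section so the snippet runs without the model installed.

```python
from collections import Counter
from types import SimpleNamespace

# POS tags left out of the vocabulary list (an illustrative choice).
SKIPPED_POS = {"PUNCT", "SPACE"}


def lemma_frequencies(tokens):
    """Count lemmas over any iterable of objects exposing .lemma_ and .pos_."""
    counts = Counter()
    for token in tokens:
        if token.pos_ not in SKIPPED_POS:
            counts[token.lemma_.lower()] += 1
    return counts


# Stand-in tokens (lemma and POS taken from the Tokens table), used here
# instead of a real spaCy Doc so the sketch has no dependencies.
sample = [
    SimpleNamespace(lemma_=",", pos_="PUNCT"),
    SimpleNamespace(lemma_="ĉar", pos_="SCONJ"),
    SimpleNamespace(lemma_="nenio", pos_="PRON"),
    SimpleNamespace(lemma_="ĝoji", pos_="VERB"),
    SimpleNamespace(lemma_=",", pos_="PUNCT"),
    SimpleNamespace(lemma_="ĉar", pos_="SCONJ"),
]

for lemma, count in lemma_frequencies(sample).most_common():
    print(lemma, count)
```

With the model installed, passing `nlp(text)` in place of `sample` should work the same way, since spaCy tokens expose `lemma_` and `pos_`.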

🧙‍♂️ Example

Male, lian vizaĝon ekfendis granda rideto, kaj li diris per voĉo tiel pepeca, ke ĝi rigardigis la preterpasantojn, “Ne ĉagreniĝu, kara sinjoro, ĉar nenio povus min ĝeni hodiaŭ! Ĝoju, ĉar Vi-Scias-Kiu finfine malaperis! Eĉ mogloj kiel vi devus festi en ĉi tiu feliĉiga, feliĉega tago!”

Tokens
Form Lemma POS Morph
Male male ADV
, , PUNCT
lian lia DET Case=Acc
vizaĝon vizaĝo NOUN Case=Acc • Number=Sing
ekfendis ekfendi VERB Mood=Ind • Tense=Past
granda granda ADJ Case=Nom • Number=Sing
rideto rideto NOUN Case=Nom • Number=Sing
, , PUNCT
kaj kaj CCONJ
li li PRON Case=Nom • Number=Sing
diris diri VERB Mood=Ind • Tense=Past
per per ADP
voĉo voĉo NOUN Case=Nom • Number=Sing
tiel tiel ADV
pepeca pepeca ADJ Case=Nom • Number=Sing
, , PUNCT
ke ke SCONJ
ĝi ĝi PRON Case=Nom • Number=Sing
rigardigis rigardigi VERB Mood=Ind • Tense=Past
la la DET
preterpasantojn preterpasanto NOUN Case=Acc • Number=Plur
, , PUNCT
“ “ PUNCT
Ne ne PART
ĉagreniĝu ĉagreniĝi VERB Mood=Imp
, , PUNCT
kara kara ADJ Case=Nom • Number=Sing
sinjoro sinjoro NOUN Case=Nom • Number=Sing
, , PUNCT
ĉar ĉar SCONJ
nenio nenio PRON Case=Nom • Number=Sing
povus povi VERB Mood=Cnd
min mi PRON Case=Acc • Number=Sing
ĝeni ĝeni VERB
hodiaŭ hodiaŭ ADV
! ! PUNCT
Ĝoju ĝoji VERB Mood=Imp
, , PUNCT
ĉar ĉar SCONJ
Vi Vi PROPN
- - PROPN
Scias scii PROPN
- - PROPN
Kiu kiu PROPN
finfine finfine ADV
malaperis malaperi VERB Mood=Ind • Tense=Past
! ! PUNCT
Eĉ eĉ ADV
mogloj moglo NOUN Case=Nom • Number=Plur
kiel kiel SCONJ
vi vi PRON Case=Nom • Number=Sing
devus devi VERB Mood=Cnd
festi festi VERB
en en ADP
ĉi ĉi PART
tiu tiu DET
feliĉiga feliĉiga ADJ Case=Nom • Number=Sing
, , PUNCT
feliĉega feliĉega ADJ Case=Nom • Number=Sing
tago tago NOUN Case=Nom • Number=Sing
! ! PUNCT
” ” PUNCT

📦 Installation

  1. Grab xx_eo_seth-1.0.0.tar.gz from the Releases page.
  2. Run pip install xx_eo_seth-1.0.0.tar.gz.

You can now load the model within your project:

import spacy

nlp = spacy.load("xx_eo_seth")

text = "Do ni devas kunporti buterpanojn, Mumintrolo diris."
doc = nlp(text)

print(text)

for token in doc:
    print(
        token.text,  # surface form
        token.lemma_,  # lemma
        token.pos_,  # POS
        token.morph,  # morphological features
    )
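
To print rows that look like the Tokens table in the Example section, the morphological features can be reformatted before printing. This is a small sketch, assuming that `str(token.morph)` yields `|`-separated `Name=Value` pairs (or an empty string when there are no features); `format_row` is a hypothetical helper, not part of spaCy or of this model.

```python
def format_row(text, lemma, pos, morph):
    """Build one table row; `morph` may be a token.morph object or a string."""
    feats = str(morph)
    # Turn "Case=Acc|Number=Sing" into "Case=Acc • Number=Sing".
    pretty = " • ".join(feats.split("|")) if feats else ""
    return " ".join(part for part in (text, lemma, pos, pretty) if part)


# Plain strings stand in for token attributes so this runs without the model.
print(format_row("vizaĝon", "vizaĝo", "NOUN", "Case=Acc|Number=Sing"))
print(format_row("kaj", "kaj", "CCONJ", ""))
```

Inside the loop above, this would be called as `format_row(token.text, token.lemma_, token.pos_, token.morph)`.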

🎯 Accuracy

The Esperanto spaCy model was evaluated on held-out data from the training corpus.

All three components score above 99% accuracy on that held-out data:

| Component | Accuracy |
| --- | --- |
| Part-of-Speech (POS) 🏷️ | 99.57% |
| Lemmatization 📝 | 99.36% |
| Morphology 🔍 | 99.47% |
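
The README does not say exactly how these figures were computed. They are presumably per-token accuracies: the share of held-out tokens whose predicted label matches the reference label. A minimal sketch of that calculation, with a hypothetical `token_accuracy` helper and made-up toy labels:

```python
def token_accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data for illustration only; these are not the model's real predictions.
gold_pos = ["ADV", "PUNCT", "DET", "NOUN"]
predicted_pos = ["ADV", "PUNCT", "DET", "ADJ"]

print(f"{token_accuracy(gold_pos, predicted_pos):.2%}")  # prints 75.00%
```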

📚 Training Corpus

The model was trained on ~20,000 high-confidence sentences drawn from the works listed below.

Each sentence was automatically annotated using a rule-based morphological analyzer and deinflector.

Wiktextract data was used to handle irregular words and ensure only verified verbs were used during training.
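
To give a feel for what a rule-based analyzer and deinflector can look like for Esperanto, here is a heavily simplified sketch. It is not the project's actual analyzer, and the names `analyze`, `IRREGULAR`, and `known_verbs` are hypothetical. It relies only on Esperanto's regular endings (-o, -a, -e, plural -j, accusative -n, and the verb endings), plus a small lexicon for words that the suffix rules would misread.

```python
# Verb endings mapped to UD-style features; "i" marks the infinitive.
VERB_ENDINGS = [
    ("as", {"Mood": "Ind", "Tense": "Pres"}),
    ("is", {"Mood": "Ind", "Tense": "Past"}),
    ("os", {"Mood": "Ind", "Tense": "Fut"}),
    ("us", {"Mood": "Cnd"}),
    ("u", {"Mood": "Imp"}),
    ("i", {}),
]

# Hypothetical mini-lexicon of words that the suffix rules would misread.
IRREGULAR = {
    "kaj": ("kaj", "CCONJ", {}),
    "tiu": ("tiu", "DET", {}),
    "nenio": ("nenio", "PRON", {"Case": "Nom", "Number": "Sing"}),
}


def analyze(word, irregular=IRREGULAR, known_verbs=None):
    """Return (lemma, pos, feats), or None when no rule applies confidently."""
    lower = word.lower()
    if lower in irregular:
        return irregular[lower]

    # Nominals: strip accusative -n, then plural -j, then check -o / -a.
    stem, feats = lower, {"Case": "Nom", "Number": "Sing"}
    if stem.endswith("n"):
        stem = stem[:-1]
        feats["Case"] = "Acc"
    if stem.endswith("j"):
        stem = stem[:-1]
        feats["Number"] = "Plur"
    if len(stem) > 2 and stem[-1] in "oa":
        return stem, ("NOUN" if stem[-1] == "o" else "ADJ"), feats

    # Verbs: swap the ending for the infinitive -i.
    for ending, verb_feats in VERB_ENDINGS:
        if len(lower) > len(ending) + 1 and lower.endswith(ending):
            lemma = lower[: -len(ending)] + "i"
            if known_verbs is not None and lemma not in known_verbs:
                return None  # unverified verb: treat as low confidence
            return lemma, "VERB", dict(verb_feats)

    if len(lower) > 2 and lower.endswith("e"):
        return lower, "ADV", {}
    return None


for word in ["preterpasantojn", "ekfendis", "granda", "tiu"]:
    print(word, analyze(word))
```

Without the lexicon entry, `tiu` would be read as the imperative of a non-existent verb `tii`. That kind of misreading is why a list of irregular words and a list of verified verbs are useful alongside the suffix rules.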

| Book Name | Author | Translator(s) |
| --- | --- | --- |
| The Magician's Hat | Tove Jansson | Sten Johansson |
| Comet in Moominvalley | Tove Jansson | Sten Johansson |
| Moominpappa at Sea | Tove Jansson | Sten Johansson |
| Moominvalley in November | Tove Jansson | Sten Johansson |
| The Exploits of Moominpappa | Tove Jansson | Sten Johansson |
| Moominsummer Madness | Tove Jansson | Sten Johansson |
| Moominvalley Midwinter | Tove Jansson | Sten Johansson |
| Harry Potter and the Sorcerer’s Stone | J. K. Rowling | George Baker, Don Harlow |
| The Wonderful Wizard of Oz | L. Frank Baum | Donald Broadribb |
| The Hobbit | J. R. R. Tolkien | William Auld, Christopher Gledhill |

⚠️ Limitations

  • The model rarely encountered acronyms, all-caps text, or slangy contractions, so its predictions on such input may be less reliable.
  • Dependency relations (dep_) are not supported.

🤖 Building the Model

Note

This section is only necessary if you want to build the model yourself. If you just want to install it, see the Installation section.

The project is split into 4 steps:

  1. Creating the corpus.

    • Convert your EPUB/MOBI/FB2 books to plain .txt files.
    • I manually split the text files into sentences using find-and-replace in my code editor, but a short script works just as well.
    • Throw those text files in the texts folder.
    • Run node index.ts, and you should see corpus.json generated.
    • Look inside to ensure you have an array of sentences.
    • Throw the corpus.json file into the step-3 folder to be used later.
  2. Setting up JSON files from kaikki-to-yomitan eo-en data to handle irregular terms and verbs.

    • Grab the eo-en dictionary from the kaikki-to-yomitan download page.
    • Extract it to its own folder, then throw that folder in the step-2 folder.
    • Run node index.ts, and you should see irregular-words.json, plural-pronouns.json, and verbs.json generated.
    • Take those three new JSON files, and throw them in the step-3 folder.
  3. Creating the .conllu asset files (these are the files spaCy expects to contain the training data).

    • Run pnpm i (or the equivalent install command for your Node.js package manager).
    • Run node index.ts, and you should see eo_seth-train.conllu, eo_seth-dev.conllu, and eo_seth-test.conllu generated.
    • If you look inside those, you'll see they're just specially formatted text files holding the training data for spaCy to work with.
    • Make an eo_seth folder within the assets folder inside the step-4 folder.
    • Throw the .conllu files into the assets/eo_seth folder.
  4. Creating the model.

    • Run python -m weasel run preprocess.

    • Run python -m weasel run train.

    • Run python -m weasel run package.

    • Go to the packages/xx_eo_seth-1.0.0/dist/ folder.

    • Run pip install xx_eo_seth-1.0.0.tar.gz.

      You can now load the model within your project, exactly as shown in the Installation section.

      Note: The configuration for the build can be found in project.yml.
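
For readers curious about the .conllu files from step 3, the sketch below writes one annotated sentence in CoNLL-U form: a `# text = ...` comment, then one line per token with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), then a blank line. It is not the project's generator; `to_conllu` is a hypothetical helper, and leaving XPOS and the dependency columns as `_` is an assumption made here because the model does not predict dependencies.

```python
def to_conllu(sentence_text, tokens):
    """Serialize (form, lemma, upos, feats) tuples as one CoNLL-U sentence."""
    lines = [f"# text = {sentence_text}"]
    for index, (form, lemma, upos, feats) in enumerate(tokens, start=1):
        # FEATS is "Name=Value" pairs sorted by name and joined with "|",
        # or "_" when the token has no features.
        feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
        columns = [str(index), form, lemma, upos, "_", feats_str, "_", "_", "_", "_"]
        lines.append("\t".join(columns))
    # A blank line terminates each sentence.
    return "\n".join(lines) + "\n\n"


# Toy fragment; the annotations are copied from the Tokens table in the Example section.
example = to_conllu(
    "li diris",
    [
        ("li", "li", "PRON", {"Case": "Nom", "Number": "Sing"}),
        ("diris", "diri", "VERB", {"Mood": "Ind", "Tense": "Past"}),
    ],
)
print(example, end="")
```

Comparing this output with a few lines of the generated eo_seth-train.conllu is a quick way to see which columns the real build fills in.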

🤝 Acknowledgments

  • spaCy: for its extensible NLP pipeline and training framework.
  • wiktextract: for providing lexical data that helped validate irregular forms and verbs.
  • The translators and authors of the training texts.
  • Masanori Oya and Dan Zeman: for their Prago .conllu file, which helped me figure out the format.
