seth-js/spacy-eo

A spaCy model for Esperanto, trained to provide high-quality linguistic annotations, including:

Part-of-Speech tagging (POS) via tagger
Lemmatization using trainable_lemmatizer
Morphological feature analysis via morphologizer
Tokenization tuned for Esperanto morphology

This model can be used to simplify vocabulary tracking, generate frequency lists, and support various NLP tasks in Esperanto.

🧙‍♂️ Example

Male, lian vizaĝon ekfendis granda rideto, kaj li diris per voĉo tiel pepeca, ke ĝi rigardigis la preterpasantojn, “Ne ĉagreniĝu, kara sinjoro, ĉar nenio povus min ĝeni hodiaŭ! Ĝoju, ĉar Vi-Scias-Kiu finfine malaperis! Eĉ mogloj kiel vi devus festi en ĉi tiu feliĉiga, feliĉega tago!”

Tokens

Form	Lemma	POS	Morph
Male	male	ADV	—
,	,	PUNCT	—
lian	lia	DET	Case=Acc
vizaĝon	vizaĝo	NOUN	Case=Acc • Number=Sing
ekfendis	ekfendi	VERB	Mood=Ind • Tense=Past
granda	granda	ADJ	Case=Nom • Number=Sing
rideto	rideto	NOUN	Case=Nom • Number=Sing
,	,	PUNCT	—
kaj	kaj	CCONJ	—
li	li	PRON	Case=Nom • Number=Sing
diris	diri	VERB	Mood=Ind • Tense=Past
per	per	ADP	—
voĉo	voĉo	NOUN	Case=Nom • Number=Sing
tiel	tiel	ADV	—
pepeca	pepeca	ADJ	Case=Nom • Number=Sing
,	,	PUNCT	—
ke	ke	SCONJ	—
ĝi	ĝi	PRON	Case=Nom • Number=Sing
rigardigis	rigardigi	VERB	Mood=Ind • Tense=Past
la	la	DET	—
preterpasantojn	preterpasanto	NOUN	Case=Acc • Number=Plur
,	,	PUNCT	—
“	“	PUNCT	—
Ne	ne	PART	—
ĉagreniĝu	ĉagreniĝi	VERB	Mood=Imp
,	,	PUNCT	—
kara	kara	ADJ	Case=Nom • Number=Sing
sinjoro	sinjoro	NOUN	Case=Nom • Number=Sing
,	,	PUNCT	—
ĉar	ĉar	SCONJ	—
nenio	nenio	PRON	Case=Nom • Number=Sing
povus	povi	VERB	Mood=Cnd
min	mi	PRON	Case=Acc • Number=Sing
ĝeni	ĝeni	VERB	—
hodiaŭ	hodiaŭ	ADV	—
!	!	PUNCT	—
Ĝoju	ĝoji	VERB	Mood=Imp
,	,	PUNCT	—
ĉar	ĉar	SCONJ	—
Vi	Vi	PROPN	—
-	-	PROPN	—
Scias	scii	PROPN	—
-	-	PROPN	—
Kiu	kiu	PROPN	—
finfine	finfine	ADV	—
malaperis	malaperi	VERB	Mood=Ind • Tense=Past
!	!	PUNCT	—
Eĉ	eĉ	ADV	—
mogloj	moglo	NOUN	Case=Nom • Number=Plur
kiel	kiel	SCONJ	—
vi	vi	PRON	Case=Nom • Number=Sing
devus	devi	VERB	Mood=Cnd
festi	festi	VERB	—
en	en	ADP	—
ĉi	ĉi	PART	—
tiu	tiu	DET	—
feliĉiga	feliĉiga	ADJ	Case=Nom • Number=Sing
,	,	PUNCT	—
feliĉega	feliĉega	ADJ	Case=Nom • Number=Sing
tago	tago	NOUN	Case=Nom • Number=Sing
!	!	PUNCT	—
”	”	PUNCT	—
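The Morph column above is what spaCy exposes as `token.morph`. Its string form, `str(token.morph)`, uses the Universal Dependencies FEATS notation: `Feature=Value` pairs joined with `|` (the table shows them with `•` for readability). A minimal sketch for turning such a string into a dictionary; `parse_feats` is a hypothetical helper name, not part of this project or of spaCy:

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Acc|Number=Sing' into {'Case': 'Acc', 'Number': 'Sing'}."""
    # An empty string or "_" (the CoNLL-U placeholder) means "no features".
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))
```

With the model loaded, `parse_feats(str(token.morph))` would give a plain dictionary for each token. spaCy also offers `token.morph.get("Case")`, which returns a list of values.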

📦 Installation

Grab xx_eo_seth-1.0.0.tar.gz from the Releases page.
Run pip install xx_eo_seth-1.0.0.tar.gz.

You can now load the model within your project:

import spacy

nlp = spacy.load("xx_eo_seth")

text = "Do ni devas kunporti buterpanojn, Mumintrolo diris."
doc = nlp(text)

print(text)

for token in doc:
    print(
        token.text,  # surface form
        token.lemma_,  # lemma
        token.pos_,  # POS
        token.morph,  # morphological features
    )
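As a sketch of the frequency-list use case mentioned earlier, lemmas can be counted with `collections.Counter`. The function name `lemma_frequencies` and the choice to skip `PUNCT` and `SPACE` tokens are assumptions for illustration. The function only relies on the `lemma_` and `pos_` attributes, so it accepts a spaCy `Doc` or any iterable of token-like objects:

```python
from collections import Counter
from typing import Iterable, Tuple


def lemma_frequencies(
    tokens: Iterable, skip_pos: Tuple[str, ...] = ("PUNCT", "SPACE")
) -> Counter:
    """Count lemmas, ignoring tokens whose POS tag is in skip_pos."""
    return Counter(
        token.lemma_ for token in tokens if token.pos_ not in skip_pos
    )
```

With the model installed, `lemma_frequencies(nlp(text)).most_common(20)` would list the twenty most frequent lemmas in `text`.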

🎯 Accuracy

The Esperanto spaCy model was evaluated on held-out data from the training corpus.

All three components score above 99% accuracy on this held-out set:

Component	Accuracy
Part-of-Speech (POS) 🏷️	99.57%
Lemmatization 📝	99.36%
Morphology 🔍	99.47%
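The evaluation script isn't shown here. Per-token accuracy figures like these are usually the share of tokens whose predicted label matches the gold label, which can be sketched as follows (`token_accuracy` is a hypothetical helper, not the project's evaluation code):

```python
from typing import Sequence


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)
```

The same function applies to POS tags, lemmas, or FEATS strings, as long as both sequences are aligned token by token.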

📚 Training Corpus

The model was trained on ~20,000 high-confidence sentences from the works below.

Each sentence was automatically annotated using a rule-based morphological analyzer and deinflector.

Wiktextract data was used to handle irregular words and ensure only verified verbs were used during training.
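To illustrate why rule-based annotation works well for Esperanto, here is a minimal deinflection sketch based on the language's regular endings (-o noun, -a adjective, -j plural, -n accusative, and the verb endings -as/-is/-os/-us/-u/-i). It is not the project's analyzer: the names `Analysis`, `IRREGULAR`, and `deinflect` are hypothetical, and the tiny `IRREGULAR` set only stands in for the Wiktextract-derived word lists.

```python
from typing import NamedTuple, Optional


class Analysis(NamedTuple):
    lemma: str
    pos: str
    morph: str  # FEATS string, empty when there are no features


# Hypothetical stand-in for the irregular word list: function words whose
# final letters look like regular endings but are not (e.g. "tiu", "la").
IRREGULAR = {"la", "kaj", "li", "tiu", "ĉi", "ne", "ke"}

VERB_ENDINGS = {
    "as": "Mood=Ind|Tense=Pres",
    "is": "Mood=Ind|Tense=Past",
    "os": "Mood=Ind|Tense=Fut",
    "us": "Mood=Cnd",
    "u": "Mood=Imp",
    "i": "",  # infinitive
}


def deinflect(word: str) -> Optional[Analysis]:
    """Return a rule-based analysis, or None if no regular rule applies."""
    w = word.lower()
    if w in IRREGULAR:
        return None

    # Nouns and adjectives: strip accusative -n, then plural -j.
    stem, case, number = w, "Nom", "Sing"
    if stem.endswith("n"):
        stem, case = stem[:-1], "Acc"
    if stem.endswith("j"):
        stem, number = stem[:-1], "Plur"
    if len(stem) > 2 and stem[-1] in "oa":
        pos = "NOUN" if stem.endswith("o") else "ADJ"
        return Analysis(stem, pos, f"Case={case}|Number={number}")

    # Verbs: replace the tense/mood ending with the infinitive -i.
    for ending, feats in VERB_ENDINGS.items():
        if w.endswith(ending) and len(w) > len(ending) + 1:
            return Analysis(w[: -len(ending)] + "i", "VERB", feats)
    return None
```

For example, `deinflect("preterpasantojn")` yields the lemma `preterpasanto` with `Case=Acc|Number=Plur`, matching the example table. Without the `IRREGULAR` check, a word such as `tiu` would be misread as an imperative verb, which is the kind of error the irregular word data is there to prevent.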

Book Name	Author	Translator(s)
The Magician's Hat	Tove Jansson	Sten Johansson
Comet in Moominvalley	Tove Jansson	Sten Johansson
Moominpappa at Sea	Tove Jansson	Sten Johansson
Moominvalley in November	Tove Jansson	Sten Johansson
The Exploits of Moominpappa	Tove Jansson	Sten Johansson
Moominsummer Madness	Tove Jansson	Sten Johansson
Moominvalley Midwinter	Tove Jansson	Sten Johansson
Harry Potter and the Sorcerer’s Stone	J. K. Rowling	George Baker, Don Harlow
The Wonderful Wizard of Oz	L. Frank Baum	Donald Broadribb
The Hobbit	J. R. R. Tolkien	William Auld, Christopher Gledhill

⚠️ Limitations

The model rarely encountered acronyms, all-caps text, or slangy contractions, so its predictions on such input may be less reliable.
Dependency relations (dep_) are not supported.

🤖 Building the Model

Note

This section is only necessary if you want to build the model yourself. You can read the Installation section if you just want to install it.

The project is split into 4 steps:

Creating the corpus.
- Convert your EPUB/MOBI/FB2 books to plain .txt files.
- Split the text files by sentence. I did this manually with a simple find-and-replace in my code editor, but a short script works too.
- Throw those text files in the texts folder.
- Run node index.ts, and you should see corpus.json generated.
- Look inside to ensure you have an array of sentences.
- Throw the corpus.json file into the step-3 folder to be used later.
Setting up JSON files from kaikki-to-yomitan eo-en data to handle irregular terms and verbs.
- Grab the eo-en dictionary from the kaikki-to-yomitan download page.
- Extract it to its own folder, then throw that folder in the step-2 folder.
- Run node index.ts, and you should see irregular-words.json, plural-pronouns.json, and verbs.json generated.
- Take those three new JSON files, and throw them in the step-3 folder.
Creating the .conllu asset files (these are the files spaCy expects to contain the training data).
- Run pnpm i (or the install command for whatever Node.js package manager you use).
- Run node index.ts, and you should see eo_seth-train.conllu, eo_seth-dev.conllu, and eo_seth-test.conllu generated.
- If you look inside, you'll see they're just specially formatted text files holding your training data for spaCy to work with.
- Make an eo_seth folder within the assets folder inside the step-4 folder.
- Throw the .conllu files into the assets/eo_seth folder.
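For reference, a CoNLL-U file stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `_` for empty fields and a blank line after each sentence. The sketch below formats one sentence this way. It assumes the dependency columns are left empty, since the model does not predict dependencies; the files generated in step 3 may fill the columns differently, and `to_conllu` is a hypothetical name.

```python
from typing import Iterable, Tuple


def to_conllu(tokens: Iterable[Tuple[str, str, str, str]]) -> str:
    """Format (form, lemma, upos, feats) tuples as one CoNLL-U sentence."""
    lines = []
    for index, (form, lemma, upos, feats) in enumerate(tokens, start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        columns = [str(index), form, lemma, upos, "_", feats or "_", "_", "_", "_", "_"]
        lines.append("\t".join(columns))
    # A blank line terminates the sentence.
    return "\n".join(lines) + "\n\n"
```

Calling `to_conllu([("lian", "lia", "DET", "Case=Acc"), ("vizaĝon", "vizaĝo", "NOUN", "Case=Acc|Number=Sing")])` produces two token lines followed by a blank line.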

Creating the model.
- Run python -m weasel run preprocess.
- Run python -m weasel run train.
- Run python -m weasel run package.
- Go to the packages/xx_eo_seth-1.0.0/dist/ folder.
- Run pip install xx_eo_seth-1.0.0.tar.gz.

You can now load the model within your project:

import spacy

nlp = spacy.load("xx_eo_seth")

text = "Do ni devas kunporti buterpanojn, Mumintrolo diris."
doc = nlp(text)

print(text)

for token in doc:
    print(
        token.text,  # surface form
        token.lemma_,  # lemma
        token.pos_,  # POS
        token.morph,  # morphological features
    )

Note: The configuration for the build can be found in project.yml.
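The sentence splitting in step 1 was done by hand, but it can be scripted. Below is a rough sketch that assumes the goal is one sentence per line and that a sentence ends at `.`, `!`, or `?`, optionally followed by a closing quote. The name `split_sentences` is hypothetical, and abbreviations containing periods will be split incorrectly, so the output still needs a quick review.

```python
import re
from typing import List

# Split on whitespace that follows sentence-final punctuation, with or
# without a closing quotation mark. Two separate look-behinds are used
# because Python requires each look-behind to have a fixed width.
_BOUNDARY = re.compile(r'(?:(?<=[.!?])|(?<=[.!?][”"]))\s+')


def split_sentences(text: str) -> List[str]:
    """Split running text into sentences using punctuation heuristics."""
    normalized = " ".join(text.split())  # undo hard line wraps
    return [part for part in _BOUNDARY.split(normalized) if part]
```

Writing `"\n".join(split_sentences(raw_text))` to a .txt file gives one sentence per line.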

🤝 Acknowledgments

spaCy: for its extensible NLP pipeline and training framework.
wiktextract: for providing lexical data that helped validate irregular forms and verbs.
The translators and authors of the training texts.
Masanori Oya and Dan Zeman: for their Prago .conllu file, which helped me figure out the format.
