Contribute
Welcome to the wiki page for contributions. If you are interested in extending or improving the Greek language class of spaCy or the Greek models (el_core_web_sm, el_core_web_lg), this is the place for you.
Below are some suggestions for things you could work on. Each task has a difficulty score and a detailed description further down the page.
First, here is a list of things that you could do to improve the Greek language support for spaCy:
- Add more rules to the lemmatizer (difficulty: easy)
- Overwrite the spaCy tokenizer (difficulty: hard)
- Improve model accuracy (difficulty: medium)
- Demo improvements (difficulty: medium)
First, it is highly recommended to have a look here in order to understand the lemmatization approach used for the Greek language.
From this point on, it is assumed that you have read that wiki page and are comfortable with the approach it describes.
Let's assume that you want to lemmatize the following sentence:
"Όταν η συμφορά συμφέρει, λογάριαζε την για πόρνη."
Normally, you would do something like this:
import spacy

nlp = spacy.load('el_core_web_sm')
doc = nlp('Όταν η συμφορά συμφέρει, λογάριαζε την για πόρνη.')

# Print the POS tag and the predicted lemma of every token.
for token in doc:
    print(token.tag_, token.lemma_)
The output is:
SCONJ όταν
DET η
NOUN συμφορά
VERB συμφέρω
PUNCT ,
VERB λογάριαζε
PRON την
ADP για
NOUN πόρνη
PUNCT .
You just discovered that the lemmatizer did not find the correct lemma for the verb "λογάριαζε". This means that one of the following things happened:
- The POS tag is wrong. The model predicted the wrong POS tag, so the transformation rules failed. In our example this didn't happen, but it is actually pretty common. If it happens often, you may want to check the section below about improving model accuracy.
- There is a rule missing. That is the case here. You need to add the missing rule (or any other missing rule) to the appropriate category by updating this file. "λογάριαζε" is a verb, so you should update the VERB_RULES list. The correct lemma is "λογαριάζω", so an appropriate rule to add would be "-άζε" to "-άζω". Be really careful when adding new rules: the more specific the rule is, the better. Also check that the new rule does not break any of the provided tests (coming soon).
- The word is an exception. There are cases in which the transformation rules are correct but cannot be applied, because the word the token represents is an exception. There are exception lists for verbs, adjectives, determiners and nouns. If you think this is the case for the "wrong" lemmatization you spotted, update the corresponding file. Again, make sure the change does not break any of the tests (coming soon). A rough sketch of how rules and exceptions interact follows this list.
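To make the rule/exception split concrete, here is a minimal sketch, assuming rules are stored as simple (old suffix, new suffix) pairs and exceptions as a plain lookup table. The actual files in the repository may organize the data differently, and the example words and rules below are only illustrative:

```python
# Minimal sketch of how suffix rules and exceptions interact during
# lemmatization. VERB_RULES / VERB_EXCEPTIONS below are illustrative
# stand-ins, not the actual contents of the repository files.

VERB_RULES = [
    ["άζεις", "άζω"],   # e.g. διαβάζεις -> διαβάζω
    ["ει", "ω"],        # e.g. συμφέρει -> συμφέρω
]

VERB_EXCEPTIONS = {
    "είσαι": "είμαι",   # irregular verb, no suffix rule would work here
}

def lemmatize_verb(token):
    # Exceptions take priority over the transformation rules.
    if token in VERB_EXCEPTIONS:
        return VERB_EXCEPTIONS[token]
    # Apply the first (most specific) matching suffix rule.
    for old, new in VERB_RULES:
        if token.endswith(old):
            return token[:-len(old)] + new
    # No rule matched: fall back to the surface form.
    return token

print(lemmatize_verb("συμφέρει"))  # συμφέρω
print(lemmatize_verb("είσαι"))     # είμαι
```

Note that ordering matters in this sketch: more specific rules have to come before more general ones, which is also why any new rule you add should be as specific as possible.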
Each language modifies the spaCy tokenization procedure by adding tokenizer exceptions.
The tokenizer exceptions approach is not scalable for languages such as Greek. If you are wondering why, have a look here; the reasons are pretty much the same.
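For reference, this is roughly what the exception mechanism looks like in spaCy; the Greek strings below are just illustrative examples, not exceptions taken from the repository:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load('el_core_web_sm')

# A tokenizer exception maps one exact surface string to a fixed list of
# sub-tokens. Here we keep the abbreviation "κ.λπ." as a single token
# instead of letting the tokenizer split it on the full stops.
nlp.tokenizer.add_special_case("κ.λπ.", [{ORTH: "κ.λπ."}])

print([t.text for t in nlp("χαρτιά, μολύβια κ.λπ.")])
```

Because each exception covers exactly one surface string, handling productive phenomena this way would mean listing thousands of forms, which is the scalability problem mentioned above.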
A new approach, rule-based tokenization, is proposed. The repo owners are currently working on it, but any help is welcome. The steps we are taking are:
- Rewrite the spaCy tokenizer in pure Python, following the pseudo-code provided here. This is already done; you can find the code here.
- Write regular expressions to catch the following phenomena of the Greek language: "εκθλίψεις" (elision), "αφαιρέσεις" (aphaeresis), "αποκοπές" (apocope). This is an ongoing process.
- Transform the tokens that match one of the phenomena mentioned above into other token(s) using transformation rules. This is an ongoing process; a rough sketch of the idea follows this list.
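As a very rough sketch of the last two steps (detect with a regex, then transform), assuming a made-up pattern and mapping table that are not the repository's actual code:

```python
import re

# Illustrative pattern for elided forms ("εκθλίψεις"): a short word
# ending in an apostrophe, e.g. "απ'", "κατ'".
ELISION_RE = re.compile(r"^(\w{1,4})['’]$")

# Illustrative transformation table: elided form -> full form.
ELISION_TABLE = {
    "απ": "από",
    "κατ": "κατά",
}

def expand_elision(token):
    """Rewrite an elided token to its full form, if it is known."""
    match = ELISION_RE.match(token)
    if match and match.group(1) in ELISION_TABLE:
        return ELISION_TABLE[match.group(1)]
    return token

print(expand_elision("απ'"))   # από
print(expand_elision("όλα"))   # unchanged
```

The real rules will of course have to be much more careful about ambiguity and context; this only shows the detect-then-transform shape of the approach.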
The Greek models ship with a POS tagger, a dependency parser and a named entity recognizer (NER).
The POS tagger and the dependency parser are trained on the Greek UD Treebank dataset, so there are no instructions for further training from our side.
For detailed instructions on how to train the NER component, have a look at the Prodigy wiki page.
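If you want to see what the trained components currently predict before trying to improve them, a quick check with the standard spaCy attributes is enough (the sentence below is just an example):

```python
import spacy

nlp = spacy.load('el_core_web_sm')
doc = nlp('Η Αθήνα είναι η πρωτεύουσα της Ελλάδας.')

# Output of the POS tagger and the dependency parser.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Output of the NER component.
for ent in doc.ents:
    print(ent.text, ent.label_)
```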
Our Demo is NLPbuddy.
For demo improvements, check the NLPBuddy Contribute wiki page.