Closed
Description
The default Latin package (ITTB) usually fails to lemmatize words that start with a capital letter. This seems to happen whether the word is a proper noun that is normally capitalised (e.g. "Iacobi"), a common word that is capitalised unusually, or a word capitalised out of devotion (e.g. "Deo"). The problem appears systematic, although not universal: in the example below, "Erat" is lemmatized to "sum". I have not dug into what might provoke this behaviour.
To Reproduce
See the code below.
Environment (please complete the following information):
- OS: Ubuntu
- Python version: Conda Python 3.10.9
- Stanza version: 1.6.1
import stanza
latindefault = stanza.Pipeline('la', processors='tokenize,pos,lemma')
#%%
sent = "Quod Erat Demonstrandum"
print(latindefault(sent))
#### Tags the parts of speech correctly, but the lemma is left identical to the capitalised token (output for token 3 shown).
# {
# "id": 3,
# "text": "Demonstrandum",
# "lemma": "Demonstrandum",
# "upos": "VERB",
# "xpos": "J2|modO|grp1|casA|gen3",
# "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
# "start_char": 10,
# "end_char": 23
# }
print(latindefault(sent.lower()))
#### Tags the parts of speech correctly and lemmatizes (output for token 3 shown).
# {
# "id": 3,
# "text": "demonstrandum",
# "lemma": "demonstro",
# "upos": "VERB",
# "xpos": "J2|modO|grp1|casA|gen3",
# "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
# "start_char": 10,
# "end_char": 23
# }
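Until the cause is found, one possible workaround is to run the pipeline on both the original and the lowercased sentence, as above, and take the lowercased run's lemma wherever a capitalised token came back unlemmatized. The following is only a minimal sketch of that idea: `merge_lowercase_lemmas` is a hypothetical helper, not part of Stanza, and it assumes both runs tokenize the sentence identically. The sample values are taken from the output above.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (surface text, lemma)


def merge_lowercase_lemmas(original: List[Pair], lowered: List[Pair]) -> List[Pair]:
    """Hypothetical helper: fill in lemmas that the capitalised run left untouched.

    `original` and `lowered` are (text, lemma) pairs from the pipeline run on
    the sentence as written and on its lowercased copy. With Stanza they could
    be built as:
        [(w.text, w.lemma) for s in doc.sentences for w in s.words]
    """
    if len(original) != len(lowered):
        # Assumption: both runs produce the same tokens; bail out otherwise.
        raise ValueError("tokenisation differs between the two runs")

    merged = []
    for (text, lemma), (low_text, low_lemma) in zip(original, lowered):
        # A capitalised token whose lemma equals its surface form was
        # probably not lemmatized at all.
        unlemmatized = text[:1].isupper() and lemma == text
        # Only substitute when the lowercased run actually changed the form.
        if unlemmatized and low_lemma != low_text:
            lemma = low_lemma
        merged.append((text, lemma))
    return merged


# Values for token 3 from the two runs shown above.
original = [("Demonstrandum", "Demonstrandum")]
lowered = [("demonstrandum", "demonstro")]
print(merge_lowercase_lemmas(original, lowered))
# [('Demonstrandum', 'demonstro')]
```

One caveat: for genuine proper nouns such as "Iacobi", the lowercased run would presumably return a lowercase lemma, so the capitalisation of the lemma may need restoring afterwards.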