Latin default package doesn't usually lemmatize words starting with a capital letter #1330

@pseudomonas

Description

The default Latin package (ITTB) usually fails to lemmatize words that start with a capital letter. This seems to happen whether the word is a proper noun that is normally capitalised (e.g. "Iacobi"), a common word that is unusually capitalised, or a word capitalised out of devotion (e.g. "Deo"). The problem seems to be systematic, though not universal: in the example below, "Erat" is lemmatized to "sum". I have not done any digging into what might provoke this behaviour.

To Reproduce
See the code below.

Environment (please complete the following information):

  • OS: Ubuntu
  • Python version: Conda Python 3.10.9
  • Stanza version: 1.6.1

import stanza

# Default Latin package (ITTB)
latindefault = stanza.Pipeline('la', processors='tokenize,pos,lemma')

sent = "Quod Erat Demonstrandum"

print(latindefault(sent))

# Word 3 of the output: the part of speech is identified correctly, but the word is not lemmatized.
 # {
 #      "id": 3,
 #      "text": "Demonstrandum",
 #      "lemma": "Demonstrandum",
 #      "upos": "VERB",
 #      "xpos": "J2|modO|grp1|casA|gen3",
 #      "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
 #      "start_char": 10,
 #      "end_char": 23
 #    }

print(latindefault(sent.lower()))
# Word 3 of the output: the part of speech is identified correctly, and the word is lemmatized.

# {
#       "id": 3,
#       "text": "demonstrandum",
#       "lemma": "demonstro",
#       "upos": "VERB",
#       "xpos": "J2|modO|grp1|casA|gen3",
#       "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
#       "start_char": 10,
#       "end_char": 23
#     }
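
The manual comparison above (run the sentence as written, then lowercased) could be automated to check more sentences. The following is only a rough sketch: the helper names `lemma_pairs`, `case_sensitive_lemmas` and `report` are made up for illustration and are not part of Stanza. It assumes the `latindefault` pipeline defined above and relies only on the `sentences`, `words`, `text` and `lemma` attributes of the returned document.

```python
def lemma_pairs(doc):
    """Return (text, lemma) for every word in a Stanza Document."""
    return [(w.text, w.lemma) for s in doc.sentences for w in s.words]


def case_sensitive_lemmas(original, lowered):
    """Compare (text, lemma) pairs from the as-written and lowercased runs.

    Returns (text, lemma_as_written, lemma_lowercased) for each word whose
    lemma differs between the two runs, ignoring case in the lemma itself.
    """
    if len(original) != len(lowered):
        # Hypothetical guard: the comparison only makes sense if both runs
        # were split into the same number of words.
        raise ValueError("tokenization differs between the two runs")
    return [
        (text, lemma, low_lemma)
        for (text, lemma), (_, low_lemma) in zip(original, lowered)
        if (lemma or "").lower() != (low_lemma or "").lower()
    ]


def report(pipeline, sentence):
    """Run `pipeline` on the sentence and on its lowercased form."""
    return case_sensitive_lemmas(
        lemma_pairs(pipeline(sentence)),
        lemma_pairs(pipeline(sentence.lower())),
    )
```

Calling `report(latindefault, "Quod Erat Demonstrandum")` should, going by the output pasted above, include an entry for "Demonstrandum" with the lemmas "Demonstrandum" and "demonstro". Lemmas are compared case-insensitively so that a proper noun whose lemma merely keeps its capital is not flagged.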
