CARMLS Hindi

This is a centralized repository of the work of the CARMLS research group on SNACS (Semantic Network of Adposition and Case Supersenses) for Hindi.

Currently, this includes the entire annotated Nanhā Rājkumār (The Little Prince) in Hindi.

Open guideline questions are tracked as issues on the repo.

Some differences from English STREUSLE annotations

Every token with the PRON lexcat gets a SNACS annotation. Nominative, wh-, oblique-case, and unmarked reflexive pronouns are exempt; they are handled by the PRON.NOM, PRON.WH, PRON.OBL, and PRON.REFL exceptions.
The PRON lexcat skips the validator check that an MWE's lexlemma must match its lemma, because of some decisions about indexing the case marker in the lexlemma for irregular pronoun forms.
A new PART.FOC lexcat covers tokens with the UD tag PART that get FOCUS annotations. The negative particles [nahin, na] get the ADV lexcat. Other particles get the PART lexcat.
Some MWE adpositions can't be validated against lemma forms: the tokens [ki, ke] seem to be lemmatized to [ka], but the MWE is always [ki tarah] or [ke/ki upar]. I haven't come across forms like [ka upar] or [ka tarah] myself. MWE lexlemmas are therefore validated against the raw tokens instead of the lemmas.
NONSNACS label has been removed.
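
As an illustration of the particle and MWE rules above, here is a minimal Python sketch. It is not the repository's validator: the function names, the dict-shaped tokens with "form" and "lemma" keys, and the choice to test for negative particles before FOCUS are all assumptions made for this example.

```python
# Hypothetical sketch of two rules from the list above.
# None of these names come from the repository's validator.

# Transliterations as written in this README (assumed spelling).
NEGATIVE_PARTICLES = {"nahin", "na"}


def particle_lexcat(form, has_focus_annotation):
    """Choose a lexcat for a token whose UD tag is PART.

    Assumption: the negative-particle rule is checked first.
    """
    if form in NEGATIVE_PARTICLES:
        return "ADV"
    if has_focus_annotation:
        return "PART.FOC"
    return "PART"


def expected_mwe_lexlemma(tokens):
    """Build the expected lexlemma from raw token forms, not lemmas."""
    return " ".join(tok["form"] for tok in tokens)


def mwe_lexlemma_ok(lexlemma, tokens, lexcat):
    """Check an MWE lexlemma against the raw tokens.

    Assumption: only the exact lexcat "PRON" skips the check.
    """
    if lexcat == "PRON":
        return True
    return lexlemma == expected_mwe_lexlemma(tokens)


# Example: [ki] is lemmatized to [ka], but the MWE is [ki tarah].
ki_tarah = [
    {"form": "ki", "lemma": "ka"},
    {"form": "tarah", "lemma": "tarah"},
]
print(mwe_lexlemma_ok("ki tarah", ki_tarah, "P"))  # raw-token match
print(mwe_lexlemma_ok("ka tarah", ki_tarah, "P"))  # lemma-based form
```

Comparing against the raw forms means an entry such as [ki tarah] is accepted as written, while a lemma-based [ka tarah] is rejected. The lexcat "P" in the example is only a placeholder for an adposition.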
