Disambiguator lexicon of Vabamorf

The lexicon is compiled from a modified version of the 500,000-token manually disambiguated, morphologically tagged corpus of the University of Tartu (https://www.cl.ut.ee/korpused/morfkorpus/).

The lexicon contains trigrams, token ambiguity classes and probabilities and follows the method described in
[Ingo Schröder. 2001. A Case Study in Part-of-Speech Tagging Using the ICOPOST Toolkit. http://acopost.sourceforge.net/schroder2002.pdf].
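As a rough illustration of what such a lexicon records, the sketch below counts tag trigrams and per-token tag distributions from tagged sentences. It is not Vabamorf's or ICOPOST's actual code: the function names are hypothetical, "ambiguity class" is taken here to mean the set of tags observed for a token, and the probabilities are plain relative frequencies without the smoothing a real tagger would apply.

```python
from collections import Counter, defaultdict

# Toy data: one sentence as a list of (token, disambiguator tag) pairs.
sentences = [
    [("Vaatasin", "VM1"), ("selja", "NCSG"), ("taha", "ST"), (".", "WCP")],
]


def collect_counts(sentences):
    """Count tag trigrams and, for every token, the tags it occurs with."""
    trigram_counts = Counter()
    token_tags = defaultdict(Counter)
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        # Slide a window of three consecutive tags over the sentence.
        for i in range(len(tags) - 2):
            trigram_counts[tuple(tags[i:i + 3])] += 1
        for token, tag in sentence:
            token_tags[token][tag] += 1
    return trigram_counts, token_tags


def tag_probabilities(token_tags, token):
    """Relative frequency of each tag for a token (unsmoothed assumption)."""
    counts = token_tags[token]
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


trigram_counts, token_tags = collect_counts(sentences)
# Ambiguity class of a token: the set of tags it was observed with.
ambiguity_class = set(token_tags["taha"])
```

With a larger corpus, a token seen with several tags would get an ambiguity class of more than one element, and `tag_probabilities` would split the probability mass among them.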

Input format:

  • Every sentence is on a separate line.
  • Punctuation marks are separate tokens, delimited by the space character.
  • Every token is followed by a disambiguator tag.

Example:

Vaatasin VM1 selja NCSG taha ST . WCP
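A line in this format can be read by splitting on whitespace and pairing every token with the tag that follows it. The helper below is a minimal sketch with a hypothetical name, assuming that tokens never contain internal whitespace.

```python
def parse_tagged_line(line):
    """Split one sentence line into (token, tag) pairs.

    Assumes fields alternate token, tag, token, tag, ... and that
    tokens contain no whitespace.
    """
    fields = line.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected an even number of fields (token/tag pairs)")
    # Even positions hold tokens, odd positions hold their tags.
    return list(zip(fields[0::2], fields[1::2]))


pairs = parse_tagged_line("Vaatasin VM1 selja NCSG taha ST . WCP")
```

For the example sentence this yields four pairs, with the final full stop tagged `WCP`.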

Disambiguator tags

Disambiguator tags are nothing more than ad hoc units the disambiguator works with.

The disambiguation principles are described in [Heiki-Jaan Kaalep, Tarmo Vaino. Kas vale meetodiga õiged tulemused? Statistikale tuginev eesti keele morfoloogiline ühestamine. Keel ja Kirjandus 1, 1998, pp. 30-38.]: "Disambiguator tags (DT) need not be equivalent to the tags that the morphological analyzer assigns to word forms. DT should be regarded as part of the intrinsic mechanism of the disambiguator, while its input and output contain only the info from the morphological analyzer. It is possible that words with different morphological tags occur in similar sentential contexts, or that words with similar tags occur in different contexts. So it would make sense to sometimes collate tags under a single umbrella DT, and sometimes split tags into different DTs. For example, collate nouns and proper nouns under one DT, while splitting pronouns into different DTs: personal pronouns vs. all the others."

There are 119 disambiguator tags.

For declinable words, all cases are collated into 6 groups, marked in the tag by the following final symbols:

  • N - nominative
  • G - genitive
  • 1 - partitive
  • A - inner or outer locative case; e.g. Tartul and Tartust have the same DT
  • no symbol - the rest of the semantic cases
  • X - the word either does not inflect or its case form is unknown, e.g. angoora, 1984, USA
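The final-symbol convention can be decoded mechanically, as long as it is applied only to tags of declinable word classes (otherwise `VM1`, a verb tag, would be misread as partitive). The sketch below is illustrative only: the function name is hypothetical, and the list of tag stems was read off the tag table in this document rather than taken from Vabamorf's code.

```python
# Tag stems of declinable word classes, as they appear in the tag table.
DECLINABLE_STEMS = (
    "NCS", "NPS", "NPCS", "AS", "MCS", "MOS",
    "PP1S", "PP2S", "PP3S", "PS", "YKS", "TEINES", "YS",
)

# Final symbol of the tag -> case group.
CASE_GROUPS = {
    "N": "nominative",
    "G": "genitive",
    "1": "partitive",
    "A": "inner or outer locative",
    "": "other semantic case",
    "X": "uninflected or unknown case",
}


def case_group(tag):
    """Return the case group encoded in a declinable tag, else None."""
    for stem in DECLINABLE_STEMS:
        if tag.startswith(stem):
            suffix = tag[len(stem):]
            if suffix in CASE_GROUPS:
                return CASE_GROUPS[suffix]
    # Not a declinable tag (e.g. VM1, CS, ASXRR).
    return None


group = case_group("NCSG")
```

Tags such as `ASXRR` or `CS` fall through to `None`, because they either carry an unknown suffix or do not begin with a declinable stem.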

The table below lists all the tags and their frequencies in the underlying 500,000-token corpus (https://www.cl.ut.ee/korpused/morfkorpus/).

word class frequency tag explanation and examples
common noun 43511 NCSN
  48656 NCSG
  23319 NCS1
  36489 NCSA
  12607 NCS
  106 NCSX
  95 NPCSX
proper noun 10847 NPSN
  8231 NPSG
  523 NPS1
  3413 NPSA
  385 NPS
adjective 14251 ASN
  10791 ASG
  4828 AS1
  5388 ASA
  2630 AS
  1071 ASX
cardinal numeral 1738 MCSN
  1107 MCSG
  338 MCS1
  295 MCSA
  110 MCS
  15917 MCSX
ordinal numeral 265 MOSN
  244 MOSG
  120 MOS1
  241 MOSA
  77 MOS
  4341 MOSX
personal pronoun (1st person) 2421 PP1SN
  1133 PP1SG
  340 PP1S1
  1008 PP1SA
  58 PP1S
personal pronoun (2nd person) 1027 PP2SN
  270 PP2SG
  171 PP2S1
  331 PP2SA
  25 PP2S
personal pronoun (3rd person) 5561 PP3SN
  2452 PP3SG
  800 PP3S1
  1628 PP3SA
  121 PP3S
some other pronoun 10635 PSN
  7062 PSG
  5700 PS1
  4736 PSA
  959 PS
  2 PSX word muist
word "üks" 879 YKSN
  474 YKSG
  187 YKS1
  404 YKSA
  56 YKS
word "teine" 405 TEINESN
  394 TEINESG
  181 TEINES1
  494 TEINESA
  70 TEINES
verb 4001 VM1 indicative mood, 1st person
  1232 VM2 indicative mood, 2nd person
  30177 VM3 indicative mood, 3rd person
  5251 VMK imperative mood
  2628 VMS conditional mood
  362 VMQ quotative mood
  10408 VMD infinitive
  4269 VMM supine forms ending in -ma, -mas, -mast
  444 VMASS supine forms ending in -mata
  4707 VMP impersonal voice, positive aspect
  280 VMN impersonal voice, negative aspect, e.g. saadeta
  13 VMAP present participle (ending in -v, -tav)
  16963 VMAZ past participle (ending in -nud, -tud)
  147 VMAS rare forms of past participle, ending in -nudki, -tudki, -nd
  1849 VMG forms ending in -des, -maks
  10034 VON copula/auxiliary form on
  4450 VOLI copula/auxiliary form oli
  5997 VME negation word ei
coordinating conjunction 5954 CC word forms &, ega, ehk, ent, ja/või, kuid, või
  19458 CCJA words ja, ning, aga
  283 CCA word vaid
subordinating conjunction 6740 CSRR words kui, justkui, otsekui, kuigi, nagu
  6718 CS words ehkki, et, kuna, kuni, olgugi, sest, siis
interjection 320 II
adverb 29918 RR
  4026 RRK
  6203 RRM words ainult, hoopis, iial, jälle, kunagi, maha, nii, nüüd, peaaegu, praegu, rohkem, täiesti, uuesti, väga, äkki, üldse, üles
  7089 RRO
  2558 RRY words kas, kuhu, kuidas, kus, miks, millal
  637 RRA negation word ära
adjective/adverb 274 ASXRR words "alasti", "päris", "täis", "valmis"
preposition 1372 SP precedes a word in partitive case
  185 SPGP alla, ligi, peale; precedes a word in genitive or partitive case
  551 SPG läbi, üle, ümber, ümbert(error!); precedes a word in genitive case
  932 SPA alates, hoolimata, koos, kuni, seoses, tänu, vaatamata, vastavalt, ühes; precedes a word in some semantic case
postposition 9531 ST follows a word in genitive case
  280 STGE läbi, peale; follows a word in genitive case
  192 STP mööda, pidi, tagasi; follows a word in partitive case
  111 STA alates, hoolimata, koos, saadik, seoses, vaatamata, vastavalt; follows a word in some semantic case
abbreviation 2 YSN
  87 YSG
  13 YS1
  160 YSA
  42 YS
  5358 YSX
punctuation mark 7512 WCB ] )
  34730 WCP .
  1726 WCU ?
  1163 WCX !
  40917 WIC ,
  2545 WID -
  567 WIE ...
  1933 WIL :
  2997 WIM ;
  10269 WIQ *
  20 WIA /
  5433 WOB [ (
unknown token 704 X