The lexicon is compiled from a modified version of the 500,000-token manually disambiguated, morphologically tagged corpus of the University of Tartu (https://www.cl.ut.ee/korpused/morfkorpus/).
The lexicon contains trigrams, token ambiguity classes and probabilities and follows the method described in
[Ingo Schröder. 2001.
A Case Study in Part-of-Speech Tagging Using the ICOPOST Toolkit.
http://acopost.sourceforge.net/schroder2002.pdf].
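To make the contents of the lexicon more concrete, here is a minimal sketch of how tag trigrams and token ambiguity classes could be counted from tagged sentences. It is not the actual compilation procedure: the function names `collect_counts` and `trigram_probability` are hypothetical, the probability is a plain relative frequency without the smoothing a real tagger would use, and the second sentence is invented for illustration.

```python
from collections import Counter, defaultdict

def collect_counts(sentences):
    """Count tag trigrams and collect the set of tags seen for each token.

    `sentences` is a list of sentences, each a list of (token, tag) pairs.
    """
    trigrams = Counter()
    tags_per_token = defaultdict(set)
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        # Slide a window of three consecutive tags over the sentence.
        for i in range(len(tags) - 2):
            trigrams[tuple(tags[i:i + 3])] += 1
        for token, tag in sentence:
            tags_per_token[token].add(tag)
    # The ambiguity class of a token is the set of tags it occurred with.
    ambiguity_classes = {tok: frozenset(ts) for tok, ts in tags_per_token.items()}
    return trigrams, ambiguity_classes

def trigram_probability(trigrams, t1, t2, t3):
    """Unsmoothed relative frequency of t3 following the tag pair (t1, t2)."""
    context_total = sum(c for (a, b, _), c in trigrams.items() if (a, b) == (t1, t2))
    return trigrams[(t1, t2, t3)] / context_total if context_total else 0.0

sentences = [
    # Tagged sentence in the style of this document's input format.
    [("Vaatasin", "VM1"), ("selja", "NCSG"), ("taha", "ST"), (".", "WCP")],
    # Invented sentence (assumption), only to give "taha" a second tag.
    [("selja", "NCSG"), ("taha", "RR"), (".", "WCP")],
]
trigrams, ambiguity_classes = collect_counts(sentences)
```

With this toy data, `ambiguity_classes["taha"]` holds both `ST` and `RR`, which is the kind of ambiguity the disambiguator has to resolve from context.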
Input format:
- Every sentence is on a separate line.
- Punctuation marks are separate tokens, delimited by the space character.
- Every token is followed by a disambiguator tag.
Example:
Vaatasin VM1 selja NCSG taha ST . WCP
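The format above can be read with a few lines of code. The following is a minimal sketch, assuming that tokens never contain whitespace and that tokens and tags strictly alternate; the function name `parse_tagged_line` is hypothetical and not part of any existing tool.

```python
def parse_tagged_line(line):
    """Split one sentence line into a list of (token, tag) pairs."""
    fields = line.split()
    # Every token must be followed by exactly one disambiguator tag.
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating tokens and tags: %r" % line)
    return list(zip(fields[0::2], fields[1::2]))

pairs = parse_tagged_line("Vaatasin VM1 selja NCSG taha ST . WCP")
```

A line with an odd number of fields raises `ValueError`, so a token that has lost its tag is caught early.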
Disambiguator tags are nothing more than ad hoc units the disambiguator works with.
The disambiguation principles are described in [Heiki-Jaan Kaalep, Tarmo Vaino. Kas vale meetodiga õiged tulemused? Statistikale tuginev eesti keele morfoloogiline ühestamine. Keel ja Kirjandus 1 1998, lk 30-38.]: "Disambiguator tags (DT) need not be equivalent to the tags that the morphological analyzer assigns to word forms. DT should be regarded as part of the intrinsic mechanism of the disambiguator, while its input and output contain only the info from the morphological analyzer. It is possible that words with different morphological tags occur in similar sentential contexts, or that words with similar tags occur in different contexts. So it would make sense to sometimes collate tags under a single umbrella DT, and sometimes split tags into different DTs. For example, collate nouns and proper nouns under one DT, while splitting pronouns into different DTs: personal pronouns vs. all the others."
There are 119 disambiguator tags.
For declinable words, all their cases are collated into 6 groups and depicted inside the tags by the following final symbols:
* N - nominative
* G - genitive
* 1 - partitive
* A - inner or outer locative case; e.g. Tartul and Tartust have the same DT
* no symbol - the rest of the semantic cases
* X - the word either does not inflect or its case form is unknown, e.g. angoora, 1984, USA
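The final-symbol convention can be decoded mechanically. The sketch below makes two assumptions: the set `DECLINABLE_STEMS` was read off the tag table in this document (the tag without its final case symbol), and the helper name `case_group` is hypothetical. A plain suffix check would not be enough, because non-declinable tags such as VON, VMG or CS also end in N, G or S.

```python
# Final symbols of declinable-word tags and the case groups they stand for.
CASE_SUFFIXES = {
    "N": "nominative",
    "G": "genitive",
    "1": "partitive",
    "A": "inner or outer locative",
    "X": "uninflected or unknown case",
}

# Tag stems of declinable word classes (assumption: derived from the tag table).
DECLINABLE_STEMS = {
    "NCS", "NPS", "AS", "MCS", "MOS",
    "PP1S", "PP2S", "PP3S", "PS", "YKS", "TEINES", "YS",
}

def case_group(tag):
    """Return the case group encoded in a declinable-word tag, else None."""
    if tag in DECLINABLE_STEMS:
        # No final symbol: one of the remaining semantic cases.
        return "other semantic case"
    stem, final = tag[:-1], tag[-1:]
    if stem in DECLINABLE_STEMS and final in CASE_SUFFIXES:
        return CASE_SUFFIXES[final]
    return None
```

For example, `case_group("NCSG")` gives `"genitive"`, while `case_group("VMG")` gives `None` because VMG is a verb tag.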
The table lists all the tags and their frequencies in the underlying 500,000-token corpus (https://www.cl.ut.ee/korpused/morfkorpus/).
| word class | frequency | tag | explanation and examples |
|---|---|---|---|
| common noun | 43511 | NCSN | |
| | 48656 | NCSG | |
| | 23319 | NCS1 | |
| | 36489 | NCSA | |
| | 12607 | NCS | |
| | 106 | NCSX | |
| | 95 | NPCSX | |
| proper noun | 10847 | NPSN | |
| | 8231 | NPSG | |
| | 523 | NPS1 | |
| | 3413 | NPSA | |
| | 385 | NPS | |
| adjective | 14251 | ASN | |
| | 10791 | ASG | |
| | 4828 | AS1 | |
| | 5388 | ASA | |
| | 2630 | AS | |
| | 1071 | ASX | |
| cardinal numeral | 1738 | MCSN | |
| | 1107 | MCSG | |
| | 338 | MCS1 | |
| | 295 | MCSA | |
| | 110 | MCS | |
| | 15917 | MCSX | |
| ordinal numeral | 265 | MOSN | |
| | 244 | MOSG | |
| | 120 | MOS1 | |
| | 241 | MOSA | |
| | 77 | MOS | |
| | 4341 | MOSX | |
| personal pronoun (1st person) | 2421 | PP1SN | |
| | 1133 | PP1SG | |
| | 340 | PP1S1 | |
| | 1008 | PP1SA | |
| | 58 | PP1S | |
| personal pronoun (2nd person) | 1027 | PP2SN | |
| | 270 | PP2SG | |
| | 171 | PP2S1 | |
| | 331 | PP2SA | |
| | 25 | PP2S | |
| personal pronoun (3rd person) | 5561 | PP3SN | |
| | 2452 | PP3SG | |
| | 800 | PP3S1 | |
| | 1628 | PP3SA | |
| | 121 | PP3S | |
| some other pronoun | 10635 | PSN | |
| | 7062 | PSG | |
| | 5700 | PS1 | |
| | 4736 | PSA | |
| | 959 | PS | |
| | 2 | PSX | word muist |
| word "üks" | 879 | YKSN | |
| | 474 | YKSG | |
| | 187 | YKS1 | |
| | 404 | YKSA | |
| | 56 | YKS | |
| word "teine" | 405 | TEINESN | |
| | 394 | TEINESG | |
| | 181 | TEINES1 | |
| | 494 | TEINESA | |
| | 70 | TEINES | |
| verb | 4001 | VM1 | indicative mood, 1st person |
| | 1232 | VM2 | indicative mood, 2nd person |
| | 30177 | VM3 | indicative mood, 3rd person |
| | 5251 | VMK | imperative mood |
| | 2628 | VMS | conditional mood |
| | 362 | VMQ | quotative mood |
| | 10408 | VMD | infinitive |
| | 4269 | VMM | supine forms ending in -ma, -mas, -mast |
| | 444 | VMASS | supine forms ending in -mata |
| | 4707 | VMP | impersonal voice, positive aspect |
| | 280 | VMN | impersonal voice, negative aspect, e.g. saadeta |
| | 13 | VMAP | present participle (ending in -v, -tav) |
| | 16963 | VMAZ | past participle (ending in -nud, -tud) |
| | 147 | VMAS | rare forms of past participle, ending in -nudki, -tudki, -nd |
| | 1849 | VMG | forms ending in -des, -maks |
| | 10034 | VON | copula/auxiliary form on |
| | 4450 | VOLI | copula/auxiliary form oli |
| | 5997 | VME | negation word ei |
| coordinating conjunction | 5954 | CC | word forms &, ega, ehk, ent, ja/või, kuid, või |
| | 19458 | CCJA | words ja, ning, aga |
| | 283 | CCA | word vaid |
| subordinating conjunction | 6740 | CSRR | words kui, justkui, otsekui, kuigi, nagu |
| | 6718 | CS | words ehkki, et, kuna, kuni, olgugi, sest, siis |
| interjection | 320 | II | |
| adverb | 29918 | RR | |
| | 4026 | RRK | |
| | 6203 | RRM | words ainult, hoopis, iial, jälle, kunagi, maha, nii, nüüd, peaaegu, praegu, rohkem, täiesti, uuesti, väga, äkki, üldse, üles |
| | 7089 | RRO | |
| | 2558 | RRY | words kas, kuhu, kuidas, kus, miks, millal |
| | 637 | RRA | negation word ära |
| adjective/adverb | 274 | ASXRR | words "alasti", "päris", "täis", "valmis" |
| preposition | 1372 | SP | precedes a word in partitive case |
| | 185 | SPGP | alla, ligi, peale; precedes a word in genitive or partitive case |
| | 551 | SPG | läbi, üle, ümber, ümbert(error!); precedes a word in genitive case |
| | 932 | SPA | alates, hoolimata, koos, kuni, seoses, tänu, vaatamata, vastavalt, ühes; precedes a word in some semantic case |
| postposition | 9531 | ST | follows a word in genitive case |
| | 280 | STGE | läbi, peale; follows a word in genitive case |
| | 192 | STP | mööda, pidi, tagasi; follows a word in partitive case |
| | 111 | STA | alates, hoolimata, koos, saadik, seoses, vaatamata, vastavalt; follows a word in some semantic case |
| abbreviation | 2 | YSN | |
| | 87 | YSG | |
| | 13 | YS1 | |
| | 160 | YSA | |
| | 42 | YS | |
| | 5358 | YSX | |
| punctuation mark | 7512 | WCB | ] ) |
| | 34730 | WCP | . |
| | 1726 | WCU | ? |
| | 1163 | WCX | ! |
| | 40917 | WIC | , |
| | 2545 | WID | - |
| | 567 | WIE | ... |
| | 1933 | WIL | : |
| | 2997 | WIM | ; |
| | 10269 | WIQ | * |
| | 20 | WIA | / |
| | 5433 | WOB | [ ( |
| unknown token | 704 | X | |