Disambiguator lexicon of Vabamorf

The lexicon is compiled from a modified version of the 500,000-token manually disambiguated, morphologically tagged corpus of the University of Tartu (https://www.cl.ut.ee/korpused/morfkorpus/).

The lexicon contains trigrams, token ambiguity classes and probabilities and follows the method described in
[Ingo Schröder. 2001. A Case Study in Part-of-Speech Tagging Using the ICOPOST Toolkit. http://acopost.sourceforge.net/schroder2002.pdf].
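
To make the notion of tag trigrams concrete, here is a minimal Python sketch that counts tag trigrams and turns them into plain relative-frequency estimates. It is not the code used to build the lexicon: the function names, the `<s>` boundary marker and the absence of smoothing are all assumptions made for illustration, and a real tagger would normally smooth these estimates.

```python
from collections import Counter


def tag_trigrams(tags, boundary="<s>"):
    """Return the tag trigrams of one sentence.

    The sentence is padded with a hypothetical boundary marker
    (two at the start, one at the end).
    """
    padded = [boundary, boundary] + list(tags) + [boundary]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def trigram_probabilities(sentences):
    """Estimate P(t3 | t1, t2) by unsmoothed relative frequency."""
    trigram_counts = Counter()
    context_counts = Counter()
    for tags in sentences:
        for t1, t2, t3 in tag_trigrams(tags):
            trigram_counts[(t1, t2, t3)] += 1
            context_counts[(t1, t2)] += 1
    return {
        trigram: count / context_counts[trigram[:2]]
        for trigram, count in trigram_counts.items()
    }


# Tag sequences of two toy sentences (tags taken from the table in this document).
probs = trigram_probabilities([
    ["VM1", "NCSG", "ST", "WCP"],
    ["VM1", "NCSG", "WCP"],
])
```

With these two toy sentences, the context `("VM1", "NCSG")` is followed once by `ST` and once by `WCP`, so each continuation gets half of the probability mass.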

Input format:

Every sentence is on a separate line.
Punctuation marks are separate tokens, delimited by the space character.
Every token is followed by a disambiguator tag.

Example:

Vaatasin VM1 selja NCSG taha ST . WCP
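
The format above can be read with a few lines of Python. This is only a sketch under the assumptions stated in this section (one sentence per line, tokens and tags alternating, separated by whitespace); `parse_tagged_line` is a hypothetical helper, not part of Vabamorf.

```python
def parse_tagged_line(line):
    """Split one input line into (token, disambiguator tag) pairs."""
    fields = line.split()
    if len(fields) % 2 != 0:
        # Every token must be followed by exactly one tag.
        raise ValueError("expected alternating token/tag fields: %r" % line)
    return list(zip(fields[0::2], fields[1::2]))


pairs = parse_tagged_line("Vaatasin VM1 selja NCSG taha ST . WCP")
# pairs == [("Vaatasin", "VM1"), ("selja", "NCSG"), ("taha", "ST"), (".", "WCP")]
```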

Disambiguator tags

Disambiguator tags are nothing more than ad hoc units the disambiguator works with.

The disambiguation principles are described in [Heiki-Jaan Kaalep, Tarmo Vaino. Kas vale meetodiga õiged tulemused? Statistikale tuginev eesti keele morfoloogiline ühestamine. Keel ja Kirjandus 1 1998, lk 30-38.]: "Disambiguator tags (DT) need not be equivalent to the tags that the morphological analyzer assigns to word forms. DT should be regarded as part of the intrinsic mechanism of the disambiguator, while its input and output contain only the info from the morphological analyzer. It is possible that words with different morphological tags occur in similar sentential contexts, or that words with similar tags occur in different contexts. So it would make sense to sometimes collate tags under a single umbrella DT, and sometimes split tags into different DTs. For example, collate nouns and proper nouns under one DT, while splitting pronouns into different DTs: personal pronouns vs. all the others."

There are 119 disambiguator tags.

For declinable words, all cases are collated into 6 groups, which are marked in the tag by the following final symbols:

* N - nominative
* G - genitive
* 1 - partitive
* A - inner or outer locative case; e.g. Tartul and Tartust have the same DT
* no symbol - the rest of the semantic cases
* X - the word either does not inflect or its case form is unknown, e.g. angoora, 1984, USA
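
The final-symbol convention can be decoded mechanically, as the following Python sketch shows. The list of tag stems is an assumption read off the tag table in this document, and `case_group` is a hypothetical helper. Tags that do not fit the stem-plus-suffix pattern (for example verb tags such as VM1, or the irregular NPCSX and ASXRR) yield `None`.

```python
# Stems of declinable-word tags, as they appear in the tag table (assumed list).
DECLINABLE_STEMS = (
    "TEINES", "PP1S", "PP2S", "PP3S",
    "NCS", "NPS", "MCS", "MOS", "YKS",
    "AS", "PS", "YS",
)

# Final symbol -> case group, following the list above.
CASE_GROUPS = {
    "N": "nominative",
    "G": "genitive",
    "1": "partitive",
    "A": "inner or outer locative",
    "": "other semantic case",
    "X": "uninflected or unknown",
}


def case_group(tag):
    """Return the case group encoded in a declinable-word tag, else None."""
    for stem in DECLINABLE_STEMS:
        if tag.startswith(stem):
            suffix = tag[len(stem):]
            if suffix in CASE_GROUPS:
                return CASE_GROUPS[suffix]
    return None
```

The explicit stem list matters: a check on the last character alone would misread VM1 (indicative mood, 1st person) as a partitive form.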

The table lists all the tags and their frequencies in the underlying 500,000-token corpus (https://www.cl.ut.ee/korpused/morfkorpus/).

word class	frequency	tag	explanation and examples
common noun	43511	NCSN
	48656	NCSG
	23319	NCS1
	36489	NCSA
	12607	NCS
	106	NCSX
	95	NPCSX
proper noun	10847	NPSN
	8231	NPSG
	523	NPS1
	3413	NPSA
	385	NPS
adjective	14251	ASN
	10791	ASG
	4828	AS1
	5388	ASA
	2630	AS
	1071	ASX
cardinal numeral	1738	MCSN
	1107	MCSG
	338	MCS1
	295	MCSA
	110	MCS
	15917	MCSX
ordinal numeral	265	MOSN
	244	MOSG
	120	MOS1
	241	MOSA
	77	MOS
	4341	MOSX
personal pronoun (1st person)	2421	PP1SN
	1133	PP1SG
	340	PP1S1
	1008	PP1SA
	58	PP1S
personal pronoun (2nd person)	1027	PP2SN
	270	PP2SG
	171	PP2S1
	331	PP2SA
	25	PP2S
personal pronoun (3rd person)	5561	PP3SN
	2452	PP3SG
	800	PP3S1
	1628	PP3SA
	121	PP3S
some other pronoun	10635	PSN
	7062	PSG
	5700	PS1
	4736	PSA
	959	PS
	2	PSX	word muist
word "üks"	879	YKSN
	474	YKSG
	187	YKS1
	404	YKSA
	56	YKS
word "teine"	405	TEINESN
	394	TEINESG
	181	TEINES1
	494	TEINESA
	70	TEINES
verb	4001	VM1	indicative mood, 1st person
	1232	VM2	indicative mood, 2nd person
	30177	VM3	indicative mood, 3rd person
	5251	VMK	imperative mood
	2628	VMS	conditional mood
	362	VMQ	quotative mood
	10408	VMD	infinitive
	4269	VMM	supine forms ending in -ma, -mas, -mast
	444	VMASS	supine forms ending in -mata
	4707	VMP	impersonal voice, positive aspect
	280	VMN	impersonal voice, negative aspect, e.g. saadeta
	13	VMAP	present participle (ending in -v, -tav)
	16963	VMAZ	past participle (ending in -nud, -tud)
	147	VMAS	rare forms of past participle, ending in -nudki, -tudki, -nd
	1849	VMG	forms ending in -des, -maks
	10034	VON	copula/auxiliary form on
	4450	VOLI	copula/auxiliary form oli
	5997	VME	negation word ei
coordinating conjunction	5954	CC	word forms &, ega, ehk, ent, ja/või, kuid, või
	19458	CCJA	words ja, ning, aga
	283	CCA	word vaid
subordinating conjunction	6740	CSRR	words kui, justkui, otsekui, kuigi, nagu
	6718	CS	words ehkki, et, kuna, kuni, olgugi, sest, siis
interjection	320	II
adverb	29918	RR
	4026	RRK
	6203	RRM	words ainult, hoopis, iial, jälle, kunagi, maha, nii, nüüd, peaaegu, praegu, rohkem, täiesti, uuesti, väga, äkki, üldse, üles
	7089	RRO
	2558	RRY	words kas, kuhu, kuidas, kus, miks, millal
	637	RRA	negation word ära
adjective/adverb	274	ASXRR	words "alasti", "päris", "täis", "valmis"
preposition	1372	SP	precedes a word in partitive case
	185	SPGP	alla, ligi, peale; precedes a word in genitive or partitive case
	551	SPG	läbi, üle, ümber, ümbert(error!); precedes a word in genitive case
	932	SPA	alates, hoolimata, koos, kuni, seoses, tänu, vaatamata, vastavalt, ühes; precedes a word in some semantic case
postposition	9531	ST	follows a word in genitive case
	280	STGE	läbi, peale; follows a word in genitive case
	192	STP	mööda, pidi, tagasi; follows a word in partitive case
	111	STA	alates, hoolimata, koos, saadik, seoses, vaatamata, vastavalt; follows a word in some semantic case
abbreviation	2	YSN
	87	YSG
	13	YS1
	160	YSA
	42	YS
	5358	YSX
punctuation mark	7512	WCB	] )
	34730	WCP	.
	1726	WCU	?
	1163	WCX	!
	40917	WIC	,
	2545	WID	-
	567	WIE	...
	1933	WIL	:
	2997	WIM	;
	10269	WIQ	*
	20	WIA	/
	5433	WOB	[ (
unknown token	704	X
