Majka is a linguistic tool for morphological analysis.
Depending on the dictionaries passed to it, Majka translates a word form into its lemma (base form) together with tags describing the word's linguistic properties, or the other way around.
Lemmatization is heavily used in text processing, as it simplifies the processing of text in inflected languages.
An example from Czech: 'dělala' (she did) is transformed to 'dělat' (to do), and tags marking the past tense, feminine gender and other properties are added.
Tags returned by the analyzer that comply with the new tagset reference (for example cs, sk) are transcribed into a native Python dictionary, so they can be used in a Pythonic way without studying the tagset documentation. Tags that are unknown or not recognized correctly are stored in the entry 'other'.
The module is available on PyPI; use pip install majka to install it.
For local build / install use:
./setup.py build
./setup.py install
No dependencies beyond standard Python and a C++ build environment (gcc, python-dev, etc.) should be needed.
Majka requires a morphological database (automaton) to work. See https://nlp.fi.muni.cz/ma/ for a list of available databases.
import majka
morph = majka.Majka('path/to/database')
morph.flags |= majka.ADD_DIACRITICS # find word forms with diacritics
morph.flags |= majka.DISALLOW_LOWERCASE # do not look for lowercase variants
morph.flags |= majka.IGNORE_CASE # ignore word case altogether
morph.flags = 0 # unset all flags
morph.tags = False # return just the lemma, do not process the tags
morph.tags = True # turn tag processing back on (default)
morph.compact_tag = True # return tag in compact form (as returned by Majka)
morph.compact_tag = False # do not return compact tag (default)
morph.first_only = True # return only the first entry
morph.first_only = False # return all entries (default)
morph.find('nejnevhodnější')
[{'lemma': 'vhodný',
'tags': {'case': 1,
'gender': 'feminine',
'negation': True,
'plural': True,
'pos': 'adjective',
'degree': 3}
},
...
]
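Because the result is a plain list of dictionaries, it can be post-processed with ordinary Python. The following is a minimal sketch, assuming each entry carries the 'lemma' and 'tags' keys shown above. The helper name filter_by_pos and the sample data are hypothetical and not part of the module.

```python
def filter_by_pos(entries, pos):
    """Keep only analyses whose 'pos' tag equals the requested part of speech."""
    return [
        entry for entry in entries
        # .get() guards against entries without a 'tags' dict (e.g. tags disabled)
        if entry.get('tags', {}).get('pos') == pos
    ]


# Hand-written sample shaped like the output above; not produced by a real database.
sample = [
    {'lemma': 'vhodný', 'tags': {'pos': 'adjective', 'degree': 3, 'negation': True}},
    {'lemma': 'stát', 'tags': {'pos': 'verb'}},
]

adjectives = filter_by_pos(sample, 'adjective')
print([entry['lemma'] for entry in adjectives])
```

With a real database, the list returned by morph.find(...) would take the place of sample.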
Currently, the tag translation to a Python dictionary works only for databases following the Czech and Slovak tag reference. Other languages may return untranslated tags in the field 'other'.
Setting .tags = False causes the negation to be folded into the lemma itself. By default, a "-" sign is prepended, but the prefix can be changed by setting the .negative value.
morph.tags = False
morph.first_only = True
morph.negative = "ne"
morph.find('nejnevhodnější')
[{'lemma': 'nevhodný'}]
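With tags disabled and first_only enabled, find lends itself to simple lemmatization of a token list. The sketch below works with any object that exposes a find method returning a list of {'lemma': ...} entries. It assumes that an unknown word yields an empty result, which the text above does not confirm. StubAnalyzer and lemmatize are hypothetical names used only so the example runs without a database.

```python
def lemmatize(tokens, analyzer):
    """Map each token to its first lemma; unknown tokens pass through unchanged."""
    lemmas = []
    for token in tokens:
        entries = analyzer.find(token)
        # Assumption: an unknown word yields an empty (falsy) result.
        lemmas.append(entries[0]['lemma'] if entries else token)
    return lemmas


class StubAnalyzer:
    """Hypothetical stand-in for a configured majka.Majka object."""

    def __init__(self, lexicon):
        self.lexicon = lexicon

    def find(self, word):
        lemma = self.lexicon.get(word)
        return [{'lemma': lemma}] if lemma else []


stub = StubAnalyzer({'dělala': 'dělat', 'nejnevhodnější': 'nevhodný'})
print(lemmatize(['dělala', 'nejnevhodnější', 'xyz'], stub))
```

With a real database, the morph object configured above would be passed in place of the stub.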
The module is based on code by Pavel Smerk and Pavel Rychly of the NLP group at MUNI, Czech Republic.
The original majka binary is in majka/majka_bin.cc; see majka/Makefile for the build.
- Tomáš Karabela (@tkarabela) for discovering the memory leaks