Python wrapper for Majka morphological analyzer

petrpulc/python-majka

Python wrapper for Majka

What is Majka

Majka is a linguistic tool for morphological analysis.

Depending on the dictionary supplied, a word form is translated to its lemma (base form) plus tags describing the word's linguistic properties, or the other way around.

What can this module be used for?

Lemmatization is heavily used in text processing because it simplifies working with text in inflected languages.

An example from Czech: 'dělala' (she did) is transformed to 'dělat' (do), and tags marking the past tense, feminine gender and other properties are added.

Tags returned by the analyzer that comply with the new tagset reference (for example cs, sk) are transcribed into a native Python dictionary, so they can be used in a Pythonic way without studying the tagset documentation. Unknown or incorrectly recognized tags are stored under the 'other' key.

Install / Build instructions

The module is available on PyPI; install it with pip install majka.

To build and install locally, use:

./setup.py build
./setup.py install

No dependencies should be needed beyond a standard Python and C++ build environment (gcc, python-dev, etc.).

Usage

Majka requires a morphological database (automaton) to work. See https://nlp.fi.muni.cz/ma/ for a list of available databases.

import majka
morph = majka.Majka('path/to/database')

morph.flags |= majka.ADD_DIACRITICS  # find word forms with diacritics
morph.flags |= majka.DISALLOW_LOWERCASE  # do not look up lowercase variants
morph.flags |= majka.IGNORE_CASE  # ignore letter case entirely
morph.flags = 0  # unset all flags

morph.tags = False  # return just the lemma, do not process the tags
morph.tags = True  # turn tag processing back on (default)

morph.compact_tag = True  # return tag in compact form (as returned by Majka)
morph.compact_tag = False  # do not return compact tag (default)

morph.first_only = True  # return only the first entry
morph.first_only = False  # return all entries (default)

morph.find('nejnevhodnější')

Returns

[{'lemma': 'vhodný',
  'tags': {'case': 1,
           'gender': 'feminine',
           'negation': True,
           'plural': True,
           'pos': 'adjective',
           'degree': 3}
 },
...
]
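Because the result is a plain list of dictionaries, it can be post-processed with ordinary Python. The following is a minimal sketch, not part of the module: lemmas_with_pos is a hypothetical helper, and sample is hand-written data shaped like the output above rather than real analyzer output.

```python
def lemmas_with_pos(entries, pos):
    """Return distinct lemmas of entries whose 'pos' tag equals pos, in first-seen order."""
    seen = []
    for entry in entries:
        tags = entry.get('tags', {})  # 'tags' is absent when tag processing is off
        if tags.get('pos') == pos and entry['lemma'] not in seen:
            seen.append(entry['lemma'])
    return seen


# Hand-written stand-in for the list returned by morph.find(...)
sample = [
    {'lemma': 'vhodný', 'tags': {'pos': 'adjective', 'case': 1, 'degree': 3}},
    {'lemma': 'vhodný', 'tags': {'pos': 'adjective', 'case': 4, 'degree': 3}},
]

print(lemmas_with_pos(sample, 'adjective'))
```

Using .get for the tags keeps the helper usable on entries that carry only a lemma.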

Note on tag translation

Currently, tag translation to a Python dictionary works only for databases that follow the Czech and Slovak tag reference. For other languages, untranslated tags may be returned in the 'other' field.
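When working with such databases, it may help to separate translated tags from the untranslated remainder. The sketch below assumes that 'other' is a key inside the tags dictionary and that its value can be treated as opaque; both are assumptions, and split_tags is a hypothetical helper, not part of the module. The sample entry is hand-written.

```python
def split_tags(entry):
    """Return (translated_tags, untranslated) for one result entry.

    Assumes untranslated tags, if any, live under tags['other'].
    """
    tags = dict(entry.get('tags', {}))  # copy so the entry is not modified
    untranslated = tags.pop('other', None)
    return tags, untranslated


# Hand-written entry; the value of 'other' is a made-up placeholder
entry = {'lemma': 'word', 'tags': {'pos': 'noun', 'other': 'xyz'}}
translated, untranslated = split_tags(entry)
print(translated, untranslated)
```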

Usage with negations

Setting .tags = False folds the negation into the lemma itself. By default a "-" sign is prepended, but the prefix can be changed by setting the .negative attribute.

morph.tags = False
morph.first_only = True
morph.negative = "ne"

morph.find('nejnevhodnější')

Returns

[{'lemma': 'nevhodný'}]
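With these settings the analyzer can serve as a simple lemmatizer for a list of tokens. The sketch below is illustrative only: lemmatize is a hypothetical helper, and StubAnalyzer stands in for a configured majka.Majka object so that the example runs without a database. It assumes that find returns an empty list for unknown words, in which case the token is kept unchanged.

```python
def lemmatize(tokens, analyzer):
    """Replace each token with the lemma of its first analysis, if there is one."""
    lemmas = []
    for token in tokens:
        entries = analyzer.find(token)
        # Assumption: an unknown word yields an empty list
        lemmas.append(entries[0]['lemma'] if entries else token)
    return lemmas


class StubAnalyzer:
    """Hypothetical stand-in for majka.Majka with tags = False and first_only = True."""

    def __init__(self, table):
        self.table = table

    def find(self, word):
        lemma = self.table.get(word)
        return [{'lemma': lemma}] if lemma else []


stub = StubAnalyzer({'nejnevhodnější': 'nevhodný', 'dělala': 'dělat'})
print(lemmatize(['dělala', 'nejnevhodnější', 'xyz'], stub))
# prints ['dělat', 'nevhodný', 'xyz']
```

In real use, the stub would be replaced by the morph object configured above.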

Attributions

The module is based on code by Pavel Smerk and Pavel Rychly of the NLP group at MUNI, Czech Republic.

The original majka binary is in majka/majka_bin.cc; see majka/Makefile for the build.

Thanks

  • Tomáš Karabela (@tkarabela) for discovering the memory leaks