
Enhanced Natural Language Processing for the Lithuanian language with improved algorithms and features.


chikoku/lt-nlp-enhanced


The aim of this project is to study natural language processing (NLP) principles as applied to the Lithuanian language. We analyze how classical NLP methods perform on Lithuanian text and, to that end, implement text classification, topic extraction, search queries, and clustering. Implementation details and further information are in paper/paper.pdf.

# Introduction
Text analysis starts with textual data, so our work began by collecting raw articles from the most popular news website, www.delfi.lt. We crawled articles from 5 categories:

* Criminals (227 articles)
* Music (120 articles)
* Movies (167 articles)
* Sports (136 articles)
* Science (204 articles)

# Classification
Classification performance is measured using a confusion matrix in which rows are the true category and columns are the predicted category. This approach reaches above 90% recall and above 90% precision.

[Figure: confusion matrix]

# Topics extraction
The figure shows 6 components with 10 tokens for each component. From these results, we can detect the most important words and intuitively guess the topic of each principal component. For example, the 4th principal component stores information about sports and music, whereas the 6th principal component stores information about criminals.

[Figure: 6 components with 10 tokens each]

# Search query
Search follows the approach described in http://webhome.cs.uvic.ca/~thomo/svd.pdf, which applies latent semantic analysis (LSA) to find related documents using not only exact query matches but also deeper relations between documents.

## Example
Query = "švietim apdovanojam" (stemmed Lithuanian tokens, roughly "education" and "award")

Result:

* ['Imasi mokslininkų algų: siūlo kelti iki 50 proc.']
* ['Įteiktos 6 Mokslo premijos']
* ['Lietuvoje į susitikimą kviečia Nobelio premijos laureatas']
* ['100 tūkst. eurų išdalins populiarinantiems mokslą']
* ['V. Vaičaitis. Konkursinis mokslo finansavimas ar pasityčiojimas iš mokslininkų?']

# Clustering
In progress.
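
To illustrate the LSA-based search described in the Search query section, here is a minimal sketch. It assumes the common "folding-in" formulation, where the query is projected into the latent space as q^T U_k S_k^{-1} and compared to documents by cosine similarity; it is not taken from this repository's code and may differ from the actual implementation. The toy corpus of stemmed tokens, the function names `build_term_document_matrix` and `lsa_search`, and the choice `k=2` are all hypothetical.

```python
import numpy as np

# Hypothetical toy corpus of pre-stemmed tokens (not data from the project).
DOCS = [
    "moksl premij mokslinink",
    "moksl premij švietim",
    "komand rungtyn krepšin",
    "komand rungtyn pergal",
]


def build_term_document_matrix(docs):
    """Return (term index, count matrix) with one row per term, one column per document."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.split():
            matrix[index[tok], j] += 1.0
    return index, matrix


def lsa_search(docs, query, k=2):
    """Rank documents by cosine similarity to the query in a k-dimensional latent space."""
    index, A = build_term_document_matrix(docs)

    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    doc_vecs = Vt[:k, :].T  # one k-dimensional row per document

    # Bag-of-words query vector; tokens outside the vocabulary are ignored.
    q = np.zeros(len(index))
    for tok in query.split():
        if tok in index:
            q[index[tok]] += 1.0

    # Fold the query into the latent space (assumed formulation).
    q_k = (q @ U_k) / s_k

    q_norm = np.linalg.norm(q_k)
    scores = []
    for j, d in enumerate(doc_vecs):
        denom = q_norm * np.linalg.norm(d)
        # Guard against an all-zero query or document vector.
        scores.append((j, float(q_k @ d / denom) if denom > 0 else 0.0))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in lsa_search(DOCS, "švietim"):
        print(f"{score:+.3f}  {DOCS[doc_id]}")
```

In this toy run the query token `švietim` occurs only in the second document, yet the first document ranks alongside it because the two share `moksl` and `premij`. The two sports documents score near zero. This is the kind of relation beyond exact matching that LSA is meant to surface.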
