The main goal of this research is to study natural language processing (NLP) principles as applied to the Lithuanian language. We wanted to see how classical NLP methods perform on Lithuanian text, so in this work we implemented text classification, topics extraction, search query, and clustering ideas. Implementation details and further information are stored at paper/paper.pdf

# Introduction
Text analysis cannot start without textual data, so our work began by collecting raw data from the most popular news website, www.delfi.lt. We decided to crawl articles from 5 categories: Criminals (227 articles), Music (120 articles), Movies (167 articles), Sports (136 articles), and Science (204 articles).

# Classification
Classification performance is measured using a confusion matrix in which rows are the true category and columns are the predicted category. This approach reaches above 90% recall and 90% precision.

# Topics extraction
The figure shows 6 components with 10 tokens for each component. From these results, we can detect the most important words and intuitively guess the topic of each principal component. For example, the 4th principal component stores information about sports and music, whereas the 6th principal component stores information about criminals.
The main results are presented in the figure below:
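Separately from the figure, the idea of reading the most important tokens off each component can be illustrated with a minimal sketch. It assumes the components come from a singular value decomposition of a term-document count matrix; the README does not state which decomposition was actually used, and the toy corpus of stemmed tokens, the helper name `top_tokens`, and the use of NumPy are all assumptions made for illustration, not the project's code or data.

```python
import numpy as np

# Hypothetical toy corpus of already-stemmed tokens (NOT the delfi.lt data).
docs = [
    "komand rungtyn pergal",
    "komand rungtyn rezultat",
    "komand rungtyn",
    "policij teism nusikaltim",
    "policij teism byl",
]

vocab = sorted({tok for doc in docs for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocab)}

# Term-document count matrix: one row per token, one column per document.
X = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for tok in doc.split():
        X[index[tok], j] += 1.0

# Columns of U hold the token loadings of each component,
# ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(X, full_matrices=False)


def top_tokens(component, n=3):
    """Return the n tokens with the largest absolute loading on a component."""
    order = np.argsort(-np.abs(U[:, component]))[:n]
    return [vocab[i] for i in order]


for c in range(2):
    print(f"component {c}: {top_tokens(c)}")
```

On this toy corpus the first component is dominated by the sports-like stems and the second by the crime-like stems, which is the kind of intuitive topic guess described above. Absolute loadings are used because the sign of a singular vector is arbitrary.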

# Search query
Search is based on the study at http://webhome.cs.uvic.ca/~thomo/svd.pdf, in which latent semantic analysis (LSA) is applied to find related documents using not only exact query matches but also deeper relations between documents.
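A minimal sketch of this kind of LSA search is shown here. It assumes the common formulation in which the term-document matrix is truncated to `k` singular values, the query is folded into the reduced space as `q^T U_k S_k^{-1}`, and documents are ranked by cosine similarity. The toy corpus, the value `k = 2`, and the function name `search` are hypothetical and are not taken from the project.

```python
import numpy as np

# Hypothetical toy corpus of already-stemmed tokens (NOT the delfi.lt data).
docs = [
    "moksl premij laureat",
    "moksl premij",
    "moksl laureat",          # does not contain the stem "premij"
    "komand rungtyn pergal",
    "komand rungtyn",
]

vocab = sorted({tok for doc in docs for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocab)}

# Term-document count matrix: one row per token, one column per document.
X = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for tok in doc.split():
        X[index[tok], j] += 1.0

k = 2  # number of latent dimensions kept (an assumption for this sketch)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
docs_k = Vt[:k, :].T  # one k-dimensional row per document


def search(query, top_n=3):
    """Rank documents by cosine similarity to the query in the latent space."""
    q = np.zeros(len(vocab))
    for tok in query.split():
        if tok in index:  # tokens outside the vocabulary are ignored
            q[index[tok]] += 1.0
    if not q.any():
        return []
    q_k = (q @ U_k) / s_k  # fold the query into the reduced space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    sims = (docs_k @ q_k) / norms
    order = np.argsort(-sims)[:top_n]
    return [(int(i), float(sims[i])) for i in order]


for doc_id, score in search("premij"):
    print(doc_id, docs[doc_id], round(score, 3))
```

In this toy setup the query `premij` also retrieves the document `moksl laureat`, which does not contain the query stem but shares co-occurring stems with documents that do. That is the "deeper relation" effect LSA is meant to capture.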
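
## Example
Query = "švietim apdovanojam" (stemmed Lithuanian for "education", "award")

Result:

* ['Imasi mokslininkų algų: siūlo kelti iki 50 proc.'] (Taking on scientists' salaries: proposal to raise them by up to 50 percent)
* ['Įteiktos 6 Mokslo premijos'] (6 Science Prizes awarded)
* ['Lietuvoje į susitikimą kviečia Nobelio premijos laureatas'] (A Nobel Prize laureate invites people to a meeting in Lithuania)
* ['100 tūkst. eurų išdalins populiarinantiems mokslą'] (100 thousand euros to be distributed to those popularizing science)
* ['V. Vaičaitis. Konkursinis mokslo finansavimas ar pasityčiojimas iš mokslininkų?'] (V. Vaičaitis. Competitive science funding or a mockery of scientists?)

# Clustering
In progress.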