SinEn is a Python-based natural language processing toolkit for the Sinhala language in English transliteration.

skyprolk/SinEn-Natural-Language-Tool_Kit

SinEn - Sinhala Language Processing Toolkit

SinEn is an open-source natural language processing toolkit for the Sinhala language, written in Python. It provides several tools for Sinhala language processing, including:

SinEn Bad Words Detector

A machine learning model that detects offensive or inappropriate words in Singlish text (Sinhala written in English letters).

SinEn Converter

A software tool that can convert Sinhala script to English transliteration.

SinEn Stemmer

A program that provides stemming functionality for Singlish words, reducing them to their base or root form.
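As a rough illustration of what stemming means here, the sketch below strips a small set of suffixes from a Singlish word. It is not the SinEn Stemmer's algorithm: the `SUFFIXES` tuple, the `naive_stem` name, and the `min_stem_len` parameter are all hypothetical, and the three suffixes were chosen only because they reproduce part of the example output in the Usage section.

```python
# Hypothetical suffix list for illustration only; SinEn's real rules are not shown in this README.
SUFFIXES = ("da", "ta", "k")


def naive_stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    # Try longer suffixes first so a short suffix does not pre-empt a longer one.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    # No suffix matched: return the word unchanged.
    return word


print([naive_stem(w) for w in ["mata", "rupiyal", "mudalak", "puluwanda"]])
# ['ma', 'rupiyal', 'mudala', 'puluwan']
```

A plain suffix stripper like this cannot handle changes inside a word (for example "denna" to "dena"), which is one reason a real stemmer needs more than a suffix list.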

SinEn Tokenizer

A program that can split Singlish sentences into individual words or tokens.
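To make the idea of tokenization concrete, here is a minimal regex-based sketch. It is an assumption-laden stand-in, not the SinEn Tokenizer itself: the `simple_tokenize` name is hypothetical, and the real tokenizer may filter or normalize tokens differently (the Usage example suggests it drops some words).

```python
import re


def simple_tokenize(sentence):
    """Lowercase the sentence and return each run of ASCII letters or digits as a token."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


print(simple_tokenize("Mata rupiyal 100k wage mudalak denna puluwanda?"))
# ['mata', 'rupiyal', '100k', 'wage', 'mudalak', 'denna', 'puluwanda']
```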

In addition to these tools, the project also includes supporting tools for collecting and processing Sinhala text, including a "Bag of Words Collector" and "Bag of Words Maker".
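This README does not describe how the "Bag of Words" tools work internally. As a general illustration of the bag-of-words idea only, the sketch below counts word occurrences across sentences with the standard library; the `bag_of_words` function is a hypothetical name and is not part of SinEn.

```python
from collections import Counter


def bag_of_words(sentences):
    """Count how often each whitespace-separated, lowercased word occurs."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


bag = bag_of_words(["mata denna", "Mata puluwan"])
print(bag["mata"])  # 2
```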

Installation

To install SinEn, clone the repository and install the required packages using pip:

git clone https://github.com/skyprolk/SinEn-Natural-Language-Tool_Kit.git
pip install art
pip install pygtrie
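
After installing, a quick check such as the following can confirm that both packages are importable. This is an optional sketch, not part of SinEn; `missing_packages` is a hypothetical helper built on the standard library's `importlib.util.find_spec`.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["art", "pygtrie"])
    print("Missing packages:", missing if missing else "none")
```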

Usage

To use the SinEn toolkit, import the necessary modules in your Python code:

# Import the SinEnStemmer class from the sinen_stemmer module
from sinen_stemmer import SinEnStemmer

# Import the SinEnTokenizer class from the sinen_tokenizer module
from sinen_tokenizer import SinEnTokenizer

# Create an instance of the SinEnStemmer class
stemmer = SinEnStemmer()

# Create an instance of the SinEnTokenizer class for splitting the text
tokenizer = SinEnTokenizer()

# Tokenize the text using the SinEnTokenizer object
text = "Mata rupiyal 100k wage mudalak denna puluwanda?"
tokens = tokenizer.tokenize(text)

# Stem each token using the SinEnStemmer object
stemmed_words = []
for word in tokens:
    # Call the stem method and keep the first element of its result, the base form
    stemmed_word = stemmer.stem(word)[0]
    stemmed_words.append(stemmed_word)

# Print the stemmed words to the console
print(stemmed_words)  # Output: ['ma', 'rupiyal', 'mudala', 'dena', 'puluwan']
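
The tokenize-then-stem loop above can be wrapped in a small helper. This is a sketch with a hypothetical name, `stem_sentence`. It assumes only what the example shows: that `tokenize` returns an iterable of tokens and that `stem` returns a sequence whose first element is the stem.

```python
def stem_sentence(sentence, tokenizer, stemmer):
    """Tokenize a sentence and return the first element of each token's stem result.

    Assumes tokenizer.tokenize(sentence) yields tokens and stemmer.stem(token)
    returns an indexable result, as in the usage example.
    """
    return [stemmer.stem(token)[0] for token in tokenizer.tokenize(sentence)]
```

With the objects created earlier, this would be called as `stem_sentence(text, tokenizer, stemmer)`.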

Contributing

We welcome contributions to the SinEn project. If you would like to contribute, please open a pull request with your changes.

License

SinEn is released under the Apache-2.0 license. See LICENSE for more information.

Credits

Developed & Scripted by #KNOiT with Sky Production
