SinEn - Sinhala Language Processing Toolkit

SinEn is an open-source natural language processing toolkit for the Sinhala language, written in Python. It provides the following tools:

SinEn Bad Words Detector

A machine learning model that detects offensive or inappropriate words in Singlish text.

SinEn Converter

A tool that converts Sinhala script to an English transliteration.

SinEn Stemmer

A stemmer for Singlish words that reduces each word to its base or root form.
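
To give a rough idea of what stemming involves, here is a minimal suffix-stripping sketch. It is not SinEn's implementation: the function name `strip_suffix` and the suffix list are hypothetical and were chosen only so that two of the words from the Usage example reduce to the stems shown there. The actual SinEnStemmer rules may work quite differently.

```python
# Hypothetical suffix list, for illustration only; not taken from SinEn.
SUFFIXES = ("da", "k")


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=3):
    """Remove the longest matching suffix, keeping at least min_stem_len characters."""
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No suffix matched, so the word is returned unchanged.
    return word


print(strip_suffix("mudalak"))    # mudala
print(strip_suffix("puluwanda"))  # puluwan
```

The minimum stem length is a common safeguard in rule-based stemmers: it stops short words from being stripped down to almost nothing.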

SinEn Tokenizer

A tokenizer that splits Singlish sentences into individual words or tokens.
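
As a rough illustration of what tokenizing Singlish text involves, the sketch below splits a sentence into word tokens with a regular expression. It is not SinEn's implementation: `simple_tokenize` is a hypothetical helper, and the real SinEnTokenizer may treat punctuation, numbers, and letter case differently.

```python
import re


def simple_tokenize(sentence):
    """Split a sentence into lowercase word tokens (hypothetical helper, not part of SinEn)."""
    # \w+ matches runs of letters, digits, and underscores; punctuation is dropped.
    return re.findall(r"\w+", sentence.lower())


tokens = simple_tokenize("Mata rupiyal 100k wage mudalak denna puluwanda?")
print(tokens)
# ['mata', 'rupiyal', '100k', 'wage', 'mudalak', 'denna', 'puluwanda']
```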

In addition to these tools, the project includes supporting tools for collecting and processing Sinhala text, including a "Bag of Words Collector" and a "Bag of Words Maker".
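
For readers unfamiliar with the term, a bag of words is a count of how often each word occurs in a collection of text. The sketch below shows the general idea using only the Python standard library. The function name `make_bag_of_words` is hypothetical, and this is not how the project's own collector and maker tools are necessarily implemented.

```python
from collections import Counter


def make_bag_of_words(sentences):
    """Count word occurrences across sentences (hypothetical helper, not part of SinEn)."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split on whitespace; a real tool would likely use a proper tokenizer.
        counts.update(sentence.lower().split())
    return counts


bag = make_bag_of_words(["mata denna", "Mata puluwanda"])
print(bag["mata"])          # 2
print(bag.most_common(1))   # [('mata', 2)]
```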

Installation

To install SinEn, clone the repository and install the required packages using pip:

git clone https://github.com/skyprolk/SinEn-Natural-Language-Tool_Kit.git
pip install art
pip install pygtrie

Usage

To use the SinEn toolkit, import the necessary modules in your Python code:

# Import the stemmer and tokenizer classes
from sinen_stemmer import SinEnStemmer
from sinen_tokenizer import SinEnTokenizer

# Create the stemmer and the tokenizer
stemmer = SinEnStemmer()
tokenizer = SinEnTokenizer()

# Split the sentence into tokens
text = "Mata rupiyal 100k wage mudalak denna puluwanda?"
tokens = tokenizer.tokenize(text)

# Reduce each token to its base form
stemmed_words = []
for word in tokens:
    # The first element of stem()'s return value is used as the stem
    stemmed_word = stemmer.stem(word)[0]
    stemmed_words.append(stemmed_word)

# Print the stemmed words to the console
print(stemmed_words)  # Output: ['ma', 'rupiyal', 'mudala', 'dena', 'puluwan']

Contributing

We welcome contributions to the SinEn project. If you would like to contribute, please open a pull request with your changes.

License

SinEn is released under the Apache-2.0 license. See LICENSE for more information.

Credits

Developed & Scripted by #KNOiT with Sky Production
