Marathi Tokenizer is an open-source project for tokenizing Marathi text using Byte Pair Encoding (BPE). Built on Hugging Face's `tokenizers` library, it is tuned to handle the linguistic nuances of Marathi. The tokenizer is well suited to language modeling, machine translation, and other natural language processing (NLP) applications.
Example usage with the trained tokenizer (the repository is gated, so you will need a Hugging Face access token to download the model):
```python
from tokenizers import Tokenizer

hf_token = "<HUGGINGFACE_TOKEN>"
tokenizer: Tokenizer = Tokenizer.from_pretrained("NotShrirang/marathi-tokenizer", token=hf_token)

text = "मराठी भाषा ही भारतातील एक प्रमुख भाषा आहे."
encoded = tokenizer.encode(text)  # returns an Encoding object, not a list of IDs

print("Encoded token IDs:", encoded.ids)
print("Decoded text:", tokenizer.decode(encoded.ids))
```
- 🧠 BPE Encoding: Leverages Byte Pair Encoding to tokenize Marathi text into subword units efficiently.
- 📚 Multi-Dataset Support: Trained on a diverse set of Marathi datasets, including news articles, conversational text, and more.
- 🔍 Custom Vocabulary: Supports configurable vocabulary size and frequency thresholds for token inclusion.
- 📦 Hugging Face Integration: Fully compatible with Hugging Face's `PreTrainedTokenizerFast` for seamless NLP pipeline integration.
- ⚡ Efficient Training: Optimized for fast and scalable training using Hugging Face datasets.
- 🌐 Unicode Support: Handles complex Marathi characters and ligatures seamlessly.
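To see the subword units mentioned in the first feature above, note that the `Encoding` object returned by `encode` exposes the token strings alongside their IDs. A minimal sketch, reusing the gated model from the example earlier (a token is still required):

```python
from tokenizers import Tokenizer

# Same gated repository as in the usage example above.
tokenizer = Tokenizer.from_pretrained("NotShrirang/marathi-tokenizer", token="<HUGGINGFACE_TOKEN>")

encoded = tokenizer.encode("मराठी भाषा")
print(encoded.tokens)  # the BPE subword strings
print(encoded.ids)     # their vocabulary IDs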
- Dataset Preparation:
  - Combines multiple datasets from Hugging Face, including:
    - Marathi news articles
    - Conversational text
    - Instructional datasets
  - Preprocesses datasets to standardize text and remove irrelevant columns (see the sketch below).
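A minimal sketch of this preparation step with the `datasets` library; the dataset identifiers are placeholders, and the assumption that each dataset exposes a `text` column may not match what `tokenizer.py` actually does:

```python
from datasets import concatenate_datasets, load_dataset

# Placeholder identifiers; tokenizer.py defines the real dataset list.
dataset_ids = ["<marathi-news-dataset>", "<marathi-conversations-dataset>"]

parts = []
for dataset_id in dataset_ids:
    ds = load_dataset(dataset_id, split="train")
    # Drop everything except the text column so all parts share one schema.
    ds = ds.remove_columns([c for c in ds.column_names if c != "text"])
    parts.append(ds)

# One combined corpus to train the tokenizer on.
corpus = concatenate_datasets(parts)
```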
- BPE Tokenizer Training:
  - Trains a Byte Pair Encoding tokenizer with configurable parameters, such as vocabulary size and minimum token frequency.
  - Saves the tokenizer in Hugging Face's `PreTrainedTokenizerFast` format for compatibility (sketched below).
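A sketch of such a training run with the `tokenizers` library, using the defaults listed in the configuration section below (vocabulary size `32768`, minimum frequency `2`); the whitespace pre-tokenizer and special tokens are assumptions rather than the project's confirmed settings:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# BPE model with a whitespace pre-tokenizer (assumed setup).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32768,   # default vocabulary size
    min_frequency=2,    # minimum frequency for a token to be kept
    special_tokens=["[UNK]", "[PAD]"],
)

# Toy stand-in for the combined corpus from the preparation step.
corpus = ["मराठी भाषा ही भारतातील एक प्रमुख भाषा आहे."]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap and save in PreTrainedTokenizerFast format.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer, unk_token="[UNK]", pad_token="[PAD]"
)
fast_tokenizer.save_pretrained("marathi_bpe_tokenizer")
```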
- Tokenization Workflow:
  - Custom rules ensure optimal tokenization for Marathi language structure (illustrated below).
  - Provides robust encoding and decoding capabilities.
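The rules themselves live in `tokenizer.py`; as one plausible illustration only (an assumption, not the project's confirmed configuration), NFC normalization plus whitespace pre-tokenization keeps composed Devanagari characters stable before BPE merges apply:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# NFC keeps Devanagari characters in composed form, so ligatures and
# conjuncts are not split into inconsistent codepoint sequences.
tokenizer.normalizer = normalizers.NFC()

# Split on whitespace; BPE merges then operate within each word.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```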
- Output:
  - Tokenized text as subword units.
  - Saved tokenizer files for integration into NLP pipelines (a loading example follows).
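Once saved, the files can be loaded straight back into a `transformers` pipeline; a minimal sketch, assuming the default `marathi_bpe_tokenizer` output directory:

```python
from transformers import PreTrainedTokenizerFast

# Load the locally saved tokenizer files.
tokenizer = PreTrainedTokenizerFast.from_pretrained("marathi_bpe_tokenizer")

ids = tokenizer("मराठी भाषा").input_ids
print(ids)                    # token IDs ready for a model
print(tokenizer.decode(ids))  # round-trip back to text
```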
- Clone the repository:

  ```sh
  git clone https://github.com/NotShrirang/marathi-tokenizer.git
  cd marathi-tokenizer
  ```
- Install dependencies:

  ```sh
  pip install -r requirements.txt
  ```
- Training the Tokenizer:

  Use the provided script `tokenizer.py` to train the BPE tokenizer on the predefined datasets:

  ```sh
  python tokenizer.py
  ```

  The tokenizer will be saved in the `marathi_bpe_tokenizer` directory.
- Sample Encoding:

  A sample usage script, `sample.py`, is included for testing the tokenizer:

  ```sh
  python sample.py
  ```
- Vocabulary Size: Default is `32768`.
- Minimum Frequency: Default is `2`, configurable for token inclusion.
- Datasets: Combines multiple Marathi datasets for comprehensive coverage.
Contributions are welcome! Feel free to open an issue or submit a pull request for feature requests, bug fixes, or improvements.
This project is licensed under the MIT License. See the LICENSE file for details.