Marathi Tokenizer is an open-source project for tokenizing Marathi text using Byte Pair Encoding (BPE). Built on Hugging Face's `tokenizers` library, it is tuned to handle the linguistic nuances of Marathi. The tokenizer is well suited to language modeling, machine translation, and other natural language processing (NLP) applications.
Example usage with the trained tokenizer (the repository is gated, so you will need a Hugging Face access token to download the model):
```python
from tokenizers import Tokenizer

hf_token = "<HUGGINGFACE_TOKEN>"
tokenizer: Tokenizer = Tokenizer.from_pretrained("NotShrirang/marathi-tokenizer", token=hf_token)

text = "मराठी भाषा ही भारतातील एक प्रमुख भाषा आहे."
encoded = tokenizer.encode(text)  # returns an Encoding object, not a list of IDs

print("Encoded token IDs:", encoded.ids)
print("Decoded text:", tokenizer.decode(encoded.ids))
```
- 🧠 BPE Encoding: Leverages Byte Pair Encoding to tokenize Marathi text into subword units efficiently.
- 📚 Multi-Dataset Support: Trained on a diverse set of Marathi datasets, including news articles, conversational text, and more.
- 🔍 Custom Vocabulary: Supports configurable vocabulary size and frequency thresholds for token inclusion.
- 📦 Hugging Face Integration: Fully compatible with Hugging Face's `PreTrainedTokenizerFast` for seamless NLP pipeline integration.
- ⚡ Efficient Training: Optimized for fast and scalable training using Hugging Face datasets.
- 🌐 Unicode Support: Handles complex Marathi characters and ligatures seamlessly.
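To see the subword units mentioned in the first feature above, note that the `Encoding` object returned by `encode` exposes the token strings alongside their IDs. A minimal sketch, reusing the gated model from the example earlier (a token is still required):

```python
from tokenizers import Tokenizer

# Same gated repository as in the usage example above.
tokenizer = Tokenizer.from_pretrained("NotShrirang/marathi-tokenizer", token="<HUGGINGFACE_TOKEN>")

encoded = tokenizer.encode("मराठी भाषा")
print(encoded.tokens)  # the BPE subword strings
print(encoded.ids)     # their vocabulary IDs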
- Dataset Preparation:
  - Combines multiple datasets from Hugging Face, including:
    - Marathi news articles
    - Conversational text
    - Instructional datasets
  - Preprocesses datasets to standardize text and remove irrelevant columns (see the sketch below).
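A minimal sketch of this preparation step with the `datasets` library; the dataset identifiers are placeholders, and the assumption that each dataset exposes a `text` column may not match what `tokenizer.py` actually does:

```python
from datasets import concatenate_datasets, load_dataset

# Placeholder identifiers; tokenizer.py defines the real dataset list.
dataset_ids = ["<marathi-news-dataset>", "<marathi-conversations-dataset>"]

parts = []
for dataset_id in dataset_ids:
    ds = load_dataset(dataset_id, split="train")
    # Drop everything except the text column so all parts share one schema.
    ds = ds.remove_columns([c for c in ds.column_names if c != "text"])
    parts.append(ds)

# One combined corpus to train the tokenizer on.
corpus = concatenate_datasets(parts)
```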
- BPE Tokenizer Training:
  - Trains a Byte Pair Encoding tokenizer with configurable parameters, such as vocabulary size and minimum token frequency.
  - Saves the tokenizer in Hugging Face's `PreTrainedTokenizerFast` format for compatibility (sketched below).
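A sketch of such a training run with the `tokenizers` library, using the defaults listed in the configuration section below (vocabulary size `32768`, minimum frequency `2`); the whitespace pre-tokenizer and special tokens are assumptions rather than the project's confirmed settings:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# BPE model with a whitespace pre-tokenizer (assumed setup).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32768,   # default vocabulary size
    min_frequency=2,    # minimum frequency for a token to be kept
    special_tokens=["[UNK]", "[PAD]"],
)

# Toy stand-in for the combined corpus from the preparation step.
corpus = ["मराठी भाषा ही भारतातील एक प्रमुख भाषा आहे."]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap and save in PreTrainedTokenizerFast format.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer, unk_token="[UNK]", pad_token="[PAD]"
)
fast_tokenizer.save_pretrained("marathi_bpe_tokenizer")
```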
- Tokenization Workflow:
  - Custom rules ensure optimal tokenization for Marathi language structure (illustrated below).
  - Provides robust encoding and decoding capabilities.
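The rules themselves live in `tokenizer.py`; as one plausible illustration only (an assumption, not the project's confirmed configuration), NFC normalization plus whitespace pre-tokenization keeps composed Devanagari characters stable before BPE merges apply:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# NFC keeps Devanagari characters in composed form, so ligatures and
# conjuncts are not split into inconsistent codepoint sequences.
tokenizer.normalizer = normalizers.NFC()

# Split on whitespace; BPE merges then operate within each word.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```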
- Output:
  - Tokenized text as subword units.
  - Saved tokenizer files for integration into NLP pipelines (a loading example follows).
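Once saved, the files can be loaded straight back into a `transformers` pipeline; a minimal sketch, assuming the default `marathi_bpe_tokenizer` output directory:

```python
from transformers import PreTrainedTokenizerFast

# Load the locally saved tokenizer files.
tokenizer = PreTrainedTokenizerFast.from_pretrained("marathi_bpe_tokenizer")

ids = tokenizer("मराठी भाषा").input_ids
print(ids)                    # token IDs ready for a model
print(tokenizer.decode(ids))  # round-trip back to text
```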
- Clone the repository:

  ```sh
  git clone https://github.com/NotShrirang/marathi-tokenizer.git
  cd marathi-tokenizer
  ```
- Install dependencies:

  ```sh
  pip install -r requirements.txt
  ```
- Training the Tokenizer:

  Use the provided script `tokenizer.py` to train the BPE tokenizer on the predefined datasets:

  ```sh
  python tokenizer.py
  ```

  The tokenizer will be saved in the `marathi_bpe_tokenizer` directory.
- Sample Encoding:

  A sample usage script, `sample.py`, is included for testing the tokenizer:

  ```sh
  python sample.py
  ```
- Vocabulary Size: Default is `32768`.
- Minimum Frequency: Default is `2`, configurable for token inclusion.
- Datasets: Combines multiple Marathi datasets for comprehensive coverage.
Contributions are welcome! Feel free to open an issue or submit a pull request for feature requests, bug fixes, or improvements.
This project is licensed under the MIT License. See the LICENSE file for details.