alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go, C++ & Javascript

tokenizer vocabulary vocabulary-builder tokenize tokenization tokenisation tokenizing text-tokenization vocabulary-generator

Updated Jul 10, 2026
Go

twardoch / split-markdown4gpt

A Python tool for splitting large Markdown files into smaller sections based on a specified token limit. This is particularly useful for processing large Markdown files with GPT models, as it allows the models to handle the data in manageable chunks.

Updated Jun 29, 2026
Python

Mecanik / Modern-Text-Tokenizer

Modern UTF-8 aware C++ tokenizer with vocabulary support, ideal for NLP and transformer models. Header-only and zero-dependency.

nlp machine-learning natural-language-processing ai deep-learning high-performance tokenizer modern-cpp vocabulary text-analysis artificial-intelligence transformer header-only text-processing preprocessing bert text-encoding distilbert text-tokenization

Updated Aug 7, 2025
C++

SayamAlt / Resume-Classification-using-fine-tuned-BERT

Successfully developed a resume classification model which can accurately classify the resume of any person into its corresponding job with a tremendously high accuracy of more than 99%.

nlp exploratory-data-analysis word-embeddings model-evaluation text-preprocessing bert-model text-tokenization fine-tuning-bert

Updated Jan 13, 2023
Jupyter Notebook

hikmatazimzade / azerbaijani-tokenizer

High-Performance Azerbaijani Tokenizers (30% fewer tokens, 40% faster than multilingual alternatives)

nlp ai tokenizer transformers tokenization unigram bpe huggingface azerbaijani text-tokenization azerbaijani-language azerbaijani-nlp

Updated Aug 19, 2025
Jupyter Notebook

SayamAlt / Symptoms-Disease-Text-Classification

Successfully developed a fine-tuned BERT transformer model which can accurately classify symptoms into their corresponding diseases with an accuracy of up to 89%.

natural-language-processing text-classification exploratory-data-analysis multiclass-classification text-preprocessing text-tokenization bert-fine-tuning hugging-face-transformers fine-tune-bert-tensorflow model-inference model-architecture-and-implementation model-training-and-evaluation data-exploration-and-preprocessing

Updated May 6, 2024
Jupyter Notebook

markiskorova / Machine-Learning-NLP-Predict-Author

🧠 Machine Learning & Natural Language Processing: Predict the author of literary text snippets. Built with TensorFlow and Keras, this project trains an LSTM model on classic literature to identify writing style and authorship.

python machine-learning natural-language-processing tensorflow keras text-vectorization text-tokenization

Updated May 25, 2025
Python

SayamAlt / Cyberbullying-Classification-using-fine-tuned-DistilBERT

Successfully fine-tuned a pretrained DistilBERT transformer model that can classify social media text data into one of 4 cyberbullying labels i.e. ethnicity/race, gender/sexual, religion and not cyberbullying with a remarkable accuracy of 99%.

natural-language-processing text-classification exploratory-data-analysis data-exploration multiclass-classification cyberbullying-detection text-preprocessing text-tokenization distilbert-model llm fine-tune-bert-tensorflow model-inference model-training-and-evaluation

Updated Jun 10, 2024
Jupyter Notebook

SayamAlt / Financial-News-Sentiment-Analysis

Successfully developed a fine-tuned DistilBERT transformer model which can predict the overall sentiment of a piece of financial news with an accuracy of nearly 81.5%.

natural-language-processing sentiment-analysis multiclass-classification text-preprocessing text-tokenization distilbert-model hugging-face-transformers fine-tune-bert-tensorflow model-inference model-architecture-and-implementation model-training-and-evaluation data-exploration-and-preprocessing

Updated May 6, 2024
Jupyter Notebook

adilrasheed139 / AI-Powered-Resume-Screening-using-BERT

Successfully developed a resume classification model which can accurately classify the resume of any person into its corresponding job with a tremendously high accuracy of more than 99%.

nlp deep-learning word-embeddings nlp-machine-learning model-evaluation text-preprocessing bert-model text-tokenization fine-tuning-bert exploratory-data-analysis-eda word-embeddings-for-nlp

Updated Dec 14, 2024
Jupyter Notebook

SayamAlt / Fake-News-Classification-using-fine-tuned-BERT

Successfully developed a text classification model to predict whether a given news text is fake or not by fine-tuning a pretrained BERT transformer model imported from Hugging Face.

deep-learning text-classification data-visualization data-analysis model-evaluation text-preprocessing bert-model bert-embeddings text-tokenization wordcloud-visualization fine-tuning-bert tokenizer-nlp model-training-and-evaluation

Updated Dec 10, 2024
Jupyter Notebook

Software-Research-Lab / dropsuit-tok

The tok function is a JavaScript and Node.js function that processes object instances and tokenizes text arrays. It returns the number of tokenized words, an array of the tokenized words, and the tokenized words concatenated into a string. It's part of the open-source DropSuit NLP library under the Apache License 2.0.

text-analysis text-processing language-understanding text-tokenization

Updated May 1, 2023
JavaScript

katanabana / Nihotip

Nihotip is a web app that lets users explore Japanese text through interactive tokenization and detailed insights. Built with React and Python, it offers a dynamic way to analyze words and symbols with tooltips for deeper understanding.

react tooltips python nlp language japanese text-analysis webapp japanese-language mecab tokenization japanese-characters wanakana text-tokenization japanese-learning sudachipy jmdictfurigana

Updated Sep 26, 2024
JavaScript

victoryosiobe / kingchop

Kingchop ⚔️ is a JavaScript library for tokenizing (chopping) English text. It applies an extensive set of tokenization rules, which you can adjust easily.

nodejs javascript natural-language-processing text-processing sentence-tokenizer text-tokenization word-tokenizer tokenizers paragraph-tokenizer

Updated Jul 7, 2026
JavaScript

JustVugg / Biptoken

Fast BPE tokenizer with guaranteed perfect text reconstruction. 2x faster than tiktoken at encoding/decoding while preserving all whitespace and formatting.

python nlp open-source tokenizer python-library text-tokenization bpe-tokenizer fast-tokenizer perfect-reconstruction whitespace-preservation llm-tooling

Updated Jun 13, 2025
Python

SayamAlt / News-Category-Classification

Successfully developed a news category classification model using fine-tuned BERT which can accurately classify any news text into its respective category i.e. Politics, Business, Technology and Entertainment.

nlp text-classification exploratory-data-analysis feature-engineering model-evaluation text-cleaning text-preprocessing bert-embeddings text-tokenization fine-tuning-bert

Updated Jan 17, 2023
Jupyter Notebook

SayamAlt / Mental-Health-Classification-using-fine-tuned-DistilBERT

Successfully established a multiclass text classification model by fine-tuning a pretrained DistilBERT transformer model to classify several distinct types of mental health statuses such as anxiety, stress, personality disorder, etc. with an accuracy of 77%.

natural-language-processing deep-learning text-classification data-visualization model-evaluation text-preprocessing text-tokenization multiclass-text-classification distilbert-model model-inference model-training-and-evaluation distilbert-fine-tuning

Updated Jan 6, 2025
Jupyter Notebook

LokeshKenche / ISP_ChatBot

ISPY is a chatbot designed for ISP customer service, providing automated responses and assistance for various queries such as connection issues, payments, and service requests. Built using Python with libraries like nltk and newspaper3k, it simulates conversation and handles customer interactions effectively.

machine-learning chatbot nltk cosine-similarity webscraping nlp-machine-learning textanalysis customer-services newspaper3k text-tokenization text-based-chatbot article-parsing

Updated Apr 14, 2024
Jupyter Notebook

SayamAlt / Global-News-Headlines-Text-Summarization

Successfully established a text summarization model using Seq2Seq modeling with Luong Attention, which can give a short and concise summary of the global news headlines.

natural-language-processing text-generation text-summarization attention-mechanism seq2seq-model luong-attention text-tokenization model-inference model-architecture-and-implementation data-exploration-and-preprocessing

Updated May 6, 2024
Jupyter Notebook

Yeasir-Hossain / keycloak-authentication

A full-stack text analysis application with Keycloak authentication, implementing word tokenization, sentence boundary detection, and longest word extraction algorithms with Redis caching for performance optimization.

react nodejs docker redis caching jwt express typescript mongodb keycloak authentication text-analysis containerization mern-stack sentence-boundary-detection longest-word text-tokenization paragraph-detection word-frequency-analysis

Updated May 17, 2025
TypeScript

text-tokenization

Here are 26 public repositories matching this topic...
