Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
A fast Python BPE (Byte Pair Encoding) tokenizer, implemented in Rust.
A GPT implementation built from scratch.
LLM inference engine built from scratch in C++. No PyTorch, no frameworks.
GPT-style language model with Byte Pair Encoding tokenizer, built from scratch in PyTorch.
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Teaching transformer-based architectures
BPE tokenizer for LLMs in Pure Zig
High-performance tokenizer implementation in PHP.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET.
Implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++. The tokenizer learns merges from raw text and supports UTF-8 encoding/decoding.
R-BPE: Improving BPE-Tokenizers with Token Reuse
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train new tokenizer to build tree models for score prediction.
🐍This is a fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
A parallel, minimal implementation of Byte Pair Encoding (BPE) from scratch in under 200 lines of Python.
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
High-performance GPT-2-style BPE tokenizer in C++ with parallel batch encoding, thread-pool execution, thread-local caching, and benchmark-driven comparison against tiktoken and GPT2TokenizerFast
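Most of the repositories above implement the same core training loop: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. A minimal illustrative sketch in Python (the function names `learn_bpe`, `get_pair_counts`, and `merge_pair` are hypothetical, and whitespace pre-tokenization is assumed; real implementations add byte-level fallback, regex pre-tokenization, and caching):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, words):
    """Rewrite every word, fusing each occurrence of the pair into one symbol."""
    new_words = {}
    for word, freq in words.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words[" ".join(out)] = freq
    return new_words

def learn_bpe(corpus, num_merges):
    """Learn an ordered merge list from a whitespace-tokenized corpus."""
    # Represent each word as space-separated characters, keyed by frequency.
    words = Counter(" ".join(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)
        words = merge_pair(best, words)
        merges.append(best)
    return merges

# Example: on "low low low lower lowest", the first merges fuse
# the frequent prefix: ("l", "o"), then ("lo", "w"), then ("low", "e").
print(learn_bpe("low low low lower lowest", 3))
```

At inference time, the learned merge list is replayed in order over a new word's characters, which is why merge order (not just the final vocabulary) must be stored.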