A clean, educational implementation of the Byte Pair Encoding algorithm used in modern language models like GPT.
This implementation demonstrates how BPE tokenization works from scratch, including:
- Training a vocabulary from text data
- Encoding text into token IDs
- Decoding token IDs back to text
- Saving/Loading trained vocabularies
The only external dependencies are the HuggingFace `datasets` library and `tqdm`:

```bash
pip install datasets tqdm
```

Example usage:

```python
from bpe_tokenizer import Tokenizer

# Load pre-trained vocabulary and merges
with open("data/vocab.txt", "r") as f:
    vocab = [line.strip() for line in f]
with open("data/merges.txt", "r") as f:
    merges = [tuple(line.strip().split()) for line in f]

# Initialize tokenizer
tokenizer = Tokenizer(vocab, merges)

# Encode text
text = "Hello world"
token_ids = tokenizer.tokenize(text)
print(f"Tokens: {token_ids}")

# Decode back
decoded = tokenizer.decode(token_ids)
print(f"Decoded: {decoded}")
```

Training works as follows. Start with individual characters as the base vocabulary:
```python
vocab = ['a', 'b', 'c', ..., '</w>']  # </w> marks word boundaries
```

The algorithm repeatedly:
- Counts all adjacent token pairs and finds the most frequent one
- Merges that pair into a new token
- Adds the new token to the vocabulary
Example progression:

```
"the" → ['t', 'h', 'e', '</w>']
      → ['th', 'e', '</w>']    # Merge 't' + 'h'
      → ['the', '</w>']        # Merge 'th' + 'e'
      → ['the</w>']            # Merge 'the' + '</w>'
```
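At encoding time, new text is tokenized by replaying the saved merges in the order they were learned. A minimal sketch of that replay (the `encode_word` helper and the example `merges` list are illustrative, not part of the repository's API):

```python
def encode_word(word, merges):
    """Apply learned merges, in training order, to a single word."""
    tokens = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return tokens

# Using the merges learned in the progression above:
merges = [('t', 'h'), ('th', 'e'), ('the', '</w>')]
print(encode_word("the", merges))  # ['the</w>']
```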
The training loop itself:

```python
def train_bpe(words, vocab_size):
    # Start with character-level tokens plus the end-of-word marker
    word_tokens = [list(word) + ['</w>'] for word in words]

    # Base vocabulary: every unique symbol seen in the corpus
    vocab = sorted({tok for tokens in word_tokens for tok in tokens})
    merges = []

    # Merge until reaching the target vocab size
    while len(vocab) < vocab_size:
        # Find the most frequent adjacent pair
        pair_counts = count_pairs(word_tokens)
        best_pair = max(pair_counts, key=pair_counts.get)

        # Merge the pair in all words and record the operation
        word_tokens = merge_pair(word_tokens, best_pair)
        merges.append(best_pair)
        vocab.append(best_pair[0] + best_pair[1])

    return vocab, merges
```

Key design points:
- Subword units balance vocabulary size and coverage
- Handles out-of-vocabulary words by falling back to characters
- Preserves word boundaries with the `</w>` marker
- Stores only vocabulary and merge operations
- No need to save full token-to-ID mappings
- Deterministic merge order ensures consistent results
- Saved merge history allows exact reconstruction
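The `count_pairs` and `merge_pair` helpers used by the training loop are small; one possible sketch of what they do (illustrative, not necessarily the repository's exact implementation):

```python
from collections import Counter

def count_pairs(word_tokens):
    """Count adjacent token pairs across all words."""
    counts = Counter()
    for tokens in word_tokens:
        for pair in zip(tokens, tokens[1:]):
            counts[pair] += 1
    return counts

def merge_pair(word_tokens, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    a, b = pair
    merged_word_tokens = []
    for tokens in word_tokens:
        new_tokens, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                new_tokens.append(a + b)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        merged_word_tokens.append(new_tokens)
    return merged_word_tokens
```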
Project layout:

```
bpe-tokenizer/
├── bpe_tokenizer.py    # Main tokenizer implementation
├── README.md           # This file
└── data/
    ├── vocab.txt       # Trained vocabulary
    └── merges.txt      # Merge operations history
```
To train a vocabulary on your own dataset:

```python
from datasets import load_dataset

# Load your dataset
dataset = load_dataset("your-dataset")
text = " ".join(dataset["train"]["text"])

# Prepare text (train_bpe adds the </w> marker and splits words into characters)
words = [word.lower() for word in text.split()]

# Train BPE
vocab_size = 4000
vocab, merges = train_bpe(words, vocab_size)

# Save vocabulary and merges
save_vocab("data/vocab.txt", vocab)
save_merges("data/merges.txt", merges)
```

Key parameters to adjust:
| Parameter | Description | Typical Range |
|---|---|---|
| `vocab_size` | Target vocabulary size | 1,000 - 50,000 |
| `fraction` | Dataset fraction to use | 0.1 - 1.0 |
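The `fraction` parameter does not appear in the training snippet above; one plausible way to apply it is to subsample the training split before joining the text (an assumption about the script's behavior, using the `datasets` library's `Dataset.select` method):

```python
fraction = 0.1  # use 10% of the training split

train = dataset["train"]
subset = train.select(range(int(len(train) * fraction)))  # first N rows
text = " ".join(subset["text"])
```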
Vocabulary size trade-offs:
- Small (1K-5K): Faster training, more tokens per word
- Medium (10K-20K): Balanced performance
- Large (30K-50K): Fewer tokens, slower training
Training Time:
- 4K vocabulary: ~15 minutes
- 10K vocabulary: ~45 minutes
- 30K vocabulary: ~3 hours
Memory Usage:
- Vocabulary: ~50KB per 1K tokens
- Merge history: ~100KB per 1K merges
Example `vocab.txt` (one token per line):

```
!
"
a
b
...
th
the
the</w>
```

Example `merges.txt` (the two tokens of each merge, separated by a space):

```
t h
th e
the </w>
...
```

Each line of `merges.txt` represents a merge operation applied during training, in the order the merges were learned.
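Writing these files takes only a few lines; a sketch of what `save_vocab` and `save_merges` might look like (illustrative, matching the file formats above rather than the repository's exact code):

```python
def save_vocab(path, vocab):
    # One token per line
    with open(path, "w") as f:
        f.write("\n".join(vocab) + "\n")

def save_merges(path, merges):
    # The two tokens of each merge, space-separated, in training order
    with open(path, "w") as f:
        for a, b in merges:
            f.write(f"{a} {b}\n")
```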
Common issues:
- Words split into too many tokens: increase the vocabulary size or train on more data.
- Unknown characters at encoding time: ensure all characters in the test data also appeared in the training data.
- Slow encoding: precompute frequently used merges or use caching.
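For the caching suggestion, one simple approach is to memoize per-word encoding, since the same words recur constantly in natural text. A sketch using `functools.lru_cache` (the `encode_word_cached` function is illustrative, not part of the repository's API):

```python
from functools import lru_cache

# Load the merge history once and freeze it so the cached function has no mutable state
with open("data/merges.txt", "r") as f:
    MERGES = tuple(tuple(line.split()) for line in f if line.strip())

@lru_cache(maxsize=100_000)
def encode_word_cached(word):
    """Encode a single word; repeated words are served from the cache."""
    tokens = list(word) + ['</w>']
    for a, b in MERGES:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]
            else:
                i += 1
    return tuple(tokens)
```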
- Neural Machine Translation of Rare Words with Subword Units - Original BPE paper
- HuggingFace Tokenizers Documentation
- OpenAI GPT-2 BPE Implementation
MIT License - Feel free to use for educational or commercial purposes
Contributions welcome! Areas for improvement:
- Add vocabulary size optimization
- Implement parallel processing
- Add more encoding/decoding options
- Create visualization tools
Based on the Byte Pair Encoding algorithm introduced by Sennrich et al. (2016) for neural machine translation.