minbpe-pytorch

Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.
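To make the mechanics concrete, here is a minimal sketch of BPE training: start from the 256 raw byte values, repeatedly count adjacent token pairs, and merge the most frequent pair into a new token until the vocabulary is full. This is illustrative code for the algorithm in general, not this repo's implementation.

    from collections import Counter

    def train_bpe(text, vocab_size):
        ids = list(text.encode("utf-8"))        # byte-level: start from UTF-8 bytes (0..255)
        merges = {}                             # (left, right) -> new token id
        for new_id in range(256, vocab_size):
            pairs = Counter(zip(ids, ids[1:]))  # count adjacent pairs
            if not pairs:
                break
            top = max(pairs, key=pairs.get)     # most frequent pair gets merged
            merges[top] = new_id
            out, i = [], 0
            while i < len(ids):                 # replace each occurrence of the pair
                if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top:
                    out.append(new_id)
                    i += 2
                else:
                    out.append(ids[i])
                    i += 1
            ids = out
        return merges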

This repo adds PyTorch/CUDA training and encoding support to Andrej Karpathy's minbpe. Training the BasicTokenizer with a vocab_size of 512 on 308MB of Enron emails takes 67.4 seconds on an H100 SXM5 GPU, versus 2 hours 15 minutes for the original code on an M2 Air with Python 3.11, a 120x speedup.
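The speedup presumably comes from vectorizing the pair counting on the GPU. The sketch below shows one way to find the most frequent adjacent pair with torch.unique (which the todo list notes this code relies on); the function name and layout are illustrative, not the repo's exact internals.

    import torch

    def most_frequent_pair(ids):
        # ids: 1-D tensor of token ids, e.g. on a CUDA device
        pairs = torch.stack((ids[:-1], ids[1:]), dim=1)  # adjacent pairs, shape (N-1, 2)
        uniq, counts = torch.unique(pairs, dim=0, return_counts=True)
        return uniq[counts.argmax()]                     # the next pair to merge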

quick start

Install requirements:

$ pip install -r requirements.txt

Download the Enron emails and save them as a 308MB text file at tests/enron.txt:

$ python get_enron_emails.py

Train a BasicTokenizer on the large text file:

$ python train.py

The model will be saved in the models directory.
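Training can also be driven from Python. The snippet below assumes this port keeps minbpe's interface (a BasicTokenizer with train and save methods); the module path and output names are assumptions, so check train.py for the real entry point.

    from minbpe import BasicTokenizer

    text = open("tests/enron.txt", encoding="utf-8").read()
    tokenizer = BasicTokenizer()
    tokenizer.train(text, vocab_size=512)  # 256 byte tokens + 256 learned merges
    tokenizer.save("models/basic")         # writes the model file(s) under models/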

tests

The pytest library is used for tests. All of them are located in the tests/ directory. First pip install pytest, then:

$ pytest -v .

todo

  • Speed up the encode method for RegexTokenizer
  • Implement the train method for RegexTokenizer
  • Support the MPS device on MacBooks; training currently breaks on torch.unique (see the device guard sketched after this list)
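Until MPS support lands, a guard like the one below keeps training on a working device. Falling back to CPU is an assumed workaround, not what train.py currently does.

    import torch

    # torch.unique (used during training) currently breaks on MPS,
    # so prefer CUDA and otherwise fall back to CPU rather than MPS
    device = "cuda" if torch.cuda.is_available() else "cpu"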

License

MIT
