Tokenizer Chopper is an implementation of a text tokenizer and detokenizer using Byte Pair Encoding (BPE) for modern LLM systems.

Tokenizer Chopper

Tokenizer Chopper is a simple C++ implementation of a text tokenizer and detokenizer based on Byte Pair Encoding (BPE). It tokenizes input text into subword units and then detokenizes those units back to the original text. The project is built with Clang and developed in Visual Studio Code.

Features

  • Tokenization: Splits the input text into token units based on BPE.
  • Detokenization: Reconstructs the original text from the tokenized form.
  • Customizable Tokenization: The number of merges for the BPE algorithm can be adjusted for finer control over tokenization.
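To illustrate how these three features fit together, here is a minimal sketch of byte-level BPE in C++. It is not taken from this repository's `bpe.cpp`: the names `BpeModel`, `train`, `tokenize`, `detokenize`, and the `num_merges` parameter are hypothetical, and the sketch assumes merges are learned from a training string and then replayed in the same order at tokenization time.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model layout (an assumption, not the repository's types).
struct BpeModel {
    // vocab[id] is the byte string that token `id` expands to.
    std::vector<std::string> vocab;
    // merges[i] is the pair of token ids that was merged into id 256 + i.
    std::vector<std::pair<int, int>> merges;
};

// Start from raw bytes: every byte value 0..255 is its own token id.
std::vector<int> to_bytes(const std::string& text) {
    std::vector<int> ids;
    for (unsigned char c : text) ids.push_back(c);
    return ids;
}

// Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`.
std::vector<int> merge_pair(const std::vector<int>& ids,
                            const std::pair<int, int>& pair, int new_id) {
    std::vector<int> out;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i + 1 < ids.size() && ids[i] == pair.first && ids[i + 1] == pair.second) {
            out.push_back(new_id);
            ++i;  // skip the second half of the merged pair
        } else {
            out.push_back(ids[i]);
        }
    }
    return out;
}

// Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair.
BpeModel train(const std::string& text, int num_merges) {
    BpeModel model;
    for (int b = 0; b < 256; ++b) model.vocab.push_back(std::string(1, static_cast<char>(b)));
    std::vector<int> ids = to_bytes(text);
    for (int m = 0; m < num_merges; ++m) {
        std::map<std::pair<int, int>, int> counts;
        for (std::size_t i = 0; i + 1 < ids.size(); ++i) ++counts[std::make_pair(ids[i], ids[i + 1])];
        std::pair<int, int> best(0, 0);
        int best_count = 0;
        for (const auto& entry : counts) {
            if (entry.second > best_count) { best = entry.first; best_count = entry.second; }
        }
        if (best_count < 2) break;  // no pair repeats, so further merges gain nothing
        int new_id = static_cast<int>(model.vocab.size());
        model.vocab.push_back(model.vocab[best.first] + model.vocab[best.second]);
        model.merges.push_back(best);
        ids = merge_pair(ids, best, new_id);
    }
    return model;
}

// Tokenization: replay the learned merges in the order they were learned.
std::vector<int> tokenize(const BpeModel& model, const std::string& text) {
    std::vector<int> ids = to_bytes(text);
    for (std::size_t i = 0; i < model.merges.size(); ++i)
        ids = merge_pair(ids, model.merges[i], 256 + static_cast<int>(i));
    return ids;
}

// Detokenization: concatenate the byte string behind each token id.
std::string detokenize(const BpeModel& model, const std::vector<int>& ids) {
    std::string text;
    for (int id : ids) {
        assert(id >= 0 && static_cast<std::size_t>(id) < model.vocab.size());
        text += model.vocab[id];
    }
    return text;
}
```

In this sketch, a larger `num_merges` yields longer subword units and therefore fewer tokens per input, while `num_merges = 0` leaves the text as individual bytes. Because every token expands to an exact byte string, `detokenize(model, tokenize(model, text))` returns the original `text`.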

Usage

  1. Clone the repository:

    git clone https://github.com/gnatykdm/tokenizer-chopper.git
    cd tokenizer-chopper/src
  2. Build the project using Clang:

    clang++ main.cpp bpe.cpp -o tokenizer
  3. Run the tokenizer:

    ./tokenizer
