# tokenizers-cpp

This project provides a cross-platform C++ tokenizer binding library that can be universally deployed.
It wraps and binds the [HuggingFace tokenizers library](https://github.com/huggingface/tokenizers)
and [sentencepiece](https://github.com/google/sentencepiece), and provides a minimal common interface in C++.

The main goal of the project is to enable tokenizer deployment for language model applications
on native platforms with minimal dependencies, and to remove some of the barriers of
cross-language bindings. This project is developed in part with, and
used in, [MLC LLM](https://github.com/mlc-ai/mlc-llm). We have tested the following platforms:

- iOS
- Android
- Windows
- Linux
- Web browsers

## Getting Started

The easiest way to use this project is to add it as a submodule and
include it via `add_subdirectory` in your CMake project.
You also need to enable C++17 support.

- First, make sure you have Rust installed.
- If you are cross-compiling, install the necessary Rust target.
  For example, run `rustup target add aarch64-apple-ios` to install the iOS target.
- You can then link the library.

See the [example](example) folder for an example CMake project.

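As a rough sketch, a minimal `CMakeLists.txt` following the steps above might look like the following. The submodule path `3rdparty/tokenizers-cpp`, the executable name, and the include directory are illustrative assumptions; only the `tokenizers_cpp` link target comes from this project.

```cmake
cmake_minimum_required(VERSION 3.18)
project(my_app CXX)

# The library requires C++17.
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Assumes tokenizers-cpp was added as a submodule at 3rdparty/tokenizers-cpp.
add_subdirectory(3rdparty/tokenizers-cpp)

add_executable(my_app main.cc)
# Include path shown here is an assumption; adjust to where the headers live.
target_include_directories(my_app PRIVATE 3rdparty/tokenizers-cpp/include)
# Linking tokenizers_cpp pulls in the other generated static libraries.
target_link_libraries(my_app PRIVATE tokenizers_cpp)
```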

### Example Code

```c++
// Assumes the tokenizer blob is stored at dist/tokenizer.json
void HuggingFaceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.json");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility in how these blobs can be read.
  auto tok = Tokenizer::FromBlobJSON(blob);
  std::string prompt = "What is the capital of Canada?";
  // Call Encode to turn the prompt into token ids.
  std::vector<int> ids = tok->Encode(prompt);
  // Call Decode to turn the ids back into a string.
  std::string decoded_prompt = tok->Decode(ids);
}

// Assumes the tokenizer blob is stored at dist/tokenizer.model
void SentencePieceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.model");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility in how these blobs can be read.
  auto tok = Tokenizer::FromBlobSentencePiece(blob);
  std::string prompt = "What is the capital of Canada?";
  // Call Encode to turn the prompt into token ids.
  std::vector<int> ids = tok->Encode(prompt);
  // Call Decode to turn the ids back into a string.
  std::string decoded_prompt = tok->Decode(ids);
}
```

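The examples above rely on a `LoadBytesFromFile` helper that reads a whole file into a string. It is not part of the library's API, so you need to supply your own; a minimal sketch might look like this:

```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper (not part of the tokenizers-cpp API): load the raw
// bytes of a file into a std::string so it can be passed to the factory APIs.
std::string LoadBytesFromFile(const std::string& path) {
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  if (fs.fail()) {
    std::cerr << "Cannot open " << path << std::endl;
    std::exit(1);
  }
  // Determine the file size, then read everything in one call.
  fs.seekg(0, std::ios::end);
  const size_t size = static_cast<size_t>(fs.tellg());
  fs.seekg(0, std::ios::beg);
  std::string data(size, '\0');
  fs.read(&data[0], static_cast<std::streamsize>(size));
  return data;
}
```

Reading in binary mode matters here: tokenizer model files (especially `tokenizer.model`) are binary blobs, and text-mode reads could corrupt them on some platforms.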

### Extra Details

Currently, the project generates three static libraries:
- `libtokenizers_c.a`: the C binding to the tokenizers Rust library
- `libsentencepiece.a`: the sentencepiece static library
- `libtokenizers_cpp.a`: the C++ binding implementation

If you are using an IDE, you can likely first use CMake to generate
these libraries and add them to your development environment.
If you are using CMake, `target_link_libraries(yourlib tokenizers_cpp)`
will automatically link in the other two libraries.
You can also check out [MLC LLM](https://github.com/mlc-ai/mlc-llm)
as an example of a complete LLM chat application integration.

## JavaScript Support

We use emscripten to expose tokenizers-cpp to WASM and JavaScript.
Check out [web](web) for more details.

## Acknowledgements

This project is only possible thanks to the open-source ecosystems on whose shoulders we stand.
In particular, it builds on the sentencepiece and tokenizers libraries.