HF Tokenizers #11

Merged · 16 commits · Dec 21, 2024
Conversation

gabe-l-hart (Contributor)

Description

This PR introduces a collection of new capabilities in support of Hugging Face tokenizer models. It is minimally complete at this point, so I'm opening this PR as a point of discussion. The main goal is to add compatibility with the Hugging Face tokenizers library at the C++ level.

Additions

Dependencies

  • Add a third-party dependency on nlohmann/json to support parsing tokenizer.json files from HF tokenizers
  • Add a vendored copy of unicode.[h|cpp] and unicode-data.[h|cpp] from llama.cpp. These implement the full set of transformations to and from the byte-encoded space required by the ByteLevel functionality.
    • The files are under the MIT license, so I copied the license text into a header and left a reference to the commit I copied them from
    • There is a possible inefficiency when interfacing with these methods: they take const std::string& rather than a string view (re2::StringPiece), so we need to copy from the view to a string before calling the llama.cpp functions. This could likely be optimized by changing the function signatures in the vendored code, but I tried to avoid any changes to them at this point.
    • I chose to put these directly into the source rather than under third_party since they are a slim subset of the full llama.cpp project.

Structure Changes

  • Add the include/detail directory to hold headers that are needed by the public interface but should not be treated as part of it (in particular bpe_tokenizer_base.h)
  • Use the tokenizers::detail namespace for all code that lives below include/detail (and the corresponding src files)
  • Add src/include for implementation-only headers
  • Add a tools subdir to hold binary tool builds (tools/tokenize_tool)
    • I chose to put each tool in its own subdir so that tools can have multiple files without clutter

Code Changes

  • Introduce the concepts of a PreTokenizer and a TokenDecoder, aligned with the corresponding concepts in the Rust codebase
    • These are not 1:1 ports, but rather a logical port to fit the existing tokenizer class structure
  • Split Tiktoken into a base implementation of BPE (BPETokenizerBase) with hooks for the specifics of the Tiktoken models
  • Add HFTokenizer as a second derived BPETokenizerBase implementation which adds a PreTokenizer / TokenDecoder to the input/output logic
  • Add the tokenize_tool for sanity-checking tokenizers. This was mostly for my own sanity, so it may not have much utility in the future, but it may also be a useful development tool, so I left it in. It might make sense to move it somewhere else?

Build and Test

  • Refactored unit test build setup:
    • Programmatically discover test files
    • Use a #define to set RESOURCES_PATH to avoid relying on the environment
    • Add all tests to ctest
    • Update the CI to use ctest instead of invoking test binaries directly
    • NOTE: The naming of the binaries changed to match the naming of the files (test_tiktoken.cpp -> test_tiktoken rather than tiktoken_test)
  • Added reasonably complete unit tests for the pre_tokenizer stack, but nothing else (yet!)

Testing

Build setup

# Set up the build dir
cd /path/to/tokenizers
mkdir -p build
cd build

# Configure cmake to build the unit tests and the tokenize tool
cmake .. -D TOKENIZERS_BUILD_TEST=ON -D TOKENIZERS_BUILD_TOOLS=ON

# Build it
make -j

Unit testing

# (From build)

# Run all unit tests
ctest

# Run the test binary directly
./test_pre_tokenizer

Spot testing

I also did spot testing with several tokenizer models:

# Test with an HF tokenizer json directly
./tools/tokenize_tool/tokenize_tool hf_tokenizer $HOME/models/ibm-granite/granite-3b-code-instruct-128k/tokenizer.json This is a test 1234 5 foobar

# Test with an HF model repo that has tokenizer.json and tokenizer_config.json
./tools/tokenize_tool/tokenize_tool hf_tokenizer $HOME/models/ibm-granite/granite-3b-code-instruct-128k This is a test 1234 5 foobar

# Test with a tiktoken tokenizer
./tools/tokenize_tool/tokenize_tool tiktoken $HOME/models/meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.model This is a test 1234 5 foobar

Still TODO

  • Add unit tests for the decoder suite
  • Add unit tests for the HF tokenizer itself
  • Ensure that special tokens are handled correctly with the HFTokenizer (no more reliance on tiktoken hard-coded special tokens)

Future work

  • There are still a lot of PreTokenizers and TokenDecoders that are not yet supported, and the whole PostProcessors suite remains as well. A good test case for this would be the tokenizers version of the Llama 3.1 tokenizer.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 11, 2024
@gabe-l-hart gabe-l-hart changed the title Hf tokenizers HF Tokenizers Dec 11, 2024
@larryliu0820 (Contributor)

Please use lintrunner to format the code

@iseeyuan left a comment

LGTM. Thanks for adding the hf tokenizer, and refactoring the code base!

@iseeyuan

Please address @larryliu0820 's comment, and fix the submodule issue in CI before merging it!

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
I've added this in a new include/detail directory and used
tokenizers::detail as the namespace. This base class should not be part of
the public interface, but it needs to be in the include tree since it will
be used by multiple public interface classes.

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Programmatically discover unit tests to build
* Use target_compile_definitions to set RESOURCES_PATH and avoid env vars
* Add unit tests to ctest so that they can be run programmatically
* Run all unit tests in CI with ctest

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This code is available under the MIT license which is retained in the files

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is a partial implementation of the pre tokenizers from the HF project.
Others can be added by extending the set of derived classes and the
factory.

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This still doesn't implement the post_processor portion of the HF
tokenizers library

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is a cleaner designation as a third-party asset and avoids needing to
carefully read the comment (plus ignore it for linting).

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart (Contributor, Author)

Comments addressed!

@larryliu0820 (Contributor)

Thank you! Please make it “Ready for review” so that it can be merged. Also please update the submodule commit hash in torchchat after this PR is merged.

@gabe-l-hart gabe-l-hart marked this pull request as ready for review December 19, 2024 17:01
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart (Contributor, Author)

@larryliu0820 Ok, officially ready for review. The linux build was failing with undefined symbols for va_start/va_end, so I added the #include <cstdarg> line. I'm on a Mac, so if it's still failing I can try building in a docker container or on a remote build box, but I'm hoping this just fixes things.

@larryliu0820 larryliu0820 merged commit 2aefafa into pytorch-labs:main Dec 21, 2024
2 checks passed
@gabe-l-hart gabe-l-hart deleted the HFTokenizers branch January 2, 2025 20:45