HF Tokenizers #11

Merged · 16 commits · Dec 21, 2024
Conversation

gabe-l-hart (Contributor)

Description

This PR introduces a collection of new capabilities in support of Hugging Face tokenizer models. It is minimally complete at this point, so I'm opening this PR as a point of discussion. The main goal is to add compatibility with the Hugging Face tokenizers library at the C++ level.

Additions

Dependencies

  • Add a third-party dependency on nlohmann/json to support parsing tokenizer.json files from HF tokenizers
  • Add a vendored copy of unicode.[h|cpp] and unicode-data.[h|cpp] from llama.cpp. These implement the full set of transformations to and from the byte-encoded space required by the ByteLevel functionality.
    • The files are under the MIT license, so I copied the license text into a header and left a reference to the commit I copied them from
    • There is a possible inefficiency when interfacing with these methods: they take const std::string& rather than a string view (re2::StringPiece), so we need to copy from the view to a string before calling the llama.cpp functions. This could likely be optimized by changing the function signatures in the vendored code, but I tried to avoid any changes to them at this point.
    • I chose to put these directly into the source rather than under third_party since they are a slim subset of the full llama.cpp project.

Structure Changes

  • Add the include/detail directory to hold headers that are needed by the public interface but should not be treated as part of it (in particular bpe_tokenizer_base.h)
  • Use the tokenizers::detail namespace for all code that lives below include/detail (and the corresponding src files)
  • Add src/include for implementation-only headers
  • Add a tools subdir to hold binary tool builds (tools/tokenize_tool)
    • I chose to put each tool in its own subdir so that tools can have multiple files without clutter

Code Changes

  • Introduce the concepts of a PreTokenizer and a TokenDecoder, aligned with the corresponding concepts in the Rust codebase
    • These are not 1:1 ports, but rather a logical port to fit the existing tokenizer class structure
  • Split Tiktoken into a base implementation of BPE (BPETokenizerBase) with hooks for the specifics of the Tiktoken models
  • Add HFTokenizer as a second derived BPETokenizerBase implementation which adds a PreTokenizer / TokenDecoder to the input/output logic
  • Add the tokenize_tool for sanity-checking tokenizers. This was mostly for my own sanity, so it may not have much utility in the future, but it may also be a useful development tool, so I left it in. It might make sense to move it somewhere else?

Build and Test

  • Refactored unit test build setup:
    • Programmatically discover test files
    • Use a #define to set RESOURCES_PATH to avoid relying on the environment
    • Add all tests to ctest
    • Update the CI to use ctest instead of invoking test binaries directly
    • NOTE: The naming of the binaries changed to match the naming of the files (test_tiktoken.cpp -> test_tiktoken rather than tiktoken_test)
  • Added reasonably complete unit tests for the pre_tokenizer stack, but nothing else (yet!)

Testing

Build setup

# Set up the build dir
cd /path/to/tokenizers
mkdir -p build
cd build

# Configure cmake to build the unit tests and the tokenize tool
cmake .. -D TOKENIZERS_BUILD_TEST=ON -D TOKENIZERS_BUILD_TOOLS=ON

# Build it
make -j

Unit testing

# (From build)

# Run all unit tests
ctest

# Run the test binary directly
./test_pre_tokenizer

Spot testing

I also did spot testing with several tokenizer models:

# Test with an HF tokenizer json directly
./tools/tokenize_tool/tokenize_tool hf_tokenizer $HOME/models/ibm-granite/granite-3b-code-instruct-128k/tokenizer.json This is a test 1234 5 foobar

# Test with an HF model repo that has tokenizer.json and tokenizer_config.json
./tools/tokenize_tool/tokenize_tool hf_tokenizer $HOME/models/ibm-granite/granite-3b-code-instruct-128k This is a test 1234 5 foobar

# Test with a tiktoken tokenizer
./tools/tokenize_tool/tokenize_tool tiktoken $HOME/models/meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.model This is a test 1234 5 foobar

Still TODO

  • Add unit tests for the decoder suite
  • Add unit tests for the HF tokenizer itself
  • Ensure that special tokens are handled correctly with the HFTokenizer (no more reliance on tiktoken hard-coded special tokens)

Future work

  • There are still a lot of PreTokenizers and TokenDecoders that are not yet supported, and the whole PostProcessors suite remains as well. A good test case for this would be the tokenizers version of the Llama 3.1 tokenizer.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 11, 2024
@gabe-l-hart gabe-l-hart changed the title Hf tokenizers HF Tokenizers Dec 11, 2024
@larryliu0820 (Contributor)

Please use lintrunner to format the code

@iseeyuan left a comment

LGTM. Thanks for adding the hf tokenizer, and refactoring the code base!

@iseeyuan

Please address @larryliu0820 's comment, and fix the submodule issue in CI before merging it!

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
I've added this in a new include/detail directory and used
tokenizers::detail as the namespace. This base class should not be part of
the public interface, but it needs to be in the include tree since it will
be used by multiple public interface classes.

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Programmatically discover unit tests to build
* Use target_compile_definitions to set RESOURCES_PATH and avoid env vars
* Add unit tests to ctest so that they can be run programmatically
* Run all unit tests in CI with ctest

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This code is available under the MIT license which is retained in the files

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is a partial implementation of the pre tokenizers from the HF project.
Others can be added by extending the set of derived classes and the
factory.

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This still doesn't implement the post_processor portion of the HF
tokenizers library

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is a cleaner designation as a third-party asset and avoids needing to
carefully read the comment (plus ignore it for linting).

Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart (Contributor, Author)

Comments addressed!

@larryliu0820 (Contributor)

Thank you! Please make it “Ready for review” so that it can be merged. Also please update the submodule commit hash in torchchat after this PR is merged.

@gabe-l-hart gabe-l-hart marked this pull request as ready for review December 19, 2024 17:01
Branch: HFTokenizers

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart (Contributor, Author)

@larryliu0820 Ok, officially ready for review. The linux build was failing with undefined symbols for va_start/va_end, so I added the #include <cstdarg> line. I'm on a Mac, so if it's still failing I can try building in a docker container or on a remote build box, but I'm hoping this just fixes things.

@larryliu0820 larryliu0820 merged commit 2aefafa into pytorch-labs:main Dec 21, 2024
2 checks passed
@gabe-l-hart gabe-l-hart deleted the HFTokenizers branch January 2, 2025 20:45