HF Tokenizers #11
Conversation
Please use lintrunner to format the code
LGTM. Thanks for adding the hf tokenizer, and refactoring the code base!
Please address @larryliu0820's comment, and fix the submodule issue in CI before merging it!
I've added this in a new include/detail directory and used tokenizers::detail as the namespace. This base class should not be part of the public interface, but it needs to be in the include tree since it will be used by multiple public interface classes.
* Programmatically discover unit tests to build
* Use target_compile_definitions to set RESOURCES_PATH and avoid env vars
* Add unit tests to ctest so that they can be run programmatically
* Run all unit tests in CI with ctest
This code is available under the MIT license, which is retained in the files.
This is a partial implementation of the pre-tokenizers from the HF project. Others can be added by extending the set of derived classes and the factory.
This still doesn't implement the post_processor portion of the HF tokenizers library.
This is a cleaner designation as a third-party asset and avoids needing to carefully read the comment (plus ignore it for linting).
Comments addressed!
Thank you! Please make it “Ready for review” so that it can be merged. Also please update the submodule commit hash in torchchat after this PR is merged.
@larryliu0820 Ok, officially ready for review. The linux build was failing with undefined symbols for
Description
This PR introduces a collection of new capabilities in support of Huggingface tokenizer models. It is minimally complete at this point, so I'm opening up this PR as a point of discussion. The main goal is to add compatibility with the tokenizers library at the C++ level.
Additions
Dependencies
- To support `tokenizer.json` files from HF tokenizers, this PR vendors `unicode.[h|cpp]` and `unicode-data.[h|cpp]` from `llama.cpp`. These implement the full set of transformations needed to go to/from the byte-encoded space needed by the `ByteLevel` functionality.
- The vendored functions take a `const std::string&` rather than a `string_view` (`re2::StringPiece`), so we need to copy from the view to a string before calling the `llama.cpp` functions (see the sketch after this list). This could likely be optimized by changing the function signatures in the vendored code, but I tried to avoid any changes to them at this point.
- These files are vendored rather than adding `llama.cpp` under `third_party`, since this was a slim subset of the full `llama.cpp` project.
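As a rough sketch of that copy (`vendored_byte_encode` below is a hypothetical stand-in for the vendored `llama.cpp` helpers, not their actual names or signatures):

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for a vendored llama.cpp unicode helper; like the
// real helpers, it takes a const std::string& rather than a string view.
std::string vendored_byte_encode(const std::string& text) {
  return text;  // placeholder body, for illustration only
}

// Pre-tokenizer splits are views into the original input, so each view has
// to be copied into an owning std::string before calling the vendored code.
std::string byte_encode_piece(std::string_view piece) {
  std::string owned(piece);  // the extra copy mentioned above
  return vendored_byte_encode(owned);
}
```

Changing the vendored signatures to accept a view would remove this copy, at the cost of diverging from upstream `llama.cpp`.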
Structure Changes
- Added a new `include/detail` directory to hold headers that are needed by the public interface, but should not be treated as part of the public interface (in particular `bpe_tokenizer_base.h`); a sketch of the convention follows this list
- Added the `tokenizers::detail` namespace for all code that lives below `include/detail` (and corresponding src files)
- Added `src/include` for implementation-only headers
- Added a `tools` subdir to hold binary tool builds (`tools/tokenize_tool`)
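For illustration, a header under `include/detail` follows this pattern (a simplified sketch; the member shown is hypothetical, and only the path, namespace, and class name come from the PR):

```cpp
// include/detail/bpe_tokenizer_base.h -- simplified, illustrative sketch
#pragma once

#include <string>

namespace tokenizers {
namespace detail {

// Lives in the include tree because public tokenizer classes derive from it,
// but the detail namespace marks it as outside the public interface.
class BPETokenizerBase {
 public:
  virtual ~BPETokenizerBase() = default;

 protected:
  std::string name_;  // hypothetical shared state, for illustration only
};

}  // namespace detail
}  // namespace tokenizers
```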
Code Changes
- Added `PreTokenizer` and `TokenDecoder`, aligned with the corresponding concepts in the Rust codebase
- Refactored `Tiktoken` into a base implementation of BPE (`BPETokenizerBase`) with hooks for the specifics of the `Tiktoken` models
- Added `HFTokenizer` as a second derived `BPETokenizerBase` implementation which adds a `PreTokenizer`/`TokenDecoder` to the input/output logic (a rough sketch of this layering follows the list)
- Added `tokenize_tool` to sanity check tokenizers. This was mostly just for my own sanity, so it may not have much utility in the future, but I figured it may also be a useful development tool, so I left it in. It might make sense to move it somewhere else?
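The layering, very roughly (a sketch only; member and method names are hypothetical and the real signatures differ — only the class names come from the PR):

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Pluggable pieces mirroring the HF tokenizers Rust concepts.
class PreTokenizer {
 public:
  virtual ~PreTokenizer() = default;
  virtual std::vector<std::string> split(const std::string& text) const = 0;
};

class TokenDecoder {
 public:
  virtual ~TokenDecoder() = default;
  virtual std::string decode(const std::string& token) const = 0;
};

// Shared BPE machinery with hooks for model-specific behavior; Tiktoken and
// HFTokenizer both derive from this.
class BPETokenizerBase {
 public:
  virtual ~BPETokenizerBase() = default;

  std::vector<uint64_t> encode(const std::string& text) const {
    std::vector<uint64_t> ids;
    for (const std::string& piece : pre_split(text)) {
      (void)piece;  // ...the shared BPE merge loop over each piece goes here...
    }
    return ids;
  }

 protected:
  // Hook: how the input is split before the BPE merges run.
  virtual std::vector<std::string> pre_split(const std::string& text) const = 0;
};

// HFTokenizer plugs a PreTokenizer/TokenDecoder into the shared BPE logic.
class HFTokenizer : public BPETokenizerBase {
 public:
  HFTokenizer(std::unique_ptr<PreTokenizer> pre, std::unique_ptr<TokenDecoder> dec)
      : pre_(std::move(pre)), dec_(std::move(dec)) {}

  // Output side: decoded token strings pass through the TokenDecoder.
  std::string decode_token(const std::string& token) const {
    return dec_->decode(token);
  }

 protected:
  std::vector<std::string> pre_split(const std::string& text) const override {
    return pre_->split(text);
  }

 private:
  std::unique_ptr<PreTokenizer> pre_;
  std::unique_ptr<TokenDecoder> dec_;
};
```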
Build and Test
- Use `target_compile_definitions` to set `RESOURCES_PATH` to avoid relying on the environment (see the sketch after this list)
- Added the unit tests to `ctest` so they can be run programmatically
- CI now runs `ctest` instead of invoking test binaries directly
- Renamed the test binaries to follow their source files (`test_tiktoken.cpp` -> `test_tiktoken` rather than `tiktoken_test`)
- Added unit tests for the `pre_tokenizer` stack, but nothing else (yet!)
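For example, tests can build resource paths from the compile-time definition rather than an environment variable (illustrative only; `resource_path` is a hypothetical helper, and the CMake line in the comment just shows the pattern):

```cpp
#include <string>

// RESOURCES_PATH is injected at configure time, e.g.
//   target_compile_definitions(my_test PRIVATE RESOURCES_PATH="${CMAKE_SOURCE_DIR}/test/resources")
// so tests no longer read it from the environment at runtime.
#ifndef RESOURCES_PATH
#define RESOURCES_PATH "test/resources"  // hypothetical fallback, illustration only
#endif

// Hypothetical helper a unit test could use to locate its test assets.
inline std::string resource_path(const std::string& name) {
  return std::string(RESOURCES_PATH) + "/" + name;
}
```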
Testing
Build setup
Unit testing
Spot testing
I also did spot testing with several tokenizer models:
Still TODO
- Special token support beyond the `tiktoken` hard-coded special tokens
Future work
There are a number of `PreTokenizers` and `TokenDecoders` that are not yet supported, and there's also the whole `PostProcessors` suite. A good test case for this would be the `tokenizers` version of the `llama3.1` tokenizer (here).
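Per the commit note above ("Others can be added by extending the set of derived classes and the factory"), adding another pre-tokenizer amounts to a new derived class plus a factory case. A hypothetical sketch (none of these names come from the PR except the `PreTokenizer` concept):

```cpp
#include <memory>
#include <string>
#include <vector>

// Same illustrative PreTokenizer interface as in the earlier sketch.
class PreTokenizer {
 public:
  virtual ~PreTokenizer() = default;
  virtual std::vector<std::string> split(const std::string& text) const = 0;
};

// Hypothetical new derived class: split on ASCII whitespace.
class WhitespacePreTokenizer : public PreTokenizer {
 public:
  std::vector<std::string> split(const std::string& text) const override {
    std::vector<std::string> pieces;
    std::string current;
    for (char c : text) {
      if (c == ' ' || c == '\t' || c == '\n') {
        if (!current.empty()) pieces.push_back(current);
        current.clear();
      } else {
        current.push_back(c);
      }
    }
    if (!current.empty()) pieces.push_back(current);
    return pieces;
  }
};

// Hypothetical factory hook: map a "type" string from the tokenizer.json
// pre_tokenizer config onto a concrete implementation.
std::unique_ptr<PreTokenizer> make_pre_tokenizer(const std::string& type) {
  if (type == "Whitespace") {
    return std::make_unique<WhitespacePreTokenizer>();
  }
  return nullptr;  // unsupported types would be surfaced as errors
}
```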