-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449
Conversation
5447152
to
9ce57e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This technically works, but IMO we could improve it a bit:
Currently, llm_tokenizer_data
to store working data and llm_tokenizer
to store "fixed" shared data. However, because tokenize()
is still inside llm_tokenizer
, there is nothing prevent it from mutating data inside llm_tokenizer
(which is not desirable).
A better solution would be:
- Move
tokenize()
function tollm_tokenizer_data
(maybe change the class name tollm_tokenizer_session
to reflect that the object is short-loved) tokenize()
takeconst llm_tokenizer *
as argument. Theconst
is to make sure that the shared object is read-only.
@ngxson Thanks your feedback, what approach do you think is better:
struct llm_tokenizer_bpe_session : llm_tokenizer_session {
void tokenize(const llm_tokenizer * tokenizer, const std::string & text, std::vector<llama_vocab::id> & output) {
tokenizer->tokenize(text, *this, output);
}
std::vector<llm_symbol> symbols;
std::vector<llm_symbol> symbols_final;
llm_bigram_bpe::queue work_queue;
};
|
The 3rd option should be the proper way. The idea is:
I don't see why it's a breaking change compared to what you're currently doing with The constructor of the session will look like: |
9ce57e0
to
0166d83
Compare
0166d83
to
5abee9c
Compare
src/llama-vocab.cpp
Outdated
tokenizer = new llm_tokenizer_rwkv(vocab); | ||
break; | ||
default: | ||
GGML_ABORT("fatal error"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GGML_ABORT("fatal error"); | |
GGML_ABORT("unknown vocab type"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kylo5aby Are you interested in adding a test that accepts a vocab file (see ./models/ggml-vocab*.gguf
and tokenizes random strings in parallel on multiple threads? The test-log
test can be used as a starting point. The goal is to run it through thread sanitizer in order to guarantee thread-safety of the tokenization API.
src/llama-vocab.cpp
Outdated
struct llm_tokenizer_wpm { | ||
llm_tokenizer_wpm(const llama_vocab & vocab): vocab(vocab) {} | ||
struct llm_tokenizer_wpm : llm_tokenizer { | ||
llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {} | |
llm_tokenizer_wpm(const llama_vocab & vocab) : llm_tokenizer(vocab) {} |
src/llama-vocab.cpp
Outdated
struct llm_tokenizer_bpe { | ||
llm_tokenizer_bpe(const llama_vocab & vocab): vocab(vocab) { | ||
struct llm_tokenizer_bpe : llm_tokenizer { | ||
llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) { | |
llm_tokenizer_bpe(const llama_vocab & vocab) : llm_tokenizer(vocab) { |
5abee9c
to
d949c58
Compare
tests/CMakeLists.txt
Outdated
# build test-tokenizer-parallel target once and add many tests | ||
add_executable(test-tokenizer-parallel test-tokenizer-parallel.cpp) | ||
target_link_libraries(test-tokenizer-parallel PRIVATE common) | ||
install(TARGETS test-tokenizer-parallel RUNTIME) | ||
|
||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-bert-bge ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-bert-bge.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-command-r ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-command-r.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-coder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-coder.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-llm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-llm.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-falcon ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-falcon.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-gpt-2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-gpt-2.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-bpe ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-bpe.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-spm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-mpt ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-mpt.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-phi-3 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-phi-3.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-qwen2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-qwen2.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-refact ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-refact.gguf) | ||
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-starcoder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-starcoder.gguf) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's improve this a bit further - looking at your changes I realized we don't need to create a separate test. We can simply extend the existing test-tokenizer-0
to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp
, but just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end - this remains single-threaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's improve this a bit further - looking at your changes I realized we don't need to create a separate test. We can simply extend the existing
test-tokenizer-0
to become multi-threaded. You've pretty much done it intest-tokenizer-parallel.cpp
, but just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end - this remains single-threaded.
how about introduce a mutex
to make the printing of results orderly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about introduce a mutex to make the printing of results orderly?
Nah, it would still be scrambled. Since there aren't that many tests anyway, maybe it's simpler to have all threads compute all tests in parallel and only the first thread to print the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reasonable. I'll update test-tokenizer-0
and makes it only print the result of thread 0, if there are any data race, the thread sanitizer will post warnings or abort.
d949c58
to
403758f
Compare
…lama.cpp into ggerganov-gg/tokenizer-cleanup
I found there is a CI error on windows(clang), as mentioned in #9557, that due to symbol confict, after converting the types in |
To fix the CI and enable merge, rename llama.cpp/examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp Lines 203 to 207 in 37f8c7b
|
Thank you very much, without this PR, tokenization in e5-like models takes insane amount of time. Now it's nearly instant. |
* refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * refactor tokenizer * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * remove unused files * remove unused fileds to avoid unused filed build error * avoid symbol link error * Update src/llama.cpp * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * refactor tokenizer * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * remove unused files * remove unused fileds to avoid unused filed build error * avoid symbol link error * Update src/llama.cpp * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * refactor tokenizer * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * remove unused files * remove unused fileds to avoid unused filed build error * avoid symbol link error * Update src/llama.cpp * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * refactor tokenizer * refactor tokenizer * llama : make llm_tokenizer more private ggml-ci * remove unused files * remove unused fileds to avoid unused filed build error * avoid symbol link error * Update src/llama.cpp * Update src/llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ref #9369
the PR mainly does:
llama_vocab
llama_tokenize_internal
for thread safe