Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449

Merged
merged 12 commits into from
Sep 28, 2024

Conversation

kylo5aby
Copy link
Contributor

@kylo5aby kylo5aby commented Sep 12, 2024

ref #9369

the PR mainly does:

  • create a tokenizer object after initialize the llama_vocab
  • put the initialization of mutable data of each tokenizer object in llama_tokenize_internal for thread safe

@kylo5aby kylo5aby changed the title refactor tokenizer to reduce the overhead of creating multi times tokenizer vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer Sep 12, 2024
Copy link
Collaborator

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This technically works, but IMO we could improve it a bit:

Currently, llm_tokenizer_data to store working data and llm_tokenizer to store "fixed" shared data. However, because tokenize() is still inside llm_tokenizer, there is nothing prevent it from mutating data inside llm_tokenizer (which is not desirable).

A better solution would be:

  • Move tokenize() function to llm_tokenizer_data (maybe change the class name to llm_tokenizer_session to reflect that the object is short-loved)
  • tokenize() take const llm_tokenizer * as argument. The const is to make sure that the shared object is read-only.

@kylo5aby
Copy link
Contributor Author

This technically works, but IMO we could improve it a bit:

Currently, llm_tokenizer_data to store working data and llm_tokenizer to store "fixed" shared data. However, because tokenize() is still inside llm_tokenizer, there is nothing prevent it from mutating data inside llm_tokenizer (which is not desirable).

A better solution would be:

  • Move tokenize() function to llm_tokenizer_data (maybe change the class name to llm_tokenizer_session to reflect that the object is short-loved)
  • tokenize() take const llm_tokenizer * as argument. The const is to make sure that the shared object is read-only.

@ngxson Thanks your feedback, what approach do you think is better:

  1. declare tokenize in llm_tokenizer_session, and pass const llm_tokenizer * tokenizer, and call tokenizer's related method in the inner, that means the logic implementation(eg. tokenize, append) is still belong to every llm_tokenizer. For example:
struct llm_tokenizer_bpe_session : llm_tokenizer_session {

    void tokenize(const llm_tokenizer * tokenizer, const std::string & text, std::vector<llama_vocab::id> & output) {
        tokenizer->tokenize(text, *this, output);
    }

    std::vector<llm_symbol> symbols;
    std::vector<llm_symbol> symbols_final;
    llm_bigram_bpe::queue work_queue;
};
  1. Keep the structures as in current PR, and declare all methods as const for every llm_tokenizer , so the shared data in every llm_tokenizer are not allowed to modify. That maybe bring smaller change.

  2. move all the logic implementation from every llm_tokenizer to llm_tokenizer_session, the llm_tokenizer will only contains shared data, every time want to use tokenize operations, it must create a llm_tokenizer_session, which maybe bring a break change.

@ngxson
Copy link
Collaborator

ngxson commented Sep 14, 2024

The 3rd option should be the proper way. The idea is:

  • llm_tokenizer contains only fixed/shared data. It will be initialized with the model, stored in model.tokenizer
  • Functions using the tokenizer like llama_tokenize_impl, llama_token_to_piece_impl, llama_detokenize_impl will take const llm_tokenizer & as input. They no longer take const vocab as a param because vocab is already inside const llm_tokenizer

every time want to use tokenize operations, it must create a llm_tokenizer_session, which maybe bring a break change.

I don't see why it's a breaking change compared to what you're currently doing with llm_tokenizer_data. Every time the tokenizer is used, a new llm_tokenizer_session must be created.

The constructor of the session will look like: llm_tokenizer_session(const llm_tokenizer & tokenizer)

src/llama-vocab.h Outdated Show resolved Hide resolved
tokenizer = new llm_tokenizer_rwkv(vocab);
break;
default:
GGML_ABORT("fatal error");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
GGML_ABORT("fatal error");
GGML_ABORT("unknown vocab type");

src/llama-vocab.cpp Outdated Show resolved Hide resolved
src/llama-vocab.cpp Outdated Show resolved Hide resolved
@ngxson ngxson requested a review from ggerganov September 19, 2024 11:49
Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kylo5aby Are you interested in adding a test that accepts a vocab file (see ./models/ggml-vocab*.gguf and tokenizes random strings in parallel on multiple threads? The test-log test can be used as a starting point. The goal is to run it through thread sanitizer in order to guarantee thread-safety of the tokenization API.

struct llm_tokenizer_wpm {
llm_tokenizer_wpm(const llama_vocab & vocab): vocab(vocab) {}
struct llm_tokenizer_wpm : llm_tokenizer {
llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}
llm_tokenizer_wpm(const llama_vocab & vocab) : llm_tokenizer(vocab) {}

struct llm_tokenizer_bpe {
llm_tokenizer_bpe(const llama_vocab & vocab): vocab(vocab) {
struct llm_tokenizer_bpe : llm_tokenizer {
llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {
llm_tokenizer_bpe(const llama_vocab & vocab) : llm_tokenizer(vocab) {

Comment on lines 87 to 105
# build test-tokenizer-parallel target once and add many tests
add_executable(test-tokenizer-parallel test-tokenizer-parallel.cpp)
target_link_libraries(test-tokenizer-parallel PRIVATE common)
install(TARGETS test-tokenizer-parallel RUNTIME)

llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-bert-bge ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-bert-bge.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-command-r ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-command-r.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-coder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-coder.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-llm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-llm.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-falcon ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-falcon.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-gpt-2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-gpt-2.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-bpe ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-bpe.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-spm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-mpt ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-mpt.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-phi-3 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-phi-3.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-qwen2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-qwen2.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-refact ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-refact.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-starcoder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-starcoder.gguf)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's improve this a bit further - looking at your changes I realized we don't need to create a separate test. We can simply extend the existing test-tokenizer-0 to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp, but just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end - this remains single-threaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's improve this a bit further - looking at your changes I realized we don't need to create a separate test. We can simply extend the existing test-tokenizer-0 to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp, but just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end - this remains single-threaded.

how about introduce a mutex to make the printing of results orderly?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about introduce a mutex to make the printing of results orderly?

Nah, it would still be scrambled. Since there aren't that many tests anyway, maybe it's simpler to have all threads compute all tests in parallel and only the first thread to print the results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasonable. I'll update test-tokenizer-0 and makes it only print the result of thread 0, if there are any data race, the thread sanitizer will post warnings or abort.

@github-actions github-actions bot added the testing Everything test related label Sep 20, 2024
@kylo5aby
Copy link
Contributor Author

I found there is a CI error on windows(clang), as mentioned in #9557, that due to symbol confict, after converting the types in examples, this would be solved.

@ggerganov
Copy link
Owner

To fix the CI and enable merge, rename llama_vocab to my_llama_vocab here:

struct llama_vocab {
using id = int32_t;
using token = std::string;
using ttype = llama_token_type;

src/llama.cpp Outdated Show resolved Hide resolved
src/llama.cpp Outdated Show resolved Hide resolved
@Nekotekina
Copy link
Contributor

Thank you very much, without this PR, tokenization in e5-like models takes insane amount of time. Now it's nearly instant.

@ggerganov ggerganov merged commit 6102037 into ggerganov:master Sep 28, 2024
53 checks passed
matiaslin pushed a commit to matiaslin/llama.cpp that referenced this pull request Sep 28, 2024
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fileds to avoid unused filed build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
matiaslin pushed a commit to matiaslin/llama.cpp that referenced this pull request Sep 28, 2024
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fileds to avoid unused filed build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fileds to avoid unused filed build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fileds to avoid unused filed build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
examples testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants