llama : more tokenizer fixes #2810
Seems to be okay with the server, at least.
I'm looking into more details in this code: llama.cpp/examples/server/server.cpp, lines 298 to 307 in e4324cb

I think it should be:

{
    if (first)
    {
        prompt_tokens.push_back(bos_token);
        first = false;
    }
    prompt_tokens.push_back(p.template get<llama_token>());
}

Not sure how important this is, but mentioning it just in case. You will need to propagate
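For reference, here is a hedged sketch of how the whole loop could look with that change applied. The is_string() branch, the nlohmann::json usage and the std::string-based ::llama_tokenize wrapper from the common lib are assumptions based on the snippet above, not the actual server.cpp code:

#include <string>
#include <vector>
#include "json.hpp"   // nlohmann::json, as used by the server example (path assumed)
#include "common.h"
#include "llama.h"

using json = nlohmann::json;

// Push BOS only once, before the first element of a mixed (string / token id) prompt array.
static std::vector<llama_token> tokenize_mixed_prompt(llama_context * ctx, const json & prompt, llama_token bos_token) {
    std::vector<llama_token> prompt_tokens;
    bool first = true;
    for (const auto & p : prompt) {
        if (p.is_string()) {
            // add_bos only for the very first part (assumption)
            const auto part = ::llama_tokenize(ctx, p.get<std::string>(), first);
            prompt_tokens.insert(prompt_tokens.end(), part.begin(), part.end());
            first = false;
        } else {
            if (first) {
                prompt_tokens.push_back(bos_token);
                first = false;
            }
            prompt_tokens.push_back(p.template get<llama_token>());
        }
    }
    return prompt_tokens;
}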
The LF token is not correct: it should not be tokenized with a prepended space.
Master:
I missed this, but yeah, there's a difference for the LF token in
Ah, good catch. The actual problem is that it should not be tokenized; instead we should use
Now LF should work in both llama and falcon.
llama2 now correctly showing
Seems to work well.
I have verified that this branch compared to
So, I approved the changes, but the PR basically goes back to auto-adding white space at the beginning of strings to be tokenized, which changes HellaSwag scores. I.e., it kind of reverts @klosax's fix in #2806. With PR #2805 merged, this no longer leads to the dramatic difference we saw for HellaSwag after the GGUF merge, but still. I guess the values in #2321 will need to be updated (for instance, we now have a HellaSwag score of 76.75 after 400 tasks for LLaMA-v2-7B instead of the 77.5 shown in the table in #2321).
Yes, after digging into this yesterday, I believe that the proposed implementation in this PR is the correct one and the old values have to be updated. The auto-prefix with whitespace inside the tokenizer […]

It's still possible that I am misunderstanding some of the tokenizer logic, but atm I'm fairly confident that the proposal is better than what we have on master.

Edit: Regarding the concatenation comment, here is an example to try to make it clear. Running this Python code first with LLaMA's SPM tokenizer and then with Falcon's BPE tokenizer:

print(tokenizer.encode('hello'))       # x, beginning
print(tokenizer.encode('world'))       # y, ending
print(tokenizer.encode(' world'))      # z, ending with whitespace prefix
print(tokenizer.encode('hello world')) # r, concat

# results with SPM
[22172]        # x
[3186]         # y
[29871, 3186]  # z
[22172, 3186]  # r = [ x, y ]

# results with BPE
[30835]        # x
[9091]         # y
[1079]         # z
[30835, 1079]  # r = [ x, z ]
And one more observation regarding the built-in Falcon BPE tokenization in llama.cpp. We currently know that it does not handle Unicode strings correctly. However, running the unit test, we can see that doing a roundtrip of (tokenize + detokenize) we always get the "source" string:

$ ./bin/test-tokenizer-0-falcon ../models/falcon-7b/ggml-model-f16.gguf
src: ' '
res: ' '
tok: 204
src: ' '
res: ' '
tok: 258
src: ' '
res: ' '
tok: 466
src: ' Hello'
res: ' Hello'
tok: 466 23090
src: ' Hello
Hello'
res: ' Hello
Hello'
tok: 466 23090 742 23090
src: ' Hello'
res: ' Hello'
tok: 258 23090
src: ' Hello'
res: ' Hello'
tok: 204 23090
src: ' Hello'
res: ' Hello'
tok: 23090
src: ' Hello World'
res: ' Hello World'
tok: 23090 2889
src: ' Hello World!'
res: ' Hello World!'
tok: 23090 2889 12
src: ' Hello world'
res: ' Hello world'
tok: 23090 1079
src: ' Hello, world!'
res: ' Hello, world!'
tok: 23090 23 1079 12
src: ' this is 🦙.cpp'
res: ' this is 🦙.cpp'
tok: 414 304 204 182 237 111 231 25 29247
main : failed test: ' this is 🦙.cpp'
main : detokenized to: ' this is 🦙.cpp' instead of ' this is 🦙.cpp'
main : expected tokens: 414, 304, 3346, 111, 231, 25, 29247,
main : got tokens: 414, 304, 204, 182, 237, 111, 231, 25, 29247,
src: 'Hello'
res: 'Hello'
tok: 9856
src: 'Hello World'
res: 'Hello World'
tok: 9856 2889
src: 'Hello world'
res: 'Hello world'
tok: 9856 1079
src: 'Hello, world!'
res: 'Hello, world!'
tok: 9856 23 1079 12
src: 'w048 7tuijk dsdfhu'
res: 'w048 7tuijk dsdfhu'
tok: 98 55866 204 34 16682 7149 36190 6869 11481
src: 'нещо на Български'
res: 'нещо на Български'
tok: 150 133 6207 151 215 150 134 204 150 133 6279 204 150 223 151 216 15401 150 123 6279 9310 9649 16153 7795
main : failed test: 'нещо на Български'
main : detokenized to: 'нещо на Български' instead of 'нещо на Български'
main : expected tokens: 150, 133, 6207, 151, 215, 150, 134, 5052, 133, 6279, 5052, 223, 151, 216, 49679, 123, 53110, 47043, 7795,
main : got tokens: 150, 133, 6207, 151, 215, 150, 134, 204, 150, 133, 6279, 204, 150, 223, 151, 216, 15401, 150, 123, 6279, 9310, 9649, 16153, 7795,
src: 'កាន់តែពិសេសអាចខលចេញ'
res: 'កាន់តែពិសេសអាចខលចេញ'
tok: 167 236 206 167 236 126 167 236 225 167 237 217 167 236 221 167 237 208 167 236 228 167 236 127 167 236 237 167 237 207 167 236 237 167 236 107 167 236 126 167 236 211 167 236 207 167 236 233 167 236 211 167 237 207 167 236 215
main : failed test: 'កាន់តែពិសេសអាចខលចេញ'
main : detokenized to: 'កាន់តែពិសេសអាចខលចេញ' instead of 'កាន់តែពិសេសអាចខលចេញ'
main : expected tokens: 38154, 206, 38154, 126, 38154, 225, 167, 237, 217, 38154, 221, 167, 237, 208, 38154, 228, 38154, 127, 38154, 237, 167, 237, 207, 38154, 237, 38154, 107, 38154, 126, 38154, 211, 38154, 207, 38154, 233, 38154, 211, 167, 237, 207, 38154, 215,
main : got tokens: 167, 236, 206, 167, 236, 126, 167, 236, 225, 167, 237, 217, 167, 236, 221, 167, 237, 208, 167, 236, 228, 167, 236, 127, 167, 236, 237, 167, 237, 207, 167, 236, 237, 167, 236, 107, 167, 236, 126, 167, 236, 211, 167, 236, 207, 167, 236, 233, 167, 236, 211, 167, 237, 207, 167, 236, 215,
src: '🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
res: '🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
tok: 182 237 232 206 204 19 11003 20 204 182 237 230 126 168 206 219 182 237 218 116 13392 204 19 51831 732 63209 1741 7955 522 20 204 57196 204 19 7927 53360 325 504 701 946 10930 20
main : failed test: '🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
main : detokenized to: '🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)' instead of '🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
main : expected tokens: 2571, 232, 206, 204, 19, 11003, 20, 8196, 126, 283, 219, 48778, 116, 13392, 204, 19, 51831, 732, 63209, 1741, 7955, 522, 20, 22438, 211, 204, 19, 7927, 53360, 325, 504, 701, 946, 10930, 20,
main : got tokens: 182, 237, 232, 206, 204, 19, 11003, 20, 204, 182, 237, 230, 126, 168, 206, 219, 182, 237, 218, 116, 13392, 204, 19, 51831, 732, 63209, 1741, 7955, 522, 20, 204, 57196, 204, 19, 7927, 53360, 325, 504, 701, 946, 10930, 20,

However, the obtained string is sometimes constructed from a different set of tokens that does not match the correct set that we obtain with the OG tokenizer. This happens when Unicode characters are involved. All of this is expected and indicates that, for now, projects should avoid
I don't have an accurate measure of the timing for tokenizing input in […]. If there's slowness, then I'd notice on Android, unless my device is unaffected.
* tests : write a Python tokenizer test (wip)
* llama : prefix input text for tokenization with whitespace
* llama : distinguish pieces from decoded text + fix detokenization
* common : add comments
* examples : no longer manually add leading space when tokenizing
* tests : use Python to generate tokenizer tests for C++
* tests : add option to tokenize text files ggml-ci
* tests : add test-tokenizer-1.py
* llama.cpp : fix LF token
* hellaswag : move the concat space for clarity
* tests : add falcon tests (py + cpp, currently do not pass Unicode) ggml-ci
* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
So how do I tokenize partial text for Llama 2 now? I need
Interesting. This is an invariant I haven't looked into yet. For now:
Invariant 1 seems to hold for Meta's tokenizer; invariant 2 breaks as soon as spaces are involved.
If my quick check is correct, this invariant also doesn't hold for Meta's tokenizer in the presence of spaces.
I have found a way: I prefix the string with LF, then remove tokens up to and including the first LF token (see the sketch below).
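A minimal sketch of this workaround, assuming the std::string-based ::llama_tokenize(ctx, text, add_bos) helper from the common lib; the function name and the way the LF id is looked up are this example's own choices:

#include <algorithm>
#include <cassert>
#include <string>
#include <vector>
#include "common.h"
#include "llama.h"

// Tokenize partial text without the auto-inserted leading space leaking into the result.
static std::vector<llama_token> tokenize_partial(llama_context * ctx, const std::string & text) {
    // look up the id of the LF token once
    const std::vector<llama_token> lf = ::llama_tokenize(ctx, "\n", false);
    assert(!lf.empty());
    const llama_token lf_id = lf.back();

    // tokenize with an LF prefix, then drop everything up to and including the first LF token
    std::vector<llama_token> toks = ::llama_tokenize(ctx, "\n" + text, false);
    const auto it = std::find(toks.begin(), toks.end(), lf_id);
    if (it != toks.end()) {
        toks.erase(toks.begin(), it + 1);
    }
    return toks;
}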
By the way, the server documentation as of now says that this should be the default behavior:
TLDR: after this PR, devs should no longer add a leading space in user code to the text they need to tokenize, as we have been doing up until now.

In other words, if your code uses the llama.cpp library for tokenization and has been prepending a leading space before calling the tokenizer, it should now just pass the raw text as-is (see the sketch below).
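Here is a hedged before/after sketch, assuming the std::string-based ::llama_tokenize helper from common; the main.cpp / server.cpp snippets referenced further below are not reproduced verbatim here:

#include <string>
#include <vector>
#include "common.h"   // std::string-based llama_tokenize helper (assumed)
#include "llama.h"

static std::vector<llama_token> tokenize_prompt(llama_context * ctx, const std::string & prompt) {
    // before this PR, user code had to do something like:
    //   std::string fixed = " " + prompt;
    //   return ::llama_tokenize(ctx, fixed, true);

    // after this PR, just pass the raw text - the leading space is handled inside the library:
    return ::llama_tokenize(ctx, prompt, true);
}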
note: by "user" below I mean developers. I.e. users of
llama.cpp
libraryThis PR should not change the tokenization results from
master
- it just improves the tokenizer API and fixes the decoding logic and the reason for the leading space in the detokenized results. It adds a few more tokenizer tests and a helper Python script to compare the behaviour of the 2 implementations. Also confirmed thatwiki.test.raw
tokenizes to335688
tokens in exactly the same way as the OG Python tokenizer and fixed a minor bug inperplexity
tool.The main mistake we have been doing is to leave to the user to add the leading whitespace before calling
llama_tokenize()
. For example:llama.cpp/examples/main/main.cpp
Lines 198 to 201 in 04f4b1e
llama.cpp/examples/server/server.cpp
Lines 288 to 290 in 04f4b1e
llama.cpp/examples/server/server.cpp
Lines 310 to 314 in 04f4b1e
This leading space is indeed needed, since without it we do not get the same results as the OG Python tokenizer. It does not matter if the provided text has one, more, or no leading spaces - we must always prefix it with a leading space. Therefore, this has to be part of the internal implementation to avoid the user (developer) forgetting to do it:

llama.cpp/llama.cpp, lines 3391 to 3398 in e4324cb
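As a rough illustration of that rule (not the actual llama.cpp code at the lines above), the idea is that the library itself prepends the space before running the SPM tokenizer:

#include <string>
#include <vector>
#include "llama.h"

// Conceptual sketch only: the space prefix is added inside the library, never by the caller.
static std::vector<llama_token> spm_tokenize_sketch(const std::string & raw_text, bool add_bos) {
    std::vector<llama_token> output;
    if (add_bos) {
        output.push_back(1); // LLaMA BOS id
    }
    if (raw_text.empty()) {
        return output;       // '' -> [1] with bos, [] without (see the empty-string note further below)
    }
    const std::string text = " " + raw_text; // always exactly one extra leading space
    // ... run the SPM tokenizer on `text` and append the resulting ids to `output` ...
    (void) text;
    return output;
}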
For example, we have technically always had a mistake in the perplexity tool, because we haven't added the whitespace there:

llama.cpp/examples/perplexity/perplexity.cpp, lines 190 to 197 in 04f4b1e

This shouldn't lead to any dramatic changes in the results, but it's better to have it the right way. With the current PR I verified that the tokens produced by llama.cpp match exactly the ones from Python for wiki.test.raw.
I've also updated the C-style API with a better name for the de-tokenizer function: llama_token_to_str -> llama_token_to_piece:

llama.cpp/llama.h, lines 384 to 393 in e4324cb
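A small usage sketch of the renamed de-tokenizer, going through an std::string convenience wrapper assumed to be provided by the common lib (the raw C API is at the llama.h lines above):

#include <cstdio>
#include <string>
#include <vector>
#include "common.h"
#include "llama.h"

// Print each token id together with its piece (formerly obtained via llama_token_to_str).
static void print_pieces(llama_context * ctx, const std::vector<llama_token> & tokens) {
    for (const llama_token tok : tokens) {
        const std::string piece = llama_token_to_piece(ctx, tok); // common-lib wrapper (assumed)
        printf("%6d -> '%s'\n", tok, piece.c_str());
    }
}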
The common lib now provides a llama_detokenize() function that correctly detokenizes a vector of tokens into text, without adding a leading space at the start:

llama.cpp/common/common.h, lines 119 to 137 in e4324cb

The logic is that we have to drop the leading space of the first non-BOS token when detokenizing:

llama.cpp/common/common.cpp, lines 750 to 768 in e4324cb
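A hedged sketch of that rule (the real implementation is at the common.cpp lines above, and the final commit splits it into llama_detokenize_spm / llama_detokenize_bpe; the llama_token_to_piece wrapper and llama_token_bos are assumed helpers):

#include <string>
#include <vector>
#include "common.h"
#include "llama.h"

// Drop the leading space of the first non-BOS token when detokenizing.
static std::string detokenize_sketch(llama_context * ctx, const std::vector<llama_token> & tokens) {
    std::string text;
    bool first = true;
    for (const llama_token tok : tokens) {
        if (tok == llama_token_bos(ctx)) {
            continue; // BOS contributes no text
        }
        std::string piece = llama_token_to_piece(ctx, tok);
        if (first && !piece.empty() && piece[0] == ' ') {
            piece.erase(0, 1); // the space auto-inserted by the tokenizer
        }
        first = false;
        text += piece;
    }
    return text;
}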
This function should behave exactly the same as Python's tokenizer.decode().

Notice that in main we still get a leading space inserted when we print out the input prompt. This is because we decode the generated text token by token, i.e. piece by piece, so we don't have the information which token is first. We could add state to main.cpp to track this, but I think it is not worth it.

Another small bug fix is that we now correctly tokenize the empty string '' to [1] with bos == true and to [] with bos == false.
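A quick check of that behaviour, again using the assumed common-lib helpers (llama_token_bos and the std::string-based ::llama_tokenize):

#include <cassert>
#include <string>
#include <vector>
#include "common.h"
#include "llama.h"

// '' -> [1] with bos == true, [] with bos == false (as described above).
static void check_empty_string(llama_context * ctx) {
    const std::vector<llama_token> with_bos    = ::llama_tokenize(ctx, "", /*add_bos =*/ true);
    const std::vector<llama_token> without_bos = ::llama_tokenize(ctx, "", /*add_bos =*/ false);
    assert(with_bos.size() == 1 && with_bos[0] == llama_token_bos(ctx));
    assert(without_bos.empty());
}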
Please give this branch a thorough test to make sure I haven't made some mistake, and let's hope we have finally resolved all tokenizer related issues :)