Bug: Phi-3 Tokenizer Adds Whitespaces on re-tokenization (which invalidates KV-cache) #7938
Comments
I think this is happening during tokenisation; it shows up even when I try to tokenise a single space.
I believe this is the same issue I've raised in:
I'm actually trying to fix similar issues.
We have a workaround, based on prepending a known byte which is rather unlikely to appear in a real string or token.
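For illustration, a minimal sketch of that kind of workaround (the sentinel byte, helper name, and token-stripping logic here are my assumptions, not the actual implementation):

```python
# Sketch only: prepend a sentinel so the tokenizer's automatic prefix space
# attaches to the sentinel rather than to the real text, then drop it again.
SENTINEL = "\x02"  # a byte that is very unlikely to occur in real input

def encode_without_prefix_space(tokenizer, text: str) -> list[int]:
    # Hypothetical helper; assumes the sentinel tokenizes to a stable,
    # self-contained token sequence that does not merge with `text`.
    sentinel_tokens = tokenizer.encode(SENTINEL, add_special_tokens=False)
    tokens = tokenizer.encode(SENTINEL + text, add_special_tokens=False)
    return tokens[len(sentinel_tokens):]
```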
Just stumbled into this error.
Also: 29871 IS actually the single-whitespace token, but it had no place there. I stumbled upon this issue because Phi-3 generates unexpected whitespace at the beginning of the response. The cause is in llama.cpp:
The relevant snippet is part of llama_tokenize_internal(), and its behaviour appears questionable to me. Is that really intended? It seems wrong for Phi-3, given how strangely the model reacts. Someone familiar with llama tokenization and this specific feature could maybe add some input. A "prefix space" is a common thing among many tokenizers, but here it adds a space after every single finetune special token (end, assistant, user, system), which looks really odd, especially since it turns a single-whitespace token into a double whitespace.
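To make the discussion concrete, here is a rough paraphrase in Python of the behaviour being described (this is not the llama.cpp source; the names and structure are illustrative):

```python
# Illustrative paraphrase: an SPM-style "prefix space" prepends ' ' to each
# raw-text fragment, including fragments that follow a special token. If the
# fragment already begins with whitespace, the space is added anyway, so a
# single space becomes a double space on every encode/decode round trip.
def preprocess_fragment(raw_text: str, add_prefix_space: bool = True) -> str:
    if add_prefix_space:
        raw_text = " " + raw_text
    return raw_text

print(repr(preprocess_fragment(" ")))   # '  '  -- a single space is doubled
print(repr(preprocess_fragment("Hi")))  # ' Hi' -- a space is injected after a special token
```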
If I'm not mistaken, the reason for the prefix is that most models don't interpret the initial token correctly, so this was used to pad it. The value of the first token shouldn't matter as long as the model was trained to ignore it. "Shouldn't" being the key word there, as I've never actually tested it.
There is a lot more than that going on: whitespace prefixes are added to all tokens that follow any special token, if none are present already.
@cmp-nct What you describe is exactly the issue I'm facing. When I feed a text block that contains new lines into the Phi-3 tokeniser, the new lines are removed after decoding. Here is an example of the text I am working with:

After tokenizer.decode I got this:
Can you help me with this issue? And is it affecting the performance of the model if I proceed like this?
The Phi-3 tokenizer removes all whitespace (spaces, new lines, tabs, etc.) after these special tokens:

```json
{ "content": "<|system|>", "lstrip": false, "rstrip": true },
{ "content": "<|user|>", "lstrip": false, "rstrip": true },
{ "content": "<|assistant|>", "lstrip": false, "rstrip": true },
{ "content": "<|end|>", "lstrip": false, "rstrip": true },
```

Testing:

```python
from transformers import AutoTokenizer

dir_tokenizer = "./models/tokenizers/phi-3/"
tokenizer = AutoTokenizer.from_pretrained(dir_tokenizer)
text1 = "<|system|>\nFoo bar<|end|>\n<|user|>Baz qux?<|end|>\n<|assistant|>"
text2 = "<|system|>\n \n \tFoo bar<|end|>\n \n \t<|user|>\n \n \tBaz qux?<|end|>\n \n \t<|assistant|>\n \n \t"
tokens1 = tokenizer.encode( text1 )
tokens2 = tokenizer.encode( text2 )
retext1 = tokenizer.decode( tokens1 )
retext2 = tokenizer.decode( tokens2 )
print( repr(text1) )
print( repr(text2) )
print( repr(tokens1) )
print( repr(tokens2) )
print( repr(retext1) )
print( repr(retext2) )
```

Output:

```
'<|system|>\nFoo bar<|end|>\n<|user|>Baz qux?<|end|>\n<|assistant|>'
'<|system|>\n \n \tFoo bar<|end|>\n \n \t<|user|>\n \n \tBaz qux?<|end|>\n \n \t<|assistant|>\n \n \t'
[32006, 13679, 2594, 32007, 32010, 350, 834, 439, 29916, 29973, 32007, 32001]
[32006, 13679, 2594, 32007, 32010, 350, 834, 439, 29916, 29973, 32007, 32001]
'<|system|> Foo bar<|end|><|user|> Baz qux?<|end|><|assistant|>'
'<|system|> Foo bar<|end|><|user|> Baz qux?<|end|><|assistant|>'
```

As you can see, both inputs produce identical token IDs and identical decoded text: the rstrip behaviour strips all whitespace that follows the special tokens. But I'm not sure how you are loading the tokenizer.
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
The llama.cpp tokenizer for Phi-3 has odd behavior: re-tokenizing the same text over and over keeps adding whitespace to the first non-BOS token. This has several issues, not least that it invalidates the KV cache and that it diverges from the Transformers implementation.
I maintain the Guidance library (https://github.com/guidance-ai/guidance), where we often need to re-tokenize inputs after adding any templated/deterministic text from the user. This is causing a significant performance regression for Phi-3 usage via llama.cpp in Guidance whenever we go through this cycle :(. I believe pretty much all constrained-generation libraries would likely be affected by this too.
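To spell out why this hurts: a KV cache can only be reused for the longest shared prefix between the cached tokens and the re-tokenized prompt, so an extra whitespace token at index 1 forces recomputation of nearly everything. A toy illustration (the token IDs are hypothetical):

```python
# Toy example with made-up token IDs: the cache survives only up to the
# first mismatch, which here is immediately after BOS.
cached = [1, 6324, 306, 626]       # tokens from the previous pass
retok  = [1, 259, 6324, 306, 626]  # same text re-tokenized; extra space token

shared = 0
while shared < min(len(cached), len(retok)) and cached[shared] == retok[shared]:
    shared += 1
print(shared)  # 1 -> only the BOS token's KV entries can be reused
```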
Here's an example of the bug in action (using the llama-cpp-python bindings, which are very thin wrappers around the tokenizer).
The model I'm using: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf
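A minimal sketch of the tokenize/detokenize cycle (loading with `vocab_only` and the local model path are assumptions on my part):

```python
from llama_cpp import Llama

# vocab_only loads just the tokenizer data, which is all we need here.
llm = Llama(model_path="./Phi-3-mini-4k-instruct-q4.gguf", vocab_only=True)

text = b"Hi I am a hippo"
for _ in range(4):
    tokens = llm.tokenize(text)  # adds BOS by default
    text = llm.detokenize(tokens)
    print(tokens[:3], text)      # the whitespace on the token at index 1 keeps growing
```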
Note how the token at index 1 has continually growing whitespace when going through the tokenize/detokenize cycle. Repeating this process keeps increasing the whitespace:

" Hi I am a hippo" -> "  Hi I am a hippo" -> "   Hi I am a hippo" -> "    Hi I am a hippo" -> ...

This is the heart of the issue, and it doesn't happen with the original tokenizer implementation in Transformers.
Name and Version
llama-cpp-python is using this commit for their latest release: fd5ea0f
What operating system are you seeing the problem on?
Linux, Mac, Windows
Relevant log output
No response