
Bug: Phi-3 Tokenizer Adds Whitespaces on re-tokenization (which invalidates KV-cache) #7938

Closed
Harsha-Nori opened this issue Jun 14, 2024 · 10 comments
Labels
bug-unconfirmed · medium severity (Used to report medium severity bugs in llama.cpp, e.g. malfunctioning features but still usable) · stale

Comments

@Harsha-Nori

What happened?

The llama.cpp tokenizer for Phi-3 has odd behavior: re-tokenizing the same text over and over keeps adding whitespace to the first non-BOS token. This causes several problems:

  1. It doesn't match the original tokenizer behavior from Hugging Face Transformers.
  2. Re-processing the same text invalidates the KV-cache, forcing another prompt fill of all the input tokens.

I maintain the Guidance library (https://github.com/guidance-ai/guidance), where we often need to re-tokenize inputs after adding any templated/deterministic text from the user. This causes a significant performance regression for Phi-3 usage via llama.cpp in Guidance whenever we go through this cycle :(. I believe pretty much all constrained generation libraries are likely affected by this too.

Here's an example of the bug in action (using the llama-cpp-python bindings, which are very thin wrappers around the tokenizer).

The model I'm using: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf

import llama_cpp
print(llama_cpp.__version__) # '0.2.78' -- filing here because it seems like it's a lower level bug in llama.cpp

model = llama_cpp.Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", logits_all=True)
tokenizer = llama_cpp.LlamaTokenizer(model)

test_str = "Hi I am a hippo"
test_tokens = tokenizer.tokenize(test_str.encode("utf-8")) # [1, 6324, 306, 626, 263, 7251, 9759]

retokenized = b''.join([tokenizer.detokenize([i]) for i in test_tokens]) # b' Hi I am a hippo'
retokenized_tokens = tokenizer.tokenize(retokenized) # [1, 29871, 6324, 306, 626, 263, 7251, 9759]

retokenized2 = b''.join([tokenizer.detokenize([i]) for i in retokenized_tokens]) # b'  Hi I am a hippo'

Note how the token at index 1 accumulates a growing run of leading whitespace when going through the tokenize/detokenize cycle. Repeating the process keeps increasing the whitespace:

" Hi I am a hippo" ->
"  Hi I am a hippo" ->
"   Hi I am a hippo" ->
"    Hi I am a hippo" ...

This is the heart of the issue, and doesn't happen with the original tokenizer implementation in Transformers.
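
For comparison, here is a minimal sketch of the same round trip through the Hugging Face tokenizer (assuming transformers is installed and the microsoft/Phi-3-mini-4k-instruct repo is reachable); the re-encoded ids are expected to stay stable instead of growing an extra leading-space token:

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

text = "Hi I am a hippo"
tokens = hf_tokenizer.encode(text)  # BOS handling depends on the tokenizer config
decoded = hf_tokenizer.decode(tokens, skip_special_tokens=True)
retokens = hf_tokenizer.encode(decoded)

print(tokens)     # first pass
print(retokens)   # expected to match the first pass (no accumulating leading-space token)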

Name and Version

llama-cpp-python is using this commit for their latest release: fd5ea0f

What operating system are you seeing the problem on?

Linux, Mac, Windows

Relevant log output

No response

@Harsha-Nori added the bug-unconfirmed and medium severity labels on Jun 14, 2024
@riedgar-ms

I think that this is happening on tokenisation - if I try to tokenise a single space ' ', I get back token 259 (when using 'mistral-7b-instruct-v0.2.Q8_0.gguf' specifically), but that token is a double space.

@riedgar-ms

I believe this is the same issue I've raised in:
abetlen/llama-cpp-python#1531

@jaime-m-p
Collaborator

I'm actually trying to fix similar issues.
Let me check and see if I can fix this.

@riedgar-ms

We have a workaround based on prepending a known byte that is rather unlikely to appear in a real string or token.
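
A rough sketch of that kind of workaround, written against the llama-cpp-python tokenizer from the example above (illustrative only, not the actual Guidance implementation; the sentinel byte and the add_bos keyword are assumptions):

SENTINEL = b"\x02"  # assumption: any byte that never appears in real prompts or tokens

def tokenize_without_prefix_space(tokenizer, text: bytes):
    # Tokenize the sentinel alone and the sentinel + text, then drop the sentinel's
    # tokens; the injected leading space attaches to the sentinel, not to the text.
    sentinel_tokens = tokenizer.tokenize(SENTINEL, add_bos=False)
    padded_tokens = tokenizer.tokenize(SENTINEL + text, add_bos=False)
    assert padded_tokens[:len(sentinel_tokens)] == sentinel_tokens
    return padded_tokens[len(sentinel_tokens):]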

@cmp-nct
Contributor

cmp-nct commented Jun 20, 2024

Just stumbled into this error.
@riedgar-ms correctly pointed out that a single space is wrongly encoded as double space
<|assistant|>

32001 -> '<|assistant|>'
   259 -> '  '

Also <|assistant|>\n is bizarrely tokenized:

32001 -> '<|assistant|>'
29871 -> ' '
   13 -> '\n'

29871 IS actually the single-whitespace token, but it has no place there.

I stumbled upon this issue because Phi-3 generates unexpected whitespace at the beginning of the response.
The template is supposed to end with <|assistant|>, yet in 80-90% of all cases the response starts with a whitespace.
I tried to compensate for that by adding a whitespace myself, but this triggered the double-space tokenization error and Phi-3 added another whitespace.
Phi-3 is quite adamant: when asked to write a single #, it responds with # as a single token. That looks like a deeper fine-tune issue to me. The model seems aware that # has a whitespace at the start despite being a single token, and it REALLY wants to start with whitespace.

The cause is in llama.cpp:

if (vocab.add_space_prefix) {
    if (!output.size() || is_prev_special) {  // prefix with space if first token
        raw_text = " " + raw_text;
    }
}

This snippet is part of llama_tokenize_internal(): if add_space_prefix is true (which is the case here), it squeezes a whitespace into the start of every encoding as well as after every special token.
Further: the Phi-3 tokenizer_config.json does not contain add_prefix_space, which makes it default to true in llama.cpp.

This appears questionable to me. Is that really intended behavior? It appears to be wrong for Phi-3, given how strangely it reacts.

Someone familiar with llama tokenization and this specific feature could maybe add some input. A "prefix space" appears to be a common thing among many tokenizers, but adding a space after every single fine-tune special token (end, assistant, user, system) looks really odd to me - especially when it turns a single whitespace into a double-whitespace token.
I'll add an issue for that; maybe it's a bug.
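
A small probe to reproduce this, reusing the model object from the llama-cpp-python example at the top of the issue (a sketch; the ids in the comments are the ones reported in this thread for Phi-3 and may differ for other models):

space_tokens = model.tokenize(b" ", add_bos=False)
print(space_tokens)  # reported above: [259], i.e. the double-space token '  '

prompt_tokens = model.tokenize(b"<|assistant|>\n", add_bos=False, special=True)
for tok in prompt_tokens:
    print(tok, model.detokenize([tok]))
# reported above: 32001 '<|assistant|>', 29871 ' ', 13 '\n'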

@Haus1

Haus1 commented Jun 23, 2024

If I'm not mistaken the reason for the prefix is because most models don't interpret the initial token correctly, so this was used to pad it. The value of the first token shouldn't matter so long as the model was trained to ignore it. Shouldn't being the key word there as I've never actually tested it.

GPT2 always has largest attention on first token?

@cmp-nct
Contributor

cmp-nct commented Jun 23, 2024

If I'm not mistaken the reason for the prefix is because most models don't interpret the initial token correctly, so this was used to pad it. The value of the first token shouldn't matter so long as the model was trained to ignore it. Shouldn't being the key word there as I've never actually tested it.

GPT2 always has largest attention on first token?

There is a lot more than that going on. Whitespace prefixes are added to all tokens after any special token - if none is present already.
In addition, newline tokens get removed if they come after special tokens.
So there are a lot of modifications going on in the Python tokenizer; I doubt that most original developers actually know about that.
If you look at some fine-tunes, they use newlines after special tokens - however, the Python tokenizer would just remove those newlines.

@JhonDan1999

JhonDan1999 commented Jul 16, 2024

If I'm not mistaken the reason for the prefix is because most models don't interpret the initial token correctly, so this was used to pad it. The value of the first token shouldn't matter so long as the model was trained to ignore it. Shouldn't being the key word there as I've never actually tested it.
GPT2 always has largest attention on first token?

There is a lot more than that going on. Whitespace prefixes are added to all tokens after any special token - if none is present already. In addition, newline tokens get removed if they come after special tokens. So there are a lot of modifications going on in the Python tokenizer; I doubt that most original developers actually know about that. If you look at some fine-tunes, they use newlines after special tokens - however, the Python tokenizer would just remove those newlines.

@cmp-nct What you describe is exactly the issue I'm facing

When I feed a text block that contains newlines into the Phi-3 tokeniser, the newlines are removed after decoding. Here is an example of the text I am working with:
Input Text to the tokenizer:

<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|> 

After tokenizer.decode I got this:

<|system|>
You are a helpful assistant.<|end|><|user|>
How to explain Internet for a medieval knight?<|end|><|assistant|> 

Can you help me with this issue? And does it affect the performance of the model if I proceed with this?

@jaime-m-p
Collaborator

@JhonDan1999

The Phi-3 tokenizer removes all whitespace (spaces, newlines, tabs, etc.) after these special tokens.
See the rstrip attributes in ./models/tokenizers/phi-3/tokenizer.json:

    { "content": "<|system|>",     "lstrip": false, "rstrip": true },
    { "content": "<|user|>",       "lstrip": false, "rstrip": true },
    { "content": "<|assistant|>",  "lstrip": false, "rstrip": true },
    { "content": "<|end|>",        "lstrip": false, "rstrip": true },

Testing:

from transformers import AutoTokenizer

dir_tokenizer = "./models/tokenizers/phi-3/"
tokenizer = AutoTokenizer.from_pretrained(dir_tokenizer)

text1 = "<|system|>\nFoo bar<|end|>\n<|user|>Baz qux?<|end|>\n<|assistant|>"
text2 = "<|system|>\n \n \tFoo bar<|end|>\n \n \t<|user|>\n \n \tBaz qux?<|end|>\n \n \t<|assistant|>\n \n \t"

tokens1 = tokenizer.encode( text1 )
tokens2 = tokenizer.encode( text2 )
retext1 = tokenizer.decode( tokens1 )
retext2 = tokenizer.decode( tokens2 )

print( repr(text1) )
print( repr(text2) )
print( repr(tokens1) )
print( repr(tokens2) )
print( repr(retext1) )
print( repr(retext2) )

Output:

'<|system|>\nFoo bar<|end|>\n<|user|>Baz qux?<|end|>\n<|assistant|>'
'<|system|>\n \n \tFoo bar<|end|>\n \n \t<|user|>\n \n \tBaz qux?<|end|>\n \n \t<|assistant|>\n \n \t'
[32006, 13679, 2594, 32007, 32010, 350, 834, 439, 29916, 29973, 32007, 32001]
[32006, 13679, 2594, 32007, 32010, 350, 834, 439, 29916, 29973, 32007, 32001]
'<|system|> Foo bar<|end|><|user|> Baz qux?<|end|><|assistant|>'
'<|system|> Foo bar<|end|><|user|> Baz qux?<|end|><|assistant|>'

As you can see, text1 and text2 produce the same tokens and the same detokenized text.
So I guess there is no performance difference.

But I'm not sure how you are loading the tokenizer, since you are getting \n after <|system|>.
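
Those flags can also be inspected from Python instead of reading tokenizer.json by hand; a minimal sketch, assuming a recent transformers version where added_tokens_decoder maps token ids to AddedToken objects, reusing the tokenizer directory from the test above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/tokenizers/phi-3/")
for token_id, added in tokenizer.added_tokens_decoder.items():
    if added.content in ("<|system|>", "<|user|>", "<|assistant|>", "<|end|>"):
        # rstrip=True is what makes the tokenizer eat whitespace after these tokens
        print(token_id, added.content, "lstrip:", added.lstrip, "rstrip:", added.rstrip)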

Contributor

github-actions bot commented Sep 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Sep 2, 2024