Bug: Or Feature? BPE Tokenization mutates whitespaces into double-whitespace tokens when add_prefix_space is true (default) #8023
Labels
bug-unconfirmed
low severity
stale
What happened?
This has already been discussed a bit in #7938.
<|assistant|>
Also
<|assistant|>\n
What happens is that the single whitespace that follows a special token is turned into a double-whitespace token (259), because add_prefix_space is triggered in llama.cpp when a special token is encountered.
In the second example the template actually wants a \n directly after <|assistant|>, but this special behavior sneaks a space in between.
Is this the intended behavior, and is it correct?
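To make the effect concrete, here is a minimal Python sketch. It is a toy model and not llama.cpp's actual code: the function names and the special-token list are hypothetical, and it only models the one step described above, a space being prepended to the text fragment that follows a special token.

```python
import re

# Hypothetical special-token list, for illustration only.
SPECIAL_TOKENS = ["<|user|>", "<|assistant|>", "<|end|>"]


def split_on_special(text):
    """Split text into (fragment, is_special) pairs, keeping the special tokens."""
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    parts = re.split(pattern, text)
    return [(p, p in SPECIAL_TOKENS) for p in parts if p]


def pre_tokenize(text, add_prefix_space=True):
    """Toy model: prepend a space to each plain-text fragment that
    directly follows a special token (assumed behavior, see lead-in)."""
    out = []
    prev_special = False
    for frag, is_special in split_on_special(text):
        if not is_special and add_prefix_space and prev_special:
            frag = " " + frag
        out.append(frag)
        prev_special = is_special
    return out


# The single space after the special token becomes two spaces.
print(pre_tokenize("<|assistant|> Hello"))
# A space is inserted in front of the newline the template asked for.
print(pre_tokenize("<|assistant|>\nHello"))
# With the flag off, the newline directly follows the special token.
print(pre_tokenize("<|assistant|>\nHello", add_prefix_space=False))
```

In this sketch the fragment after the special token is what a real tokenizer would then encode, so a leading double space would plausibly map to a double-whitespace token such as 259.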
When running Phi-3 and asking for a generation after <|assistant|>, Phi-3 is adamant about responding with a whitespace or a combination token that starts with a whitespace. When add_prefix_space is disabled and a \n is added after <|assistant|>, the issue is resolved and Phi-3 responds right away with normal text.
Name and Version
ba58993
What operating system are you seeing the problem on?
Windows
Relevant log output
No response