Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Or Feature? BPE Tokenization mutates whitespaces into double-whitespace tokens when add_prefix_space is true (default) #8023

Closed
cmp-nct opened this issue Jun 20, 2024 · 2 comments
Labels
bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) stale

Comments

@cmp-nct
Copy link
Contributor

cmp-nct commented Jun 20, 2024

What happened?

This is a bit discussed here already: #7938
<|assistant|>

32001 -> '<|assistant|>'
   259 -> '  '

Also <|assistant|>\n:

32001 -> '<|assistant|>'
29871 -> ' '
    13 -> '
'

What happens is that the single whitespace, that follows a special token is mutated into a double-whitespace token (259) because add_prefix_space is triggered in llama.cpp when a special token is encountered.

In the second example the template actually wants a \n after assistant, however the special behavior sneaks a space in between.

Is this intended behavior / correct ?

When running PHI3 and asking for a generation after <|assistant|>, phi3 is adamant in responding with a whitespace or a combination token that starts with a whitespace.
When disabling add_prefix_whitespace and adding a \n after assistant, this issue is resolved and phi responds right away with normal text.

Name and Version

ba58993

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

@cmp-nct cmp-nct added bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) labels Jun 20, 2024
@cmp-nct cmp-nct changed the title Bug: Or Feature? Tokenization mutates whitespaces into double-whitespace tokens when add_prefix_space is true (default) Bug: Or Feature? BPE Tokenization mutates whitespaces into double-whitespace tokens when add_prefix_space is true (default) Jun 20, 2024
@jaime-m-p jaime-m-p mentioned this issue Jun 20, 2024
4 tasks
@github-actions github-actions bot added the stale label Jul 21, 2024
@davidgxue
Copy link

Hey @cmp-nct were you able to find a solution to this? I think I am running into the same issue where my finetuning data does not have the space in front at all, but every response would always come back with a whitespace prefixed (or right after <|assistant|>)

@github-actions github-actions bot removed the stale label Jul 31, 2024
@github-actions github-actions bot added the stale label Aug 30, 2024
Copy link
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) stale
Projects
None yet
Development

No branches or pull requests

2 participants