Add phi 3 chat template #6857
Conversation
#6852 was merged into master! @tristandruyen
> Only issue I've noticed is that it seems to output some extra <|end|> tokens at the end of responses on the /v1/chat/completions endpoint.
The rendering of special tokens like <|end|> is now optional on the llama.cpp API level:
llama.h, lines 827 to 838 in 3fec68b:
// Token Id -> Piece.
// Uses the vocabulary in the provided context.
// Does not write null terminator to the buffer.
// User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.
// @param special If true, special tokens are rendered in the output.
LLAMA_API int32_t llama_token_to_piece(
    const struct llama_model * model,
                   llama_token token,
                          char * buf,
                      int32_t length,
                        bool special);
The examples in llama.cpp currently opt in to rendering these tokens, since I think it is useful to see them for debugging purposes. Projects using llama.cpp can decide whether to render them or not by setting the special argument of llama_token_to_piece accordingly.
At some point we can add CLI args to control this in the llama.cpp examples if necessary.
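For illustration, here is a minimal sketch (not code from this PR; the helper name and buffer handling are assumptions following the usual llama.cpp retry pattern) of how a project could suppress special tokens when detokenizing by passing special = false:

```cpp
#include <string>
#include <vector>
#include "llama.h"

// Hypothetical helper: convert a single token to text without rendering
// special tokens such as <|end|>, using the signature shown above.
static std::string token_to_text(const struct llama_model * model, llama_token token) {
    std::vector<char> buf(8, 0);
    int32_t n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), /*special=*/ false);
    if (n < 0) {
        // A negative return value means the buffer was too small; retry with the required size.
        buf.resize(-n);
        n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), /*special=*/ false);
    }
    return n > 0 ? std::string(buf.data(), n) : std::string();
}
```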
Please update the Wiki accordingly.
Using latest commit,
I'm having the exact same issue @isgallagher. Were you able to find a fix or workaround?
I ended up manually adding a ton of stop sequences, and it worked most of the time, but I kept running into weird edge cases that I had to add, so I gave up.
Damn, I'll try with stop sequences for now. Any idea what is happening exactly? Is it a llama.cpp issue or a model issue? Edit: I found a possible workaround. Have it purposely communicate in JSON using
@@ -49,6 +49,8 @@ int main(void) {
"{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}",
// Llama-3
"{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
// Phi-3
"{{ bos_token }}{% for message in messages %}{{'<|' + message['role'] + '|>' + ' ' + message['content'] + '<|end|> ' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|> ' }}{% else %}{{ eos_token }}{% endif %}"
@tristandruyen This template uses a space character ' ' after each special token, but in your cpp version you used a new line "\n". Can you check if it is correct?
According to the Chat Format documented in the huggingface READMEs, there should be a \n after <|user|> and <|end|>:
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/README.md#chat-format
While they do not explicitly state that there should be a \n after <|assistant|>, the model nearly always generates a \n as the first token when it is left out...
> According to the Chat Format documented in the huggingface READMEs, there should be a \n after <|user|> and <|end|>

It's best to look at the code directly rather than the README. Sometimes human copy-paste can introduce mistakes.
On the same repo that you linked to, you can see the jinja template code here:
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/tokenizer_config.json#L119
This file has '<|user|>' + '\n', which confirms that there should be a new line after special tokens.
However, that does not explain why your jinja version does not have it. Remember, jinja templates are deterministic, meaning your cpp code needs to follow 100% what is coded inside the template. This means we can fix this PR in one of 2 ways: (1) correct the jinja template to have \n, or (2) correct the cpp implementation to output ' ' instead of a new line.
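For illustration, here is a self-contained sketch (the struct and function names are made up, not the PR's actual code) of what the cpp side looks like when it emits a \n after each special token, matching the '<|' + role + '|>' + '\n' form from tokenizer_config.json:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Illustrative message type; llama.cpp uses its own llama_chat_message struct.
struct chat_msg { std::string role; std::string content; };

// Render a Phi-3 style prompt with a newline after every special token.
static std::string format_phi3(const std::vector<chat_msg> & chat, bool add_generation_prompt) {
    std::stringstream ss;
    for (const auto & m : chat) {
        ss << "<|" << m.role << "|>\n" << m.content << "<|end|>\n";
    }
    if (add_generation_prompt) {
        ss << "<|assistant|>\n"; // prompt the model to answer
    }
    return ss.str();
}
```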
Seems there are even more versions of the template. I guess I had an outdated tokenizer_config.json when making my GGUFs, because I just took the jinja template from the gguf metadata of my recent phi3 quants, which should just be a copy of what was in the tokenizer_config.json, shouldn't it?
I think I found the reason: the huggingface GUI for displaying GGUF metadata seems to replace \n with whitespace for some weird reason...
I'll update the template with the correct one...
Edit: there are indeed 2 versions, but they both use a new line:
- One with '<|' + message['role'] + '|>' + '\n': https://huggingface.co/microsoft/Phi-3-small-128k-instruct/blob/main/tokenizer_config.json#L10
- And another one with an explicit <|user|>: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/tokenizer_config.json#L119
I think it's best to somehow scan all templates in the existing phi-3 family, then make a fix PR. Without careful research, I'm afraid there will be more confusion in the future.
> I think I found the reason: the huggingface GUI for displaying GGUF metadata seems to replace \n with whitespace for some weird reason...
I'd agree that there is a bad design in the jinja templates (in general). Since '\n' is wrapped inside a json string, it should be converted to '\\n'. But for some reason, jinja doesn't care about this mistake (that's why it gets converted to a space in the gguf viewer on huggingface).
Unfortunately, this is not something we can fix on our side.
I updated the PR :)
@tristandruyen Please have a look at the complete list of phi-3 templates: https://gist.github.com/ngxson/38d8d89fb5d856187bcc70e78b1cc4f5
Thanks for the list.
So it seems to me like these are 4 different ways to get basically the same result.
There are some differences, but AFAIK they should not matter here:
- the bos token for small & mini should be handled by tokenizer.ggml.add_bos_token
- the eos token for mini should be handled likewise by add_eos_token
So the current C++ code seems to do the correct thing in my understanding.
I think adding all variants as tests would be good; the current matching via <|assistant|> and <|end|> should work for all of them.
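As a sketch of the matching mentioned above (illustrative only; the function name is made up and this is not necessarily the exact check used in llama.cpp):

```cpp
#include <string>

// Treat a template as Phi-3 style if it contains both special markers.
static bool is_phi3_template(const std::string & tmpl) {
    return tmpl.find("<|assistant|>") != std::string::npos
        && tmpl.find("<|end|>")       != std::string::npos;
}
```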
This adds the phi 3 chat template.
Works mostly fine in my testing with the commits from #6851 cherry-picked for quantizing.
Only issue I've noticed is that it seems to output some extra <|end|> tokens at the end of responses on the /v1/chat/completions endpoint.
Although in my understanding that's a tokenization issue that should be solved in #6852 or #6851?