Add phi 3 chat template #6857

Merged
merged 2 commits into ggerganov:master from add-phi3-chat-template on Apr 24, 2024
Conversation

@tristandruyen (Contributor) commented Apr 23, 2024

This adds the phi 3 chat template.

It works mostly fine in my testing, with the commits from #6851 cherry-picked for quantizing.

The only issue I've noticed is that it seems to output some extra <|end|> tokens at the end of responses on the /v1/chat/completions endpoint. As I understand it, that's a tokenization issue that should be solved by #6852 or #6851?

@tristandruyen changed the title from "Add phi 3 chat template & tests" to "Add phi 3 chat template" on Apr 23, 2024
@lukestanley commented Apr 24, 2024

#6852 was merged into master! @tristandruyen

@ggerganov (Owner) left a comment

> The only issue I've noticed is that it seems to output some extra <|end|> tokens at the end of responses on the /v1/chat/completions endpoint.

The rendering of special tokens like <|end|> is now optional on the llama.cpp API level:

llama.cpp/llama.h, lines 827 to 838 at 3fec68b:

// Token Id -> Piece.
// Uses the vocabulary in the provided context.
// Does not write null terminator to the buffer.
// User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.
// @param special If true, special tokens are rendered in the output.
LLAMA_API int32_t llama_token_to_piece(
    const struct llama_model * model,
                  llama_token   token,
                         char * buf,
                      int32_t   length,
                         bool   special);

The examples in llama.cpp currently opt in to rendering these tokens, since I think it is useful to see them for debugging purposes. Projects using llama.cpp can decide whether to render them or not by setting the special argument of llama_token_to_piece accordingly.

At some point we can add CLI args to control this in the llama.cpp examples if necessary.
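
For illustration, a minimal sketch of how a downstream project might use that flag to hide special tokens such as <|end|> from user-facing output. The helper name detokenize_piece and the grow-buffer-on-negative-return pattern mirror how the llama.cpp examples call this function, but they are assumptions here, not part of this PR:

#include <algorithm>
#include <string>
#include <vector>

#include "llama.h"

// Hypothetical helper: convert a single token to text. Passing
// render_special = false hides special tokens such as <|end|>.
static std::string detokenize_piece(const struct llama_model * model,
                                    llama_token token,
                                    bool render_special) {
    std::vector<char> buf(8);
    int32_t n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), render_special);
    if (n < 0) {
        // A negative return value means the buffer was too small; -n is the required size.
        buf.resize(-n);
        n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), render_special);
    }
    return std::string(buf.data(), std::max<int32_t>(n, 0));
}

A server built on top of llama.cpp could call this with render_special = false for user-facing text and true for debug logs.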

@ggerganov (Owner) commented

Please update the Wiki accordingly

@ggerganov merged commit abd3314 into ggerganov:master on Apr 24, 2024
40 of 58 checks passed
@tristandruyen deleted the add-phi3-chat-template branch on April 24, 2024 at 12:11
@isgallagher commented

Using the latest commit with ./server -m Phi-3-mini-4k-instruct-q4.gguf --chat-template phi3, I am getting weirdness...

User: Hello
Llama: Hi there! I'm Llama, your friendly chatbot. How can I help you today?
<|assistant|> Hello! It's great to meet you. I'm here and ready to assist with any questions or tasks you have in mind. Just let me know what you need assistance with!

@AayushG159 commented

I'm having the exact same issue @isgallagher. Were you able to find a fix or workaround?

@isgallagher commented

I ended up manually adding a ton of stop sequences, and that worked most of the time, but I kept hitting weird edge cases that needed yet more stops, so I gave up.
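
For concreteness, a sketch of passing stop sequences through the /v1/chat/completions endpoint, assuming the server's OpenAI-compatible API honours the standard stop parameter (the port and stop strings are only examples):

# Sketch only: ask the server to stop generation at Phi-3 special tokens.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "stop": ["<|end|>", "<|assistant|>", "<|user|>"]
      }'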

@AayushG159 commented May 5, 2024

Damn, I'll try stop sequences for now. Any idea what is happening exactly? Is it a llama.cpp issue or a model issue?

Edit: I found a possible workaround: have it purposely communicate in JSON using --grammar-file .\llama.cpp\grammars\json.gbnf, and also use Locally Typical Sampling (where high values promote contextually coherent tokens) with --typical 0.9. A sketch of the invocation is below.
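
A sketch of what that might look like on the command line, using the main example where both flags are exposed (the prompt is illustrative, and whether the server binary accepts the same flags may depend on the version):

# Sketch: constrain output to JSON with a grammar and enable locally typical sampling.
./main -m Phi-3-mini-4k-instruct-q4.gguf \
    --grammar-file ./llama.cpp/grammars/json.gbnf \
    --typical 0.9 \
    -p "List three colors as a JSON array:"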

@@ -49,6 +49,8 @@ int main(void) {
"{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}",
// Llama-3
"{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
// Phi-3
"{{ bos_token }}{% for message in messages %}{{'<|' + message['role'] + '|>' + ' ' + message['content'] + '<|end|> ' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|> ' }}{% else %}{{ eos_token }}{% endif %}"
@ngxson (Collaborator) commented May 22, 2024

@tristandruyen This template uses a space character ' ' after each special token, but in your cpp version you used a newline "\n". Can you check which one is correct?

@tristandruyen (Contributor, Author) commented May 23, 2024

According to the Chat Format documented in the huggingface README, there should be a \n after <|user|> and <|end|>:

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/README.md#chat-format

@tristandruyen (Contributor, Author) commented May 23, 2024

While they do not explicitly state that there should be a \n after <|assistant|>, the model nearly always generates a \n as its first token when it is left out...

@ngxson (Collaborator) commented

> According to the Chat Format documented in the huggingface README, there should be a \n after <|user|> and <|end|>

It's best to look at the code directly rather than the README. Human copy-paste can sometimes introduce mistakes.

On the same repo that you linked to, you can see the jinja template code here:

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/tokenizer_config.json#L119

This file has '<|user|>' + '\n', which confirms that there should be a newline after special tokens.

However, that does not explain why your jinja version does not have it. Remember, jinja templates are deterministic, meaning your cpp code needs to follow exactly what is coded inside the template. This means we can fix this in one of two ways: (1) correct the jinja template to have \n, or (2) correct the cpp implementation to output ' ' instead of a newline.
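
For concreteness, the newline variant would format a conversation roughly like this; a minimal sketch with simplified message structs, not the exact llama_chat_apply_template code:

#include <sstream>
#include <string>
#include <vector>

struct chat_msg { std::string role; std::string content; };

// Sketch of Phi-3 formatting with a newline after each special token,
// matching '<|user|>' + '\n' ... '<|end|>' + '\n' in tokenizer_config.json.
static std::string format_phi3(const std::vector<chat_msg> & msgs, bool add_generation_prompt) {
    std::ostringstream ss;
    for (const auto & m : msgs) {
        ss << "<|" << m.role << "|>\n" << m.content << "<|end|>\n";
    }
    if (add_generation_prompt) {
        ss << "<|assistant|>\n";
    }
    return ss.str();
}

With this, a single user turn renders as "<|user|>\nHello<|end|>\n<|assistant|>\n", whereas the space variant would put ' ' where the newlines are.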

@tristandruyen (Contributor, Author) commented May 23, 2024

It seems there are even more versions of the template. I guess I had an outdated tokenizer_config.json when making my GGUFs, because I just took the jinja template from the GGUF metadata of my recent phi3 quants, which should just be a copy of what was in tokenizer_config.json, shouldn't it?

I think I found the reason: the huggingface GUI for displaying GGUF metadata seems to replace \n with a space for some reason. I'll update the template with the correct one.

@ngxson (Collaborator) commented

Edit: there are indeed 2 versions, but they all use a newline.

I think it's best to scan all templates in the existing phi-3 family and then make a fix PR. Without careful research, I suspect there will only be more confusion in the future.

@ngxson (Collaborator) commented

> I think I found the reason: the huggingface GUI for displaying GGUF metadata seems to replace \n with a space for some reason.

I'd agree that there is some bad design in these jinja templates (in general). Since '\n' is wrapped inside a JSON string, it should really be written as '\\n'. But for some reason jinja doesn't care about this mistake (which is why it gets shown as a space in the GGUF viewer on huggingface).

Unfortunately, this is not something we can fix on our side.

@tristandruyen (Contributor, Author) commented

I updated the PR :)

@ngxson (Collaborator) commented

@tristandruyen Please have a look at the complete list of phi-3 templates: https://gist.github.com/ngxson/38d8d89fb5d856187bcc70e78b1cc4f5

@tristandruyen (Contributor, Author) commented May 23, 2024

Thanks for the list.
It seems to me like these are 4 different ways to get basically the same result.
There are some differences, but AFAIK they should not matter here:

  • the bos token for small & mini should be handled by tokenizer.ggml.add_bos_token
  • the eos token for mini should likewise be handled by add_eos_token

So the current cpp code seems to do the correct thing, in my understanding.

I think adding all variants as tests would be good; the current matching via <|assistant|> and <|end|> (sketched below) should work for all of them.
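
For reference, that substring matching could look roughly like the following; illustrative only, the actual detection code in llama.cpp may differ in detail:

#include <string>

// Sketch: any Phi-3 family template containing both <|assistant|> and <|end|>
// would be routed to the phi3 formatter.
static bool is_phi3_template(const std::string & tmpl) {
    return tmpl.find("<|assistant|>") != std::string::npos
        && tmpl.find("<|end|>")       != std::string::npos;
}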

@mofosyne added the labels model (Model specific) and Review Complexity : Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) on May 23, 2024