Add phi 3 chat template #6857
Conversation
#6852 was merged into master! @tristandruyen
> Only issue I've noticed is that it seems to output some extra <|end|> tokens at the end of responses on the /v1/chat/completions endpoint.
The rendering of special tokens like <|end|> is now optional on the llama.cpp API level:
llama.h, lines 827 to 838 in 3fec68b:
// Token Id -> Piece.
// Uses the vocabulary in the provided context.
// Does not write null terminator to the buffer.
// User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.
// @param special If true, special tokens are rendered in the output.
LLAMA_API int32_t llama_token_to_piece(
    const struct llama_model * model,
                   llama_token token,
                          char * buf,
                      int32_t length,
                        bool special);
The examples in llama.cpp currently opt in to rendering these tokens, since I think it is useful to see them for debugging purposes. Projects using llama.cpp can decide whether to render them or not by setting the special argument of llama_token_to_piece accordingly.
At some point we can add CLI args to control this in the llama.cpp examples if necessary.
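For illustration, here is a minimal sketch (not code from this PR; the helper name and buffer handling are assumptions following the usual llama.cpp retry pattern) of how a project could suppress special tokens when detokenizing by passing special = false:

```cpp
#include <string>
#include <vector>
#include "llama.h"

// Hypothetical helper: convert a single token to text without rendering
// special tokens such as <|end|>, using the signature shown above.
static std::string token_to_text(const struct llama_model * model, llama_token token) {
    std::vector<char> buf(8, 0);
    int32_t n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), /*special=*/ false);
    if (n < 0) {
        // A negative return value means the buffer was too small; retry with the required size.
        buf.resize(-n);
        n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), /*special=*/ false);
    }
    return n > 0 ? std::string(buf.data(), n) : std::string();
}
```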
Please update the Wiki accordingly.
Using latest commit,
I'm having the exact same issue @isgallagher. Were you able to find a fix or workaround?
I ended up manually adding a ton of stop sequences, and it worked most of the time, but I kept running into weird edge cases that I had to add, so I gave up.
Damn, I'll try with stop sequences for now. Any idea what is happening exactly? Is it a llama.cpp issue or a model issue? Edit: I found a possible workaround. Have it purposely communicate in JSON using
@@ -49,6 +49,8 @@ int main(void) {
"{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Command-R, a brilliant, sophisticated, AI-assistant trained to assist human users by providing thorough responses. You are trained by Cohere.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}",
// Llama-3
"{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
// Phi-3
"{{ bos_token }}{% for message in messages %}{{'<|' + message['role'] + '|>' + ' ' + message['content'] + '<|end|> ' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|> ' }}{% else %}{{ eos_token }}{% endif %}"
@tristandruyen This template uses a space character ' ' after each special token, but in your cpp version you used a new line "\n". Can you check if it is correct?
According to the Chat Format documented in the huggingface READMEs, there should be a \n after <|user|> and <|end|>:
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/README.md#chat-format
While they do not explicitly state that there should be a \n after <|assistant|>, the model nearly always generates a \n as the first token when it is left out...
> According to the Chat Format documented in the huggingface READMEs, there should be a \n after <|user|> and <|end|>

It's best to look at the code directly rather than the README. Sometimes human copy-paste can introduce mistakes.
On the same repo that you linked to, you can see the jinja template code here:
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/tokenizer_config.json#L119
This file has '<|user|>' + '\n', which confirms that there should be a new line after special tokens.
However, that does not explain why your jinja version does not have it. Remember, jinja templates are deterministic, meaning your cpp code needs to follow 100% what is coded inside the template. This means we can fix this PR in one of 2 ways: (1) correct the jinja template to have \n, or (2) correct the cpp implementation to output ' ' instead of a new line.
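For illustration, here is a self-contained sketch (the struct and function names are made up, not the PR's actual code) of what the cpp side looks like when it emits a \n after each special token, matching the '<|' + role + '|>' + '\n' form from tokenizer_config.json:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Illustrative message type; llama.cpp uses its own llama_chat_message struct.
struct chat_msg { std::string role; std::string content; };

// Render a Phi-3 style prompt with a newline after every special token.
static std::string format_phi3(const std::vector<chat_msg> & chat, bool add_generation_prompt) {
    std::stringstream ss;
    for (const auto & m : chat) {
        ss << "<|" << m.role << "|>\n" << m.content << "<|end|>\n";
    }
    if (add_generation_prompt) {
        ss << "<|assistant|>\n"; // prompt the model to answer
    }
    return ss.str();
}
```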
Seems there are even more versions of the template. I guess I had an outdated tokenizer_config.json when making my GGUFs, because I just took the jinja template from the gguf metadata of my recent phi3 quants, which should just be a copy of what was in the tokenizer_config.json, shouldn't it?
I think I found the reason: the huggingface GUI for displaying GGUF metadata seems to replace \n with whitespace for some weird reason...
I'll update the template with the correct one...
Edit: there are indeed 2 versions, but they both use a new line:
- One with '<|' + message['role'] + '|>' + '\n': https://huggingface.co/microsoft/Phi-3-small-128k-instruct/blob/main/tokenizer_config.json#L10
- And another one with an explicit <|user|>: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/tokenizer_config.json#L119
I think it's best to somehow scan all templates in the existing phi-3 family, then make a fix PR. Without careful research, I'm afraid there will be more confusion in the future.
> I think I found the reason: the huggingface GUI for displaying GGUF metadata seems to replace \n with whitespace for some weird reason...
I'd agree that there is a bad design in the jinja templates (in general). Since '\n' is wrapped inside a json string, it should be converted to '\\n'. But for some reason, jinja doesn't care about this mistake (that's why it gets converted to a space in the gguf viewer on huggingface).
Unfortunately, this is not something we can fix on our side.
I updated the PR :)
@tristandruyen Please have a look at the complete list of phi-3 templates: https://gist.github.com/ngxson/38d8d89fb5d856187bcc70e78b1cc4f5
Thanks for the list.
So it seems to me like these are 4 different ways to get basically the same result.
There are some differences, but AFAIK they should not matter here:
- the bos token for small & mini should be handled by tokenizer.ggml.add_bos_token
- the eos token for mini should be handled likewise by add_eos_token
So the current C++ code seems to do the correct thing in my understanding.
I think adding all variants as tests would be good; the current matching via <|assistant|> and <|end|> should work for all of them.
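As a sketch of the matching mentioned above (illustrative only; the function name is made up and this is not necessarily the exact check used in llama.cpp):

```cpp
#include <string>

// Treat a template as Phi-3 style if it contains both special markers.
static bool is_phi3_template(const std::string & tmpl) {
    return tmpl.find("<|assistant|>") != std::string::npos
        && tmpl.find("<|end|>")       != std::string::npos;
}
```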
This adds the phi 3 chat template.
Works mostly fine in my testing with the commits from #6851 cherry-picked for quantizing.
Only issue I've noticed is that it seems to output some extra <|end|> tokens at the end of responses on the /v1/chat/completions endpoint.
Although in my understanding that's a tokenization issue that should be solved in #6852 or #6851?