OpenAI-Compatible Chat Completions API Endpoint Responses include EOS / stop tokens #6859
Description
Commit: 4e96a81 (origin/master)
Expected Behavior: Chat completions returned from `/v1/chat/completions` should not include the stop token in the text sent to the client.
Actual Behavior: The stop token is included in the response when using Mistral 7B Instruct v0.2 with either no chat template or the llama2 chat template.
Example of Broken Behavior
When I run inference with the server and mistral-7b-instruct-v0.2, I use the following command:
```sh
./server -m ~/Documents/AI/models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -cb -np 1 -ngl -1 --host 0.0.0.0
```
Using the `/v1/chat/completions` OpenAI-compatible endpoint with TheBloke's quant of the model, the response includes the EOS string `</s>` in the output:
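For reference, a request along these lines reproduces it (assuming the server's default port 8080; the prompt and completion text below are illustrative placeholders, not my exact payloads):

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct-v0.2",
    "messages": [
      {"role": "user", "content": "Say hello."}
    ]
  }'
```

The returned `message.content` comes back with the trailing EOS string, roughly:

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?</s>"
      }
    }
  ]
}
```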
This happens both when I omit the `--chat-template` option and when I pass `--chat-template llama2` as indicated in this repository's wiki.
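For the llama2-template case, that is simply the same invocation as above with the template flag appended:

```sh
./server -m ~/Documents/AI/models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -cb -np 1 -ngl -1 --host 0.0.0.0 --chat-template llama2
```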
In the past, when I used ChatML fine-tunes of Mistral, I did not see a stop token at the end of the generated text. However, now, running the ChatML-tuned Hermes 2 Pro Mistral 7B with:

```sh
./server -m ~/Documents/AI/models/optimal/Hermes-2-Pro-Mistral-7B.Q8_0.gguf -cb -np 1 -c 8096 --host 0.0.0.0
```

I see the `<|im_end|>` stop token in the output:
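Again with placeholder prompt and completion text, the response body looks roughly like this, with the model's stop token left at the end of `message.content`:

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?<|im_end|>"
      }
    }
  ]
}
```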
I am confident that I had never seen stop tokens included in chat completion responses from the OpenAI-compatible endpoint with older versions of llama.cpp.