OpenAI-Compatible Chat Completions API Endpoint Responses include EOS / stop tokens #6859
Comments
Hmm - this is unrelated to #6837, since that one pertains to the default Server UI. In fact, I'm running a ChatML instruct-tuned LLM (Nous Hermes 2 Solar 10.7B) in production on an older version of llama.cpp. I did see #6847, but I opted to open this one anyway. They may or may not be the same issue (#6847 uses different models and does not use a chat template), and your title also indicates that it's for "old models", which isn't the case here - all of the models I'm using are recent. I also think it's inaccurate to frame it as "gibberish", since these aren't gibberish tokens, which is usually a separate problem. It's an issue with prompt-template-related tokens, which means the chat template is not being applied or parsed properly in the chat completions endpoint. Given this, I think it's best to either merge that issue into this one, or leave them both open.
I agree to keep them separate. Almost all models, old or new, are impacted by the change. Reverting to older versions seems to work.
@QueryType, in your issue you said:

Do you know what the most recent release is that lets you use the server without this issue? If so, we can work backwards to figure out where it was introduced, and I can try to create a PR to fix it.
Yes, good idea, I will do that. My hunch is that b2707 or b2702 introduced it, so b2700 should be fine. But honestly, I need to check. I'll do it this evening once I have access to the machine.
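One way to work backwards through the releases is a bisect between the suspected good and bad builds. This is only a sketch: it assumes you build the server from source and that the bNNNN release tags are available in your clone.

```bash
# Sketch: bisect between the suspected good build (b2700) and bad build (b2707).
# Assumes a local clone of ggerganov/llama.cpp with the bNNNN release tags fetched.
cd llama.cpp
git fetch --tags
git bisect start b2707 b2700   # bad build first, then good build

# At each step git checks out a candidate commit; rebuild and test it:
make clean && make server
# ...send a /v1/chat/completions request and look for </s> / <|im_end|> in the reply...
git bisect good                # or `git bisect bad`, depending on the result
# Repeat until git reports the first bad commit, then clean up:
git bisect reset
```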
Hmm, it looks like it could've been either of these two merged PRs. The first is https://github.com/ggerganov/llama.cpp/pull/6745/files - it deals with parsing EOG tokens, so it could be that one. The other one, which is my guess, is #6807: it adds a new parameter. That seems like the most likely culprit, cc @ggerganov who was the author. Checking now whether changing this value stops the issue.
Yep, that fixed it. cc @ggerganov: merging #6807 broke this behavior such that the server now renders EOS/stop tokens; changing that value resolves it.
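For anyone wanting to check their own build, a quick test along these lines shows whether a raw stop token leaks into the response. This is a sketch: the prompt is arbitrary and the server is assumed to be running on its default port 8080.

```bash
# Ask for a short completion and grep the raw response for leaked stop-token strings.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}' \
  | grep -oF -e '</s>' -e '<|im_end|>' \
  && echo "stop token leaked into the response"
```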
Also #6872
Also #6873
Not only the
Fixed in #6860 |
Commit: 4e96a81 (origin/master)

Expected Behavior: Chat completions from /v1/chat/completions should not include the stop token in the text returned to the client.

Actual Behavior: The stop token is included when using Mistral 7B Instruct v0.2 with either no chat template or the llama2 chat template.

Example of Broken Behavior

When I run inference with the server and mistral-7b-instruct-v0.2, I use the following command:
./server -m ~/Documents/AI/models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -cb -np 1 -ngl -1 --host 0.0.0.0
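For reference, a minimal request against this endpoint looks roughly like the sketch below (the prompt is a placeholder and the server's default port 8080 is assumed; --host 0.0.0.0 just makes it reachable from other machines):

```bash
# Minimal chat completion request against the OpenAI-compatible endpoint (sketch).
# Placeholder prompt; default port 8080 assumed.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ]
      }'
```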
The result of using the /v1/chat/completions OpenAI endpoint with TheBloke's quant of the model includes the EOS </s> string in the output. This happens both when I omit the --chat-template option and when I use --chat-template llama2, as indicated in this repository's wiki.

In the past, when I have used ChatML fine-tunes of Mistral, I did not see a stop token at the end of the generated text.
However, now, using the ChatML-tuned Hermes 2 Pro Mistral 7B:
./server -m ~/Documents/AI/models/optimal/Hermes-2-Pro-Mistral-7B.Q8_0.gguf -cb -np 1 -c 8096 --host 0.0.0.0
I see the <|im_end|> stop token in the output. I am confident that I never saw stop tokens included in chat completion responses from the OpenAI-compatible endpoint with older versions of llama.cpp.
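To eyeball just the returned text and the trailing token, something like the following works. This is a sketch: it assumes jq is installed, the server's default port 8080, and that the response follows the usual OpenAI choices[0].message.content layout.

```bash
# Print only the assistant message content; with the affected builds the ChatML
# model's reply ends in a literal <|im_end|>. Assumes jq and the default port 8080.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello."}]}' \
  | jq -r '.choices[0].message.content'
```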