Conversation

@firecoperana
Collaborator

This is essentially a restart of the existing tool call support that covers a wider range of models. The reason for this PR is that each model tends to need a different template for tool calls, and the previous implementation was very easy to get wrong. This PR uses mainline's approach to handling tool calls, which makes it easier to support new models and to fix existing and future bugs. It includes mainline's tool-call-related PRs up to 8/22/2025.
It is not thoroughly tested, but in my simple tests the results look fine. If you find a bug, please check first whether it also exists in mainline.

Changes to the API:
The chat completions API result should match latest mainline.
/completions and /v1/completions will use OpenAI-style completions.
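For reference, a minimal sketch of hitting the OpenAI-compatible completions endpoint; the host, port, model name, and prompt below are placeholders, and the exact response fields follow mainline's OAI-compatible output:

    # sketch: OpenAI-style completion request (host/port/model/prompt are placeholders)
    curl -s http://localhost:8080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "default", "prompt": "I believe the meaning of life is", "max_tokens": 16}'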

The important mainline PRs are:

  1. Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars ggml-org/llama.cpp#9639
  2. server: streaming of tool calls and thoughts when --jinja is on (#12379)
  3. server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) ggml-org/llama.cpp#13771
  4. server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client ggml-org/llama.cpp#13196
  5. model : gpt-oss add response_format support ggml-org/llama.cpp#15494

@firecoperana self-assigned this Aug 23, 2025
svr->Post("/completion", handle_completions); // legacy
svr->Post("/completions", handle_completions);
svr->Post("/v1/completions", handle_completions);
svr->Post("/completions", handle_completions_oai);
Collaborator

@saood06 Aug 24, 2025

Why is "/completions" being changed to call the OAI compatible method handle_completions_oai?

I just checked mainline and isn't done there, is there any reason for this change as I don't really think it makes sense.

Also on a related topic, the handle_completions, I haven't gone through all of the code in this PR or tested the PR yet but based on my skimming of it so far, the code looks like it might pull in a change which on mainline ended up changing default behavior for the non-OAI endpoint in a way that I don't really agree with because it changes things for downstream apps that result in worse performance due to different outputs

See this comment: ggml-org/llama.cpp#10783 (comment)

You can see the downstream impacts it has had:

lmg-anon/mikupad#104 and lmg-anon/mikupad#110

If this PR does make that change (feel free to tell me if it doesn't), I do feel like "post_sampling_probs" should default to true. That way downstream applications can opt in to the new behavior but are not impacted otherwise (mikupad isn't the only downstream app affected, but it is the only one where I know impacted users noticed and said something).

Sorry if what I said is not relevant (I wanted to comment this now, since testing may take a while and I'm not sure I can do it tonight).

Edit: Just saw the tests in tools/server/tests/unit/test_completion.py, and if that test case is accurate, it confirms what I was asking about above.

Being able to access the logprobs via the non-OAI endpoint if wanted is nice, but I don't think introducing that functionality should change the default behavior.

I know it is complicated in that mainline made this change, so there are now two separate sets of expectations. They also changed the output stream at the same time, which made the break obvious for affected users, as it changed what you received when you asked for n_probs on that endpoint. Not only that, but it put a performance cost on an API parameter (n_probs) that did not have one before by default (the cost can be avoided by passing post_sampling_probs as true, but that is a parameter that did not exist before).
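To make the difference concrete, a sketch of the kind of request being discussed (legacy endpoint, default host/port assumed); whether the response carries per-token probs, top_probs, or top_logprobs is exactly the behavior in question:

    # sketch: legacy completion request asking for token probabilities
    curl -s http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "I believe the meaning of life is", "n_predict": 5, "n_probs": 10}'
    # old default: per-token "probs" (post-sampling)
    # new mainline default: per-token "top_logprobs", unless "post_sampling_probs": true is passed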

Collaborator Author

I will change it back. I didn't realize it would break things.

@ikawrakow requested a review from saood06 August 24, 2025 12:53
@ubergarm
Contributor

Thanks for looking into this. I'm not exactly sure how to use it, but it seems we need something like this to support enabling/disabling thinking for DeepSeek-V3.1, given how it handles that in the chat template: https://huggingface.co/deepseek-ai/DeepSeek-V3.1#chat-template

So as I understand it, /v1/* will use the newer OAI-style endpoints, and the legacy stuff will remain at the non-/v1/ API endpoints?

One thing I did test which works on mainline but was not working here is this part:

server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client

It seems like --chat-template-kwargs is supported on mainline, but not in this PR, e.g.:

    --jinja \
    --chat-template-kwargs '{"thinking": true}' \

That said, I still don't know how to get mainline to do it, haha, but that is for me to figure out: a proper OAI client "prompt" or whatever data it wants.

Anyway, I'd love to see a working example of how to enable thinking for DeepSeek-V3.1 with this PR.

@firecoperana
Collaborator Author

firecoperana commented Aug 25, 2025

You are correct about the API usage. I forgot to add --chat-template-kwargs to the command-line args; I need to add it. Maybe try --reasoning-budget 0 and see if it works; it disables thinking.
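A sketch of what that would look like on the command line (binary name and model path are placeholders):

    # sketch: start the server with thinking disabled via the reasoning budget
    ./llama-server -m ./model.gguf --jinja --reasoning-budget 0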

@firecoperana
Collaborator Author

--chat-template-kwargs is added
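A sketch of both ways to pass the kwargs now that the flag is in, assuming the per-request field name matches mainline's chat_template_kwargs (binary name, model path, host, and port are placeholders):

    # sketch: enable DeepSeek-V3.1 / Qwen3-style thinking from the command line
    ./llama-server -m ./model.gguf --jinja --chat-template-kwargs '{"thinking": true}'

    # sketch: the same per request, via the chat completions endpoint
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "hello"}], "chat_template_kwargs": {"thinking": true}}'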

Comment on lines 367 to 389
def test_n_probs():
    global server
    server.start()
    res = server.make_request("POST", "/completion", data={
        "prompt": "I believe the meaning of life is",
        "n_probs": 10,
        "temperature": 0.0,
        "n_predict": 5,
    })
    assert res.status_code == 200
    assert "completion_probabilities" in res.body
    assert len(res.body["completion_probabilities"]) == 5
    for tok in res.body["completion_probabilities"]:
        assert "id" in tok and tok["id"] > 0
        assert "token" in tok and type(tok["token"]) == str
        assert "logprob" in tok and tok["logprob"] <= 0.0
        assert "bytes" in tok and type(tok["bytes"]) == list
        assert len(tok["top_logprobs"]) == 10
        for prob in tok["top_logprobs"]:
            assert "id" in prob and prob["id"] > 0
            assert "token" in prob and type(prob["token"]) == str
            assert "logprob" in prob and prob["logprob"] <= 0.0
            assert "bytes" in prob and type(prob["bytes"]) == list
Collaborator

@saood06 Aug 26, 2025

This is the specific test I was referring to before.

Assuming this test is accurate and passes, it shows that the new default behavior is to return top_logprobs, which is not only different but also, as mentioned in my other comment, comes at a performance penalty according to what people have said.

I just tested the current (without this PR) default behavior for n_probs (both in and outside of a stream).

It returned probs, which is different from both top_probs (which another test case says will now be returned if you pass post_sampling_probs as true) and top_logprobs (which this test case shows is the new default).

I don't see a benefit to this breaking change. n_probs returning probs by default makes sense to me; allowing it to return logprobs by sending post_sampling_probs as false is also nice and is not a breaking change.

Collaborator Author

Still trying to understand your issue here. Does this logprob issue occur when you use /v1/completions, which has been changed to OpenAI-compatible completions, or with /completion and /completions? I actually removed probabilities in the old completion endpoint. I can add them back if that is what you want.

For OpenAI-compatible completions, mainline defaults post_sampling_probs to false (it can be set by the client), but it is undefined here. If I default it to true and allow it to be set by the client, will that solve the performance issue with logprobs?

Collaborator

@saood06 Aug 26, 2025

I actually removed probabilities in the old completion endpoint. I can add them back if that is what you want.

I would appreciate that.

Edit: No, you didn't. I actually tested it and things work. That does mean this test file is wrong, so it should either be updated or removed.

For OpenAI-compatible completions, mainline defaults post_sampling_probs to false (it can be set by the client), but it is undefined here. If I default it to true and allow it to be set by the client, will that solve the performance issue with logprobs?

Following whatever the OpenAI spec is for those endpoints makes the most sense (as they are being provided as OAI-compatible), but I was referring to the two non-OAI endpoints, asking not to make breaking changes there, which this test says happened.

Still trying to understand your issue here.

This test file is wrong, so it should either be updated or removed.

@dpk-it
Contributor

dpk-it commented Aug 26, 2025

It looks like this PR is breaking usage/timings again, but I'm not sure; it needs to be tested.

@ikawrakow
Owner

What is the status here? Are all disagreements resolved?

@saood06
Collaborator

saood06 commented Aug 27, 2025

What is the status here? Are all disagreements resolved?

The test files included in this PR would not pass when run (in this case it is the test that is wrong, as this branch still handles things the old way). It does look like there may be a regression in usage/timings in some places.

I tested more and found another regression with token streaming off: with this PR the output does not contain any of the predicted tokens, and it is also missing timings and usage. (I checked the KV cache with the new endpoint and could see the tokens were generated, just not output.)

@firecoperana
Collaborator Author

The new commit should fix all of the regressions. I didn't look at the code in the tests.

res.oaicompat_model = slot.oaicompat_model;
res.data = json {
{"content", !slot.params.stream ? slot.generated_text : ""},
{"generated_text", slot.generated_text}, // Always include full text for finish_reason logic
Contributor

@dpk-it Aug 28, 2025

Why did you remove "generated_text" from the final response, and "oaicompat_msg_diffs"... from the /v1/completions endpoint? Some people may already be using these in their applications... In addition, "usage" allows you to display statistics in "llama-swap" for this endpoint.

Contributor

I meant the old /v1/completions, which is now available at /completions.

Collaborator Author

Usage is already added, unless you need it for /completions. generated_text is added back. oaicompat_msg_diffs was added in a prior tool call PR, so it should be fine. Just in case, we can add a note stating the breaking changes and let people adjust their API usage when needed, or fix any regressions in future PRs.
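For anyone double-checking, a sketch of inspecting those fields on the legacy endpoint (host/port are placeholders, and the exact set of keys depends on the build and flags):

    # sketch: list the top-level keys of a non-streaming legacy completion response
    curl -s http://localhost:8080/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Hello", "n_predict": 8, "stream": false}' \
      | python3 -c 'import json, sys; print(sorted(json.load(sys.stdin).keys()))'
    # fields discussed above: "content", "generated_text", "timings", and "usage"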

@saood06
Collaborator

saood06 commented Aug 29, 2025

I didn't look at the code in the tests.

I haven't tested the code you just pushed yet, but I see the tests are still in the PR. Merging in tests that do not reflect what is being merged is just confusing. Can you either remove or update the tests?

@firecoperana
Collaborator Author

firecoperana commented Aug 29, 2025

They are removed.

@ultoris

ultoris commented Aug 29, 2025

Regarding GPT-OSS response format support: when using the main branch or this PR, the thinking part is not parsed (the <|channel|>analysis<|message|> part is shown as-is in clients like Open WebUI or Aider instead of as a thinking block). This does not happen in llama.cpp, where the thinking block appears as expected.
I've used the same model, and the raw model output seems to be identical in both cases; it looks like something is missing when parsing the Harmony-format response.

@ikawrakow
Owner

@firecoperana Does #740 fix it?

firecoperana and others added 2 commits August 30, 2025 07:47
@firecoperana
Collaborator Author

@firecoperana Does #740 fix it?

Yes, now it's good. I will include this in my PR.

@firecoperana
Collaborator Author

With #740, tool calls also work properly with GPT-OSS. To move the thinking process into a separate field like mainline, use --reasoning-format auto. The default here is --reasoning-format none, which puts both reasoning and content in the content field. The built-in webui uses none, and it parses the thinking process and content itself.
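A sketch of the two modes side by side; the reasoning_content field name follows mainline's chat completions output and is an assumption here, and the binary and model names are placeholders:

    # sketch: split reasoning into a separate field, as in mainline
    ./llama-server -m ./gpt-oss.gguf --jinja --reasoning-format auto
    # chat completion messages then carry "reasoning_content" alongside "content"

    # sketch: default for this PR, reasoning stays inline in "content"
    ./llama-server -m ./gpt-oss.gguf --jinja --reasoning-format none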

@oovloveme

oovloveme commented Aug 30, 2025

It seems <think> was missing when running DeepSeek V3.1 on Open WebUI in this branch (there is a </think> at the end of the CoT but no <think> at the beginning).

The model was run with:
--jinja \
--chat-template-kwargs '{"thinking": true}'

@firecoperana
Collaborator Author

Can you add --reasoning-format auto and compare the result with mainline? I don't have deepseek v3.1 to test myself.

@ChicoPinto70

ChicoPinto70 commented Aug 30, 2025

Seems "" was missing when running deepseek v3.1 on open-webui in this branch ( have "" at the end of COT but no "" at the beginning)

The model run with --jinja --chat-template-kwargs '{"thinking": true}' \

I'm having this problem, too. I can turn on reasoning in DS v3.1 with --jinja --chat-template-kwargs '{"thinking": true}' parameters, but the tag <think> doesn't appears in the beginning of the COT (just the </think> at the end).

I'm using ubergarm/DeepSeek-V3.1-smol-IQ4_KSS model.

@firecoperana
Collaborator Author

Seems "" was missing when running deepseek v3.1 on open-webui in this branch ( have "" at the end of COT but no "" at the beginning)
The model run with --jinja --chat-template-kwargs '{"thinking": true}' \

I'm having this problem, too. I can turn on reasoning in DS v3.1 with --jinja --chat-template-kwargs '{"thinking": true}' parameters, but the tag doesn't appears in the beginning of the COT (just the at the end).

I'm using ubergarm/DeepSeek-V3.1-smol-IQ4_KSS model.

Found this. ggml-org/llama.cpp#15496
Wait for the fix to be merged in mainline

@ChicoPinto70

ChicoPinto70 commented Aug 30, 2025

Seems "" was missing when running deepseek v3.1 on open-webui in this branch ( have "" at the end of COT but no "" at the beginning)
The model run with --jinja --chat-template-kwargs '{"thinking": true}' \

I'm having this problem, too. I can turn on reasoning in DS v3.1 with --jinja --chat-template-kwargs '{"thinking": true}' parameters, but the tag doesn't appears in the beginning of the COT (just the at the end).
I'm using ubergarm/DeepSeek-V3.1-smol-IQ4_KSS model.

Found this. ggml-org/llama.cpp#15496 Wait for the fix to be merged in mainline

OK! I've just tried to run the mainline (it was a long time that I don't use it.) but it doesn't support this ubergarm model. 👍

Owner

@ikawrakow left a comment

If there are no further objections I'll merge this tomorrow morning (my time).

@saood06
Collaborator

saood06 commented Sep 1, 2025

If there are no further objections I'll merge this tomorrow morning (my time).

I don't see any regressions anymore so no objection from me.

@arichiardi

I haven't checked if it is included here, but today this fix was merged into mainline (it fixes a bug with the enable_thinking kwarg logic):

ggml-org/llama.cpp#15404

@firecoperana
Collaborator Author

I haven't checked if it is included here, but today this fix was merged into mainline (it fixes a bug with the enable_thinking kwarg logic):

ggml-org/llama.cpp#15404

DeepSeek V3.1 tool calling was just merged in mainline. I will include the above PR when porting it.

@firecoperana
Collaborator Author

PR with Deepseek v3.1 tool calling

@arichiardi Thinking model disabled assistant prefill is included in the above PR.

sayap added a commit to sayap/ik_llama.cpp that referenced this pull request Sep 22, 2025
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783.

That PR and friends were partially cherry-picked by ikawrakow#723, but the result
wasn't really in a working state yet.

A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
  done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
  still a TODO in mainline [1].

Note that there is a TG performance penalty to return the logprobs.
Here are some numbers I got with Qwen2.5-Coder-32B-Instruct:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)

[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019
@sayap mentioned this pull request Sep 22, 2025
ikawrakow pushed a commit that referenced this pull request Sep 25, 2025
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus
optimization to do partial sort when sorting the logits.

That mainline PR and friends were partially cherry-picked by #723, but the result
wasn't really in a working state yet.

A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
  done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
  still a TODO in mainline [1].

Note that there is a TG performance penalty to return the logprobs, as
we need to sort the logits. By doing partial sort, the penalty is quite
small. Here are some numbers I got using the same prompt:

This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)

If cherry-pick the full sort from mainline PR:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)

[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@firecoperana deleted the fcp/tool_call_support branch October 26, 2025 16:57