Conversation

@firecoperana
Collaborator

This is essentially a restart of the existing tool call support that covers a wider range of models. The reason for this PR is that each model tends to need a different template for tool calls, and the previous implementation was very easy to get wrong. This PR uses mainline's approach to handling tool calls, which makes it easier to support new models and to fix existing and future bugs. It includes mainline's tool-call-related PRs up to 8/22/2025.
It is not thoroughly tested, but in my simple tests the results look fine. If you find a bug, please check first whether it also exists in mainline.

Changes to the API:
The chat completions API result should match latest mainline.
/completions and /v1/completions will use OpenAI-style completions.
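For reference, a minimal sketch of hitting the OpenAI-compatible completions endpoint; the host, port, model name, and prompt below are placeholders, and the exact response fields follow mainline's OAI-compatible output:

    # sketch: OpenAI-style completion request (host/port/model/prompt are placeholders)
    curl -s http://localhost:8080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "default", "prompt": "I believe the meaning of life is", "max_tokens": 16}'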

The important mainline PRs are:

  1. Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars ggml-org/llama.cpp#9639
  2. server: streaming of tool calls and thoughts when --jinja is on (#12379)
  3. server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) ggml-org/llama.cpp#13771
  4. server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client ggml-org/llama.cpp#13196
  5. model : gpt-oss add response_format support ggml-org/llama.cpp#15494

@firecoperana self-assigned this Aug 23, 2025
svr->Post("/completion", handle_completions); // legacy
svr->Post("/completions", handle_completions);
svr->Post("/v1/completions", handle_completions);
svr->Post("/completions", handle_completions_oai);
Collaborator

@saood06 Aug 24, 2025

Why is "/completions" being changed to call the OAI compatible method handle_completions_oai?

I just checked mainline and isn't done there, is there any reason for this change as I don't really think it makes sense.

Also on a related topic, the handle_completions, I haven't gone through all of the code in this PR or tested the PR yet but based on my skimming of it so far, the code looks like it might pull in a change which on mainline ended up changing default behavior for the non-OAI endpoint in a way that I don't really agree with because it changes things for downstream apps that result in worse performance due to different outputs

See this comment: ggml-org/llama.cpp#10783 (comment)

You can see the downstream impacts it has had:

lmg-anon/mikupad#104 and lmg-anon/mikupad#110

If this PR does make that change (feel free to tell me if it doesn't), I do feel like "post_sampling_probs" should default to true. That way downstream applications can opt in to the new behavior but are not impacted otherwise (mikupad isn't the only downstream app affected, but it is the only one where I know impacted users noticed and said something).

Sorry if what I said is not relevant (I wanted to comment this now, since testing may take a while and I'm not sure I can do it tonight).

Edit: Just saw the tests in tools/server/tests/unit/test_completion.py, and if that test case is accurate, it confirms what I was asking about above.

Being able to access the logprobs via the non-OAI endpoint if wanted is nice, but I don't think introducing that functionality should change the default behavior.

I know it is complicated in that mainline made this change, so there are now two separate sets of expectations. They also changed the output stream at the same time, which made the break obvious for affected users, as it changed what you received when you asked for n_probs on that endpoint. Not only that, but it put a performance cost on an API parameter (n_probs) that did not have one before by default (the cost can be avoided by passing post_sampling_probs as true, but that is a parameter that did not exist before).
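To make the difference concrete, a sketch of the kind of request being discussed (legacy endpoint, default host/port assumed); whether the response carries per-token probs, top_probs, or top_logprobs is exactly the behavior in question:

    # sketch: legacy completion request asking for token probabilities
    curl -s http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "I believe the meaning of life is", "n_predict": 5, "n_probs": 10}'
    # old default: per-token "probs" (post-sampling)
    # new mainline default: per-token "top_logprobs", unless "post_sampling_probs": true is passed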

Collaborator Author

I will change it back. I didn't realize it would break things.

@ikawrakow requested a review from saood06 August 24, 2025 12:53
@ubergarm
Contributor

Thanks for looking into this. I'm not exactly sure how to use it, but it seems we need something like this to support enabling/disabling thinking for DeepSeek-V3.1, given how it handles that in the chat template: https://huggingface.co/deepseek-ai/DeepSeek-V3.1#chat-template

So as I understand it, /v1/* will use the newer OAI-style endpoints, and the legacy stuff will remain at the non-/v1/ API endpoints?

One thing I did test which works on mainline but was not working here is this part:

server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client

It seems like --chat-template-kwargs is supported on mainline, but not in this PR, e.g.:

    --jinja \
    --chat-template-kwargs '{"thinking": true}' \

That said, I still don't know how to get mainline to do it, haha, but that is for me to figure out: a proper OAI client "prompt" or whatever data it wants.

Anyway, I'd love to see a working example of how to enable thinking for DeepSeek-V3.1 with this PR.

@firecoperana
Collaborator Author

firecoperana commented Aug 25, 2025

You are correct about the API usage. I forgot to add --chat-template-kwargs to the command-line args; I need to add it. Maybe try --reasoning-budget 0 and see if it works; it disables thinking.
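A sketch of what that would look like on the command line (binary name and model path are placeholders):

    # sketch: start the server with thinking disabled via the reasoning budget
    ./llama-server -m ./model.gguf --jinja --reasoning-budget 0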

@firecoperana
Collaborator Author

--chat-template-kwargs is added
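A sketch of both ways to pass the kwargs now that the flag is in, assuming the per-request field name matches mainline's chat_template_kwargs (binary name, model path, host, and port are placeholders):

    # sketch: enable DeepSeek-V3.1 / Qwen3-style thinking from the command line
    ./llama-server -m ./model.gguf --jinja --chat-template-kwargs '{"thinking": true}'

    # sketch: the same per request, via the chat completions endpoint
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "hello"}], "chat_template_kwargs": {"thinking": true}}'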

Comment on lines 367 to 389
def test_n_probs():
    global server
    server.start()
    res = server.make_request("POST", "/completion", data={
        "prompt": "I believe the meaning of life is",
        "n_probs": 10,
        "temperature": 0.0,
        "n_predict": 5,
    })
    assert res.status_code == 200
    assert "completion_probabilities" in res.body
    assert len(res.body["completion_probabilities"]) == 5
    for tok in res.body["completion_probabilities"]:
        assert "id" in tok and tok["id"] > 0
        assert "token" in tok and type(tok["token"]) == str
        assert "logprob" in tok and tok["logprob"] <= 0.0
        assert "bytes" in tok and type(tok["bytes"]) == list
        assert len(tok["top_logprobs"]) == 10
        for prob in tok["top_logprobs"]:
            assert "id" in prob and prob["id"] > 0
            assert "token" in prob and type(prob["token"]) == str
            assert "logprob" in prob and prob["logprob"] <= 0.0
            assert "bytes" in prob and type(prob["bytes"]) == list
Collaborator

@saood06 Aug 26, 2025

This is the specific test I was referring to before.

Assuming this test is accurate and passes, it shows that the new default behavior is to return top_logprobs, which is not only different but also, as mentioned in my other comment, comes at a performance penalty according to what people have said.

I just tested the current (without this PR) default behavior for n_probs (both in and outside of a stream).

It returned probs, which is different from both top_probs (which another test case says will now be returned if you pass post_sampling_probs as true) and top_logprobs (which this test case shows is the new default).

I don't see a benefit to this breaking change. n_probs returning probs by default makes sense to me; allowing it to return logprobs by sending post_sampling_probs as false is also nice and is not a breaking change.

Collaborator Author

Still trying to understand your issue here. Does this logprob issue occur when you use /v1/completions, which has been changed to OpenAI-compatible completions, or with /completion and /completions? I actually removed probabilities in the old completion endpoint. I can add them back if that is what you want.

For OpenAI-compatible completions, mainline defaults post_sampling_probs to false (it can be set by the client), but it is undefined here. If I default it to true and allow it to be set by the client, will that solve the performance issue with logprobs?

Collaborator

@saood06 Aug 26, 2025

I actually removed probabilities in the old completion endpoint. I can add them back if that is what you want.

I would appreciate that.

Edit: No, you didn't. I actually tested it and things work. That does mean this test file is wrong, so it should either be updated or removed.

For OpenAI-compatible completions, mainline defaults post_sampling_probs to false (it can be set by the client), but it is undefined here. If I default it to true and allow it to be set by the client, will that solve the performance issue with logprobs?

Following whatever the OpenAI spec is for those endpoints makes the most sense (as they are being provided as OAI-compatible), but I was referring to the two non-OAI endpoints, asking not to make breaking changes there, which this test says happened.

Still trying to understand your issue here.

This test file is wrong, so it should either be updated or removed.

@dpk-it
Contributor

dpk-it commented Aug 26, 2025

It looks like this PR is breaking usage/timings again, but I'm not sure; it needs to be tested.

@ikawrakow
Owner

What is the status here? Are all disagreements resolved?

@saood06
Collaborator

saood06 commented Aug 27, 2025

What is the status here? Are all disagreements resolved?

The test files included in this PR would not pass when run (in this case it is the test that is wrong, as this branch still handles things the old way). It does look like there may be a regression in usage/timings in some places.

I tested more and found another regression with token streaming off: with this PR the output does not contain any of the predicted tokens, and it is also missing timings and usage. (I checked the KV cache with the new endpoint and could see the tokens were generated, just not output.)

@firecoperana
Collaborator Author

The new commit should fix all of the regressions. I didn't look at the code in the tests.

res.oaicompat_model = slot.oaicompat_model;
res.data = json {
{"content", !slot.params.stream ? slot.generated_text : ""},
{"generated_text", slot.generated_text}, // Always include full text for finish_reason logic
Contributor

@dpk-it Aug 28, 2025

Why did you remove "generated_text" from the final response, and "oaicompat_msg_diffs"... from the /v1/completions endpoint? Some people may already be using these in their applications... In addition, "usage" allows you to display statistics in "llama-swap" for this endpoint.

Contributor

I meant the old /v1/completions, which is now available at /completions.

Collaborator Author

Usage is already added, unless you need it for /completions. generated_text is added back. oaicompat_msg_diffs was added in a prior tool call PR, so it should be fine. Just in case, we can add a note stating the breaking changes and let people adjust their API usage when needed, or fix any regressions in future PRs.
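For anyone double-checking, a sketch of inspecting those fields on the legacy endpoint (host/port are placeholders, and the exact set of keys depends on the build and flags):

    # sketch: list the top-level keys of a non-streaming legacy completion response
    curl -s http://localhost:8080/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Hello", "n_predict": 8, "stream": false}' \
      | python3 -c 'import json, sys; print(sorted(json.load(sys.stdin).keys()))'
    # fields discussed above: "content", "generated_text", "timings", and "usage"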

@saood06
Collaborator

saood06 commented Aug 29, 2025

I didn't look at the code in the tests.

I haven't tested the code you just pushed yet, but I see the tests are still in the PR. Merging in tests that do not reflect what is being merged is just confusing. Can you either remove or update the tests?

@firecoperana
Collaborator Author

firecoperana commented Aug 29, 2025

They are removed.

@ultoris

ultoris commented Aug 29, 2025

Regarding GPT-OSS response format support: when using the main branch or this PR, the thinking part is not parsed (the <|channel|>analysis<|message|> part is shown as-is in clients like Open WebUI or Aider instead of as a thinking block). This does not happen in llama.cpp, where the thinking block appears as expected.
I've used the same model, and the raw model output seems to be identical in both cases; it looks like something is missing when parsing the Harmony-format response.

@ikawrakow
Owner

@firecoperana Does #740 fix it?

firecoperana and others added 2 commits August 30, 2025 07:47
@firecoperana
Collaborator Author

@firecoperana Does #740 fix it?

Yes, now it's good. I will include this in my PR.

@firecoperana
Collaborator Author

With #740, tool calls also work properly with GPT-OSS. To move the thinking process into a separate field like mainline, use --reasoning-format auto. The default here is --reasoning-format none, which puts both reasoning and content in the content field. The built-in webui uses none, and it parses the thinking process and content itself.
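A sketch of the two modes side by side; the reasoning_content field name follows mainline's chat completions output and is an assumption here, and the binary and model names are placeholders:

    # sketch: split reasoning into a separate field, as in mainline
    ./llama-server -m ./gpt-oss.gguf --jinja --reasoning-format auto
    # chat completion messages then carry "reasoning_content" alongside "content"

    # sketch: default for this PR, reasoning stays inline in "content"
    ./llama-server -m ./gpt-oss.gguf --jinja --reasoning-format none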

@oovloveme

oovloveme commented Aug 30, 2025

It seems <think> was missing when running DeepSeek V3.1 on Open WebUI in this branch (there is a </think> at the end of the CoT but no <think> at the beginning).

The model was run with:
--jinja \
--chat-template-kwargs '{"thinking": true}'

@firecoperana
Collaborator Author

Can you add --reasoning-format auto and compare the result with mainline? I don't have deepseek v3.1 to test myself.

@ChicoPinto70

ChicoPinto70 commented Aug 30, 2025

Seems "" was missing when running deepseek v3.1 on open-webui in this branch ( have "" at the end of COT but no "" at the beginning)

The model run with --jinja --chat-template-kwargs '{"thinking": true}' \

I'm having this problem, too. I can turn on reasoning in DS v3.1 with --jinja --chat-template-kwargs '{"thinking": true}' parameters, but the tag <think> doesn't appears in the beginning of the COT (just the </think> at the end).

I'm using ubergarm/DeepSeek-V3.1-smol-IQ4_KSS model.

@firecoperana
Collaborator Author

Seems "" was missing when running deepseek v3.1 on open-webui in this branch ( have "" at the end of COT but no "" at the beginning)
The model run with --jinja --chat-template-kwargs '{"thinking": true}' \

I'm having this problem, too. I can turn on reasoning in DS v3.1 with --jinja --chat-template-kwargs '{"thinking": true}' parameters, but the tag doesn't appears in the beginning of the COT (just the at the end).

I'm using ubergarm/DeepSeek-V3.1-smol-IQ4_KSS model.

Found this. ggml-org/llama.cpp#15496
Wait for the fix to be merged in mainline

@ChicoPinto70

ChicoPinto70 commented Aug 30, 2025

Seems "" was missing when running deepseek v3.1 on open-webui in this branch ( have "" at the end of COT but no "" at the beginning)
The model run with --jinja --chat-template-kwargs '{"thinking": true}' \

I'm having this problem, too. I can turn on reasoning in DS v3.1 with --jinja --chat-template-kwargs '{"thinking": true}' parameters, but the tag doesn't appears in the beginning of the COT (just the at the end).
I'm using ubergarm/DeepSeek-V3.1-smol-IQ4_KSS model.

Found this. ggml-org/llama.cpp#15496 Wait for the fix to be merged in mainline

OK! I've just tried to run the mainline (it was a long time that I don't use it.) but it doesn't support this ubergarm model. 👍

Owner

@ikawrakow left a comment

If there are no further objections I'll merge this tomorrow morning (my time).

@saood06
Collaborator

saood06 commented Sep 1, 2025

If there are no further objections I'll merge this tomorrow morning (my time).

I don't see any regressions anymore so no objection from me.

@arichiardi

I haven't checked if it is included here, but today this fix was merged into mainline (it fixes a bug with the enable_thinking kwarg logic):

ggml-org/llama.cpp#15404

@firecoperana
Collaborator Author

I haven't checked if it is included here, but today this fix was merged into mainline (it fixes a bug with the enable_thinking kwarg logic):

ggml-org/llama.cpp#15404

DeepSeek V3.1 tool calling was just merged in mainline. I will include the above PR when porting it.

@firecoperana
Collaborator Author

PR with Deepseek v3.1 tool calling

@arichiardi Thinking model disabled assistant prefill is included in the above PR.

sayap added a commit to sayap/ik_llama.cpp that referenced this pull request Sep 22, 2025
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783.

That PR and friends were partially cherry-picked by ikawrakow#723, but the result
wasn't really in a working state yet.

A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
  done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
  still a TODO in mainline [1].

Note that there is a TG performance penalty to return the logprobs.
Here are some numbers I got with Qwen2.5-Coder-32B-Instruct:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)

[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019
@sayap mentioned this pull request Sep 22, 2025
ikawrakow pushed a commit that referenced this pull request Sep 25, 2025
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus
optimization to do partial sort when sorting the logits.

That mainline PR and friends were partially cherry-picked by #723, but the result
wasn't really in a working state yet.

A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
  done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
  still a TODO in mainline [1].

Note that there is a TG performance penalty to return the logprobs, as
we need to sort the logits. By doing partial sort, the penalty is quite
small. Here are some numbers I got using the same prompt:

This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)

If cherry-pick the full sort from mainline PR:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)

[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@firecoperana deleted the fcp/tool_call_support branch October 26, 2025 16:57