tool-call: Phi-4 support #12288


Open: wants to merge 21 commits into master

Conversation


@jpohhhh jpohhhh commented Mar 9, 2025

  • Add a system message if needed (per template requirement)
  • Add tools to the system message (required by the template)
  • Parse output:
    -- a tool call is added to the response when there is valid JSON between <|tool_call|> and <|/tool_call|>
    -- content outside of the tool_call tags is added to the text portion of the response
    -- if there is no valid JSON, the entire content is added to the text portion of the response
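For illustration, the parsing rules above can be sketched in Python (a hypothetical sketch of the described behaviour; the PR's actual implementation is C++ in common/chat.cpp, and `parse_phi4_output` is not a real function there):

```python
import json
import re

# Non-greedy match of anything between the tool-call special tokens.
TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)

def parse_phi4_output(output: str) -> dict:
    """Split model output into text content and parsed tool calls."""
    tool_calls, text_parts, last_end = [], [], 0
    for m in TOOL_CALL_RE.finditer(output):
        text_parts.append(output[last_end:m.start()])
        try:
            tool_calls.append(json.loads(m.group(1)))
        except json.JSONDecodeError:
            # No valid JSON: keep the whole tagged span as plain text.
            text_parts.append(m.group(0))
        last_end = m.end()
    text_parts.append(output[last_end:])
    return {"content": "".join(text_parts), "tool_calls": tool_calls}
```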

Make sure to read the contributing guidelines before submitting a PR

(cc @ochafik)

@github-actions github-actions bot added the testing (Everything test related) label Mar 9, 2025
@ngxson ngxson requested a review from ochafik March 9, 2025 20:39
common/chat.cpp Outdated
if (!found_system_msg && !adjusted_messages.empty()) {
json system_msg = {
{"role", "system"},
{"content", "You are a helpful assistant with access to tools.\nTo use a tool, respond in this format: <|tool_call|>{\"name\": \"foo\", \"arguments\": {\"a\": 1}}<|/tool_call|>"},
Author:

switch <|/tool_call|> to </|tool_call|>

Well, rather: quadruple-check which one is correct via the special tokens, and make sure it's consistent both here and at line 1379

Collaborator:

Let's just ditch the C++ reimplementation of the template in favour of models/templates/llama-cpp-microsoft-Phi-4-mini-instruct.jinja

Could you please add test cases for phi-4 in examples/server/tests/unit/test_tool_call.py? (w/ the chatml template, and with my proposed template, which will need adjustments; comment out the other tests and define LLAMA_CACHE & DEBUG before running if you want to save disk space and bandwidth)

Collaborator:

Reopening in case you missed the suggestion to add a test in examples/server/tests/unit/test_tool_call.py. I can also do it as a follow-up / it doesn't have to block this (looks and runs well!), but it would be good.

@ochafik
Collaborator

ochafik commented Mar 9, 2025

Hey @jpohhhh, thanks a lot for preparing this!

The official template from Microsoft is quite disappointing tbh, and while your changes work around some/most of its limitations, we might need a bit more / it might be worth going full Jinja (see below).

Original template:
{%- for message in messages -%}
  {%- if message['role'] == 'system' and 'tools' in message and message['tools'] is not none -%}
    {{- '<|' + message['role'] + '|>' + message['content'] + '<|tool|>' + message['tools'] + '<|/tool|>' + '<|end|>' -}}
  {%- else -%}
    {{- '<|' + message['role'] + '|>' + message['content'] + '<|end|>' -}}
  {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
  {{- '<|assistant|>' -}}
{%- else -%}
  {{- eos_token -}}
{%- endif -%}

The "sins" of their template are:

  • It expects tools from the system message (instead of as a global variable, as in most templates). This is worked around by Minja w/ a polyfill that prints the JSON array of tools. Unfortunately that's w/o the expected <|tool|>...<|/tool|> wrapper; the "tools in message" behaviour should be handled in https://github.com/google/minja
  • It does not print tool calls (this is worked around by Minja + the generic mode, but without the <|tool_call|> syntax)
  • It prints tool call results (messages such as {"role": "tool", "name": "foo", "content": "42"}) as <|tool|>42<|end|>... which would probably conflict with the tool-description wrapping mechanism above. It's unclear what the proper way to inject tool results is (did you find any documentation, btw?), but it surely involves <|tool_response|> (see Phi-4-mini-instruct's added tokens; possibly also <|tag|>). Note that Minja doesn't polyfill this but the generic handler does.
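The tools-in-system-message expectation described in the first point can be sketched like this (a hypothetical Python illustration of what the original template assumes, not Minja's actual polyfill code):

```python
import json

# Hypothetical sketch: the original template expects the tools to arrive as a
# JSON string inside the system message, wrapped in <|tool|>...<|/tool|>
# within the <|system|> turn.
def render_system_turn(content, tools=None):
    if tools is not None:
        return ("<|system|>" + content +
                "<|tool|>" + json.dumps(tools) + "<|/tool|>" + "<|end|>")
    return "<|system|>" + content + "<|end|>"
```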

Despite these issues, I seem to be getting good outputs w/ the generic handling on master:

cmake -B build -DLLAMA_CURL=1 && cmake --build build --config Release -j -t llama-server

export LLAMA_SERVER_BIN_PATH=$PWD/build/bin/llama-server
export LLAMA_CACHE=${LLAMA_CACHE:-$HOME/Library/Caches/llama.cpp}

./scripts/tool_bench.py run --n 10 --temp -1 --temp 0 --temp 1 --test-calc-result --model "Phi 4 mini instruct Q4_K_M" --output phi4_master.jsonl --hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
./scripts/tool_bench.py plot phi4_master.jsonl

This is just a smoke test / not a proper benchmark (trying to get BFCL running, see here), but I'm getting less success w/ your branch.

I sent you Telosnex#1 which fixes a couple of issues.

I think we should also remove much of the code in common_chat_params_init_phi_4 and heavily suggest users use a better template instead (if someone at Microsoft is reading, please update your template haha!). We could even throw an exception w/ instructions to use the "right" template when we detect a bad template.

Proposed template (still unclear how to provide `<|tool_response|>` and if `<|tag|>` should be involved):
{%- if messages[0]["role"] == "system" %}
    {%- set system_message = messages[0]["content"] %}
{% elif tools is defined -%}
    {%- set system_message = "You are a helpful assistant with access to tools." -%}
{% else %}
    {%- set system_message = "" -%}
{%- endif %}
{%- if tools is defined -%}
    {%- set system_message = system_message + '<|tool|>' + (tools | tojson) + '<|/tool|>' -%}
    {%- if '<|tool_call|>' not in system_message -%}
        {%- set system_message = system_message + "\nTo use a tool, respond in this format: <|tool_call|>{\"name\": \"foo\", \"arguments\": {\"a\": 1}}<|/tool_call|>" %}
    {%- endif %}
{%- endif %}
{%- if system_message is defined -%}
    {{- '<|system|>' + system_message + '<|end|>' -}}
{%- endif -%}
{%- for message in messages -%}
    {%- if message['role'] == 'tool' -%}
        {{- '<|tool_response|>' + (message['content'] | tojson) + '<|/tool_response|>' -}}
    {%- elif message['role'] != 'system' -%}
        {{- '<|' + message['role'] + '|>' -}}
        {%- if message.content -%}
            {{- message['content'] -}}
        {%- endif -%}  
        {%- for tool_call in message.tool_calls -%}
            {{- '<|tool_call|>' + (tool_call | tojson) + '<|/tool_call|>' -}}
        {%- endfor -%}
        {{- '<|end|>' -}}
    {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
   {{- '<|assistant|>' -}}
{%- else -%}
   {{- eos_token -}}
{%- endif -%}

@jpohhhh jpohhhh requested a review from ngxson as a code owner March 14, 2025 22:10
@github-actions github-actions bot added the documentation, examples, and server labels Mar 14, 2025
@github-actions github-actions bot added the android, Nvidia GPU, Vulkan, python, devops, ggml, SYCL, and Apple Metal labels Mar 14, 2025
jpohhhh added 3 commits March 14, 2025 23:11
- Add system message if needed (per template requirement)
- Add tools to system message (req'd by template)
- Parse output:
-- add tool calls to the response when there is valid JSON between <|tool_call|> and <|/tool_call|>
-- content outside of tool_call tags is added to the text portion of the response
-- if there is no valid JSON, the entire content is added to the text portion of the response
I made a mess while merging in Olivier's work, so it ended up merged into one commit in this branch. In this commit, I undo changes that wouldn't have been intended in it (ex. server.cpp).
jpohhhh added 2 commits March 15, 2025 00:02
via https://huggingface.co/microsoft/Phi-4-mini-instruct/blob/main/added_tokens.json
{
  "<|/tool_call|>": 200026,
  "<|/tool|>": 200024,
  "<|assistant|>": 200019,
  "<|end|>": 200020,
  "<|system|>": 200022,
  "<|tag|>": 200028,
  "<|tool_call|>": 200025,
  "<|tool_response|>": 200027,
  "<|tool|>": 200023,
  "<|user|>": 200021
}

FWIW tool_response seems to be a role, via https://github.com/kinfey/Phi-3CookBook/blob/main/md/02.Application/07.FunctionCalling/Phi4/Multiagents/Phi_4_mini_multiagent.ipynb
Phi-4 tool calling's docs comprise the Hugging Face page for Phi-4 [^1] and the Jupyter notebooks at [^2]. [^1] only has <tool_call>, and the notebook at [^3] is the only resource I've found that demonstrates tool responses.

It looks like it's used as a sort of role -- the notebook doesn't show it 100% explicitly, but the message format it uses defines tool_response as a role in a structure with text content, identical to the user/assistant messages in the same notebook.

Given that, and given that it explains another small mystery (why isn't <|/tool_response|> in the reserved tokens?), I'll apply that here.

[^1] https://huggingface.co/microsoft/Phi-4-mini-instruct
[^2] https://github.com/microsoft/PhiCookBook/tree/main/md/02.Application/07.FunctionCalling/Phi4
[^3] https://github.com/microsoft/PhiCookBook/blob/main/md/02.Application/07.FunctionCalling/Phi4/Multiagents/Phi_4_mini_multiagent.ipynb
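The role-style usage can be sketched as follows (a hypothetical Python illustration of the notebook's apparent convention, not code from the PR):

```python
# Hypothetical sketch: tool_response is treated as just another role, rendered
# <|role|>content<|end|> exactly like user/assistant turns.
def render_turn(message):
    return "<|" + message["role"] + "|>" + message["content"] + "<|end|>"

messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "tool_response", "content": '{"result": 4}'},
]
prompt = "".join(render_turn(m) for m in messages)
```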
@jpohhhh
Author

jpohhhh commented Mar 15, 2025

(seems it added reviewers automatically; can't figure out how to remove)

re: template / params_init
I see, lmk if anything sounds off here: I'm building a Flutter GUI downstream of llama.cpp via github.com/telosnex/fllama. I usually steal bits and pieces from examples to hack together my libraries. Looking at server.cpp, I see I could just pass a string to common_chat_templates_init (a la chat_templates = common_chat_templates_init(model, params_base.chat_template); in server.cpp) when I'm using a Phi-4 model.

That'll work great; it makes sense to be picky/correct about that on the client end and, generically, to use the built-in template as provided by the GGUF. It's a good mental model too -- I've been struggling with a Gemma 3 GGUF template issue re: it absolutely requiring alternating user/assistant messages.

re: code
Thank you very much for your work.

I'm afraid I made a bit of a mess while merging, but I think it all came out right.

I added 3 more commits on top of that:

  1. Revert changes to server (it was debris from the way I accidentally flattened the commits in your PR)
  2. Token consistency: make sure tokens are consistent with the actual model tokens (cross-checked against the reserved tokens list in their HF repo; I had used <|/tool_call|> in some places and </|tool_call|> in others -- the first, <|/tool_call|>, is correct)
  3. Tool responses as a role: docs on the tool calls are a bit sparse (there's nothing beyond the HF model card and Jupyter notebooks in a repo), but good news -- a couple more tool-call notebooks were published earlier this week, and they also demonstrate <|tool_response|>. It is treated like a role. This commit tweaks the Jinja to do the same (<|tool_response|> coupled with <|end|>)


jpohhhh added 4 commits March 16, 2025 15:35
1. Add <tool> tag wrapping individually for each tool
2. Avoid adding an empty system message when there are no tools and no system message in the input messages.
3. Move instructions to bottom (this worked a lot better when multiple tools added)

Re: 1

There's no concrete justification for this other than it *seems* better. Unfortunately, the only material on multiple tool calls with Phi-4 is based on Ollama, which is downstream of llama.cpp, and which either way obfuscates how it's translated into multiple tools, and is unreliable per user reports.

https://colab.research.google.com/github/kinfey/Phi-3CookBook/blob/main/md/02.Application/07.FunctionCalling/Phi4/Multiagents/Phi_4_mini_multiagent.ipynb#scrollTo=NRgfmCBShDot via https://github.com/kinfey/Phi-3CookBook and https://github.com/kinfey/Phi-3CookBook/blob/main/md/02.Application/07.FunctionCalling/Phi4/Multiagents/Phi_4_mini_multiagent.ipynb
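Change 1 above can be sketched like this (a hypothetical Python illustration of the per-tool wrapping, not the PR's Jinja/C++ code):

```python
import json

# Hypothetical sketch: each tool definition gets its own <|tool|>...<|/tool|>
# wrapper, instead of one wrapper around the whole JSON array.
def render_tools(tools):
    return "".join("<|tool|>" + json.dumps(t) + "<|/tool|>" for t in tools)
```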
@jpohhhh
Author

jpohhhh commented Mar 16, 2025

Latest is looking good: tests are passing, and the code that works around the template requiring tools in the system message has been removed.

Tweaked the template and tested it for an hour or two in my downstream testing app, LGTM.

Testing more or less checked whether it could handle multiple tools, call them correctly, and whether the post-template-application input for inference looked right.

jpohhhh added a commit to Telosnex/fllama that referenced this pull request Mar 16, 2025

@ochafik
Collaborator

ochafik commented Mar 16, 2025

Thanks @jpohhhh, looks great! If you could just skip the original template in the format detector + maybe throw in some test cases in test_tool_calls.py, I'll be happy for this to be merged (oh, and it's annoying, but please fix the whitespace nits flagged by editorconfig-checker to get a green CI)

@jpohhhh
Author

jpohhhh commented Mar 19, 2025

Ty_v_m, really appreciate the guidance -- somehow, this is my first GitHub PR ever and I'm still lost in the UI, so I missed the failing lint check -- I bet it'll fail again after these commits (I'll check back within the hour), but beyond that, this may be good to go.

Also, I should flag test_tool_call.py, specifically line 397 / test_calc_result: oddly, the override template leads to it not answering with tools, while bog-standard ChatML consistently passes (10/10 trials). I left the override template test in, commented out, with a comment, as ersatz documentation of why the override template is not tested.

Finally: I'm using your work to replace my client app's old-style JSON-grammar-calculating-and-forcing, to have Phi-4 and Gemma 3 be real agentic models. I have a handful of small questions around that, like, making sure I grok and am taking advantage of all the work --- I'm curious what you'd recommend for communicating about that, I was thinking maybe a discussion on this GitHub repo? (vs. an issue / PR)

@jpohhhh jpohhhh requested a review from ochafik March 19, 2025 18:50
@ochafik ochafik (Collaborator) left a comment

@jpohhhh Sorry for the delay!

{%- endif -%}
{%- for message in messages -%}
{%- if message['role'] == 'tool' -%}
{{- '<|tool_response|>' + (message['content'] | tojson) + '<|end|>' -}}
Collaborator:

This seems to be the cause of the test_calc_result failure. The good news is, the following (a wild hunch, inspired by other tool-call styles) makes it work:

{{- '<|tool_response|>' + message['name'] + '<|tag|>' + (message['content'] | tojson) + '<|end|>' -}}
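The suggested rendering can be sketched like this (a hypothetical Python illustration of the assumed semantics, not the actual Jinja output path):

```python
import json

# Sketch of the suggested fix: include the function name in the tool-result
# turn, separated from the JSON-encoded content by the <|tag|> special token.
def render_tool_result(message):
    return ("<|tool_response|>" + message["name"] + "<|tag|>" +
            json.dumps(message["content"]) + "<|end|>")
```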

@@ -385,6 +394,9 @@ def do_test_weather(server: ServerProcess, **kwargs):
@pytest.mark.slow
@pytest.mark.parametrize("result_override,n_predict,hf_repo,template_override", [
(None, 128, "bartowski/Phi-3.5-mini-instruct-GGUF:Q4_K_M", "chatml"),
# Answers using text, not tools, complaining it wants to measure from the positive Z-axis not x-axis.
Collaborator:

Note that this (crucial) test doesn't test tool call emission, but leveraging of tool call results. Failure indicates that the model doesn't understand the syntax used to give it the result.

@@ -306,6 +312,9 @@ def test_completion_without_tool_call_slow(template_name: str, n_predict: int, t
("bartowski/Phi-3.5-mini-instruct-GGUF:Q4_K_M", None),
("bartowski/Phi-3.5-mini-instruct-GGUF:Q4_K_M", "chatml"),

Collaborator:

Could you throw in a default (generic) test case for hello_world, weather and calc_result? Just in case their default template changes and something goes boom.

("bartowski/microsoft_Phi-4-mini-instruct-GGUF:Q4_K_M",      None),

@ochafik
Collaborator

ochafik commented Mar 24, 2025

Ty_v_m, really appreciate the guidance

@jpohhhh my pleasure!

Also, I should flag test_tool_call.py, specifically line 397/test_calc_result: oddly, the override template leads to it not answering with tools, while bog-standard ChatML consistently passes (10/10 trials). I left in the override template test, commented out, with a comment, as ersatz documentation of why the override template is not tested

That is a sign of a serious underlying issue; I proposed a magic fix in the review thread (give back the function name in the tool result, separated from the content w/ a tag, since it's one of the very few special tokens defined in this model's vocab).

Finally: I'm using your work to replace my client app's old-style JSON-grammar-calculating-and-forcing, to have Phi-4 and Gemma 3 be real agentic models. I have a handful of small questions around that, like, making sure I grok and am taking advantage of all the work --- I'm curious what you'd recommend for communicating about that, I was thinking maybe a discussion on this GitHub repo? (vs. an issue / PR)

Using discussions sounds great (or wherever you tag me 😜), and we can open issues to track any work needed as it comes up ✌️
