
server: streaming of tool calls and thoughts when --jinja is on #12379


Open · wants to merge 102 commits into master

Conversation

ochafik
Collaborator

@ochafik ochafik commented Mar 14, 2025

This PR is still WIP (see TODOs at the bottom), but early feedback / testing is welcome.

  • Support streaming of tool calls in OpenAI format
  • Improve handling of thinking models (DeepSeek R1 Distills, QwQ, Command R7B):
    • Stream <think> reasoning content inside the content (same output for all thinking models when using the default --reasoning-format deepseek, even for those not using the <think> syntax like Command R7B), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
    • Avoid spurious lazy (tool call) grammar triggers from "thoughts about tool calls" (only trigger after closing any unclosed thoughts)
  • Improve Functionary v3.2 support (allow raw python code, preferred by models over {"code": "json-encoded code"} for multiline programs)
  • Support truncated outputs incl. reasoning_content & tool_calls (returns salvageable fields when finish_reason = length)

This fixes #12107, #10920, #11861

Follow up to #9639

How to test / use

  • Get and build this PR's branch
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/tool-diffs
    cmake -B build -DLLAMA_CURL=1 # -DGGML_CUDA=1 ...
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run llama-server w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!):

    # Thoughts of Command R7B / DeepSeek R1 / QwQ will be streamed in the content inside <think> tags
    llama-server --jinja -fa -hf bartowski/Qwen_QwQ-32B-GGUF
    
    # Models w/ generic tool call support now return clean interrupted output when hitting token limit
    llama-server --jinja -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    
  • Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ],
      "stream": true
    }'
  • You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening <think> tag to the end of the prompt (QwQ, and now DeepSeek R1 too, although most GGUFs still have its initial version) and for models like Cohere Command R7B that natively use a different thinking-tag syntax (now normalized, since --reasoning-format deepseek is the default)

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

While tool calls are returned in a standard format (each w/ a function name, tool call id and JSON-encoded arguments), model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments, but not always.

Function calls and their arguments can be at various levels:

  • JSON array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
  • Standalone JSON tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
  • JSON arguments object w/ name in some prefix (e.g. Deepseek: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1})
  • Nested JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
  • No JSON / raw code string for python tool call, with two variants:
    • Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
    • Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

Side note about raw python code: <|python_tag|>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].

Now when streaming, we may have sampled only a prefix of the aforementioned output, and ideally want to parse what can be parsed out of it, sending a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still the same size): for every token coming in, I try and parse the entire output so far, with partial regex & json parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this.
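To make that concrete, here is a rough Python-flavoured sketch of the per-token loop (illustration only: parse_full_or_partial is a made-up stand-in for the partial parser, and the real implementation is C++ in common/chat.cpp):

def stream_deltas(tokens, parse_full_or_partial):
    # Re-parse the entire output after every token and diff against the last
    # message we sent back to produce OpenAI-style deltas.
    output = ""
    last = {"content": "", "tool_calls": []}
    for token in tokens:
        output += token
        msg = parse_full_or_partial(output, is_partial=True)

        # Content delta: whatever was appended since the previous parse.
        if msg["content"] != last["content"]:
            yield {"content": msg["content"][len(last["content"]):]}

        # Tool call deltas: the JSON-encoded arguments string only ever grows
        # (it is cut at a "safe" place, see below), so a suffix diff is enough.
        for i, call in enumerate(msg["tool_calls"]):
            if i >= len(last["tool_calls"]):
                yield {"tool_calls": [{"index": i, "function": {"name": call["name"], "arguments": ""}}]}
                sent_args = ""
            else:
                sent_args = last["tool_calls"][i]["arguments"]
            if call["arguments"] != sent_args:
                yield {"tool_calls": [{"index": i, "function": {"arguments": call["arguments"][len(sent_args):]}}]}
        last = msg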

Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached, or streaming still in progress), and the cut may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

  • If it happens inside an arguments object or a contents string (for generic mode), we should return it partial / truncated (and json-dumped in the case of the arguments), and diffed from the last parsed value for the streamed case
  • If it happens inside the wrapper of the arguments, then it depends. We don't want to emit a half-complete function name, but as soon as we have a complete function name we can send a diff. So we try and heal the JSON (we identify which json paths can be partially healed - because they're inside the arguments - and which ones must be dropped), and only populate a tool call if we have at least a name (sketched just below). Likewise, if there is an array of function calls with the first complete and the next partial, we want to make sure the client can start calling the first function.
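A very rough Python illustration of that decision (the real code heals via nlohmann/json's SAX interface in C++; the marker string and the brute-force healing loop below are made up for the example):

import json

HEAL = "$MARKER$"  # hypothetical healing marker, not the one actually used

def try_heal(partial: str):
    # Naive illustration: append the marker plus a few candidate closers until
    # the JSON parses (the real parser knows where it is from the SAX callbacks).
    for closer in ('', '"', '"}', '"}]', '"}]}', '"}}]}'):
        try:
            return json.loads(partial + HEAL + closer)
        except json.JSONDecodeError:
            continue
    return None

def salvageable_tool_calls(healed):
    # Keep partial arguments (re-dumped and cut at the marker); drop any call
    # that doesn't have a complete function name yet.
    calls = []
    for call in healed.get("tool_calls", []):
        name = call.get("name", "")
        if not name or HEAL in name:
            continue  # half a function name is useless to the client
        args = json.dumps(call.get("arguments", {}))
        calls.append({"name": name, "arguments": args.split(HEAL)[0]})
    return calls

# Generic-mode output truncated mid-way: the first call is complete, the second
# only has half a name, so only the first one is surfaced.
partial = '{"tool_calls": [{"name": "special_function", "arguments": {"arg1": 1}}, {"name": "spec'
print(salvageable_tool_calls(try_heal(partial)))
# [{'name': 'special_function', 'arguments': '{"arg1": 1}'}]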

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)

See examples of streamed tool call deltas
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'
data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

  • Supports partial regex parsing
    • Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
    • The trick = transform the original regex into a regex that matches in reverse from the end of the string (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group whose end indicates - in reverse - where the partial match started); see the sketch after this list.
  • Supports partial JSON parsing:
    • Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
    • Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON (to remove things we don't want to heal - e.g. function name - and cut any JSON encoded result at the "right" place, which must be somewhere inside function arguments: consume_json accepts a list of json paths under which to expect arguments objects; could be from the root = empty path if the entire json object is an arguments object)
  • Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.
  • Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)
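Here is a toy Python version of the reversed-regex trick for the literal case (the real transform in common/regex-partial.cpp handles arbitrary regexes; this sketch only handles plain literals to keep it short):

import re

def reverse_partial_pattern(literal: str) -> str:
    # Build the reversed pattern described above, e.g. 'abc' -> '((?:(?:c)?b)?a)[\s\S]*'.
    acc = ""
    for ch in reversed(literal[1:]):
        acc = f"(?:{acc}{re.escape(ch)})?"
    return f"({acc}{re.escape(literal[0])})[\\s\\S]*"

def partial_match_start(text: str, literal: str):
    # Match against the reversed text; the end of the capturing group (in reverse)
    # tells us where a possibly cut-off occurrence of `literal` starts at the end of `text`.
    m = re.match(reverse_partial_pattern(literal), text[::-1])
    return None if m is None else len(text) - m.end(1)

print(reverse_partial_pattern("abc"))                      # ((?:(?:c)?b)?a)[\s\S]*
print(partial_match_start("I will now <thi", "<think>"))   # 11: the truncated '<think>' starts there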

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax, which wraps common_reasoning_format and common_chat_format together with:

  • thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
  • reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode, as the DeepSeek API does.

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
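For illustration (the actual patterns are generated by common_chat_templates_apply; this one is made up for a Hermes-style <tool_call> marker with thinking forced open), the idea boils down to a full-output pattern whose single capturing group can only match once the thoughts have been closed:

import re

# Hypothetical full-output trigger: only fire once the open think block has been
# closed, and activate the grammar from the start of the capturing group.
TRIGGER = re.compile(r"[\s\S]*?</think>[\s\S]*?(<tool_call>)")

output = ("I could call <tool_call> for this... wait, let me finish thinking.</think>\n"
          '<tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>')
m = TRIGGER.match(output)
if m:
    # The spurious <tool_call> mentioned inside the thoughts does not trigger;
    # the sampler would be activated at the real one, i.e. at m.start(1).
    print("grammar active from offset", m.start(1))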

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

  • Fix tool call id attribution logic (disabled for now) from tool-call: ensure there's always a non-empty tool call id #12292
  • Might need one last diff in the final response after a stream, say, to close any raw python code
  • Decide what to do about logprobs for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
    • Edit: OpenAI returns null logprobs in tool call mode. Just need to ensure normal mode doesn't regress (test failing atm)
  • Fix Mistral Nemo crash (llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L)
  • Send partial regex (common_regex) as separate PR: common: add partial regex support #12808
  • Send partial JSON (common_json) as separate PR(?) or fold into chat-parser.cpp
  • Command R7B's non-tool-calling template (they have 3 templates) forces <|START_RESPONSE|> at the end of the prompt. Output will contain an <|END_RESPONSE|> that needs handling (would fit nicely in new common_chat_syntax struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
  • Add some docs
  • Add more tests
  • Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:

  • To make this faster, I suggest two options:
    • Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
    • Only compute and send deltas after N milliseconds

cc/ @jpohhhh

@github-actions github-actions bot added documentation Improvements or additions to documentation testing Everything test related examples python python script changes server labels Mar 14, 2025
@cgruver

cgruver commented May 16, 2025

For those who are following this PR, I am trying to maintain a merge from this branch and the master branch of llama.cpp here - https://github.com/cgruver/llama.cpp/tree/tools

@ochafik ochafik marked this pull request as ready for review May 16, 2025 23:03
@ochafik ochafik requested a review from ngxson as a code owner May 16, 2025 23:03
@ochafik ochafik marked this pull request as draft May 16, 2025 23:41

@pwilkin
Contributor

pwilkin commented May 19, 2025

Seems there's a bug in the current version of the code when executing streaming tool calls with reasoning models.

I'm trying it with Qwen3 and the following sequence causes a server crash:

  • user query
  • reasons -> calls tool
  • tool response
  • reasons -> tries to respond

As soon as the second reasoning section is cleared, the server crashes with:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '<think>*truncated for brevity*... The user might be looking for the slang definition, so I should highlight that. Also, note the different contexts like productivity issues and meme culture. Make sure to mention the sources and provide a concise explanation.</think>The term **"brainrot"** has multiple contextual meanings based on the search results' not found at start of '<think>*truncated for brevity*... The user might be looking for the slang definition, so I should highlight that. Also, note the different contexts like productivity issues and meme culture. Make sure to mention the sources and provide a concise explanation.</think>'

The chat that triggers it looks like this:

[
  {
    "role": "user",
    "content": "Let's search the web for \"brainrot\"."
  },
  {
    "role": "assistant",
    "tool_calls": [
      {
        "id": "0",
        "name": "search",
        "arguments": {
          "query": "brainrot"
        }
      }
    ]
  },
  {
    "role": "tool",
    "content": "*tool call dump truncated*",
    "tool_call_id": "0"
  }
]

Before that, I get warnings like this:
Grammar still awaiting trigger after token 2711 ( search) (after every token that succeeds the thinking block)

@teleprint-me
Contributor

teleprint-me commented May 20, 2025

This is super awesome! I'm having a blast with this right now. I wanted to create a simple wrapper script to test it out with Qwen3 to see how it went. So far, it seems like it's working as intended.

I followed the instructions and tested it out with curl to see how it went and got a successful response, so I took it a few steps further and built a minimal wrapper script using OpenAI tooling to test its limits.

Server request seems clean. Model is able to respond and execute the command as well. Still working my way towards chaining messages together.

Llama Server Instance Command
llama-server --port 8080 --n-gpu-layers 32 --ctx-size 16384 --pooling mean --slots --jinja -fa -m /mnt/valerie/models/Qwen/Qwen3-1.7B/ggml-model-f16.gguf
Dot Env
OPENAI_API_KEY=sk-no-key-required
OPENAI_BASE_URL=http://localhost:8080/v1
Source File
import json
import os
import sys

import dotenv
from openai import OpenAI
from openai.types.chat.chat_completion_chunk import ChatCompletionChunk

from agent.tools.weather import get_weather

ESCAPE = "\x1b"
BOLD = ESCAPE + "[1m"
UNDERLINE = ESCAPE + "[4m"
RESET = ESCAPE + "[0m"


def create_client():
    # Load environment
    dotenv.load_dotenv(".env")

    api_key = os.getenv("OPENAI_API_KEY", "")
    base_url = os.getenv("OPENAI_BASE_URL", "")

    if not api_key:
        raise ValueError("EnvironmentError: OPENAI_API_KEY not set in .env")

    # Setup default base URL if using local mode
    if api_key == "sk-no-key-required" and not base_url:
        base_url = "http://localhost:8080/v1"

    # Initialize client
    return OpenAI(api_key=api_key, base_url=base_url)


def stream_response(response):
    tool_call_buffer = ""
    buffering_tool = False
    finish_reason = None

    for chunk in response:
        if isinstance(chunk, ChatCompletionChunk):
            delta = chunk.choices[0].delta
            finish_reason = chunk.choices[0].finish_reason

            # Handle streaming reasoning
            if delta.content:
                content = delta.content
                if content == "<think>":
                    print(f"{UNDERLINE}{BOLD}Thinking{RESET}", end="\n")
                elif content == "</think>":
                    print(f"\n{UNDERLINE}{BOLD}Completion{RESET}", end="")
                else:
                    print(content, end="")
                sys.stdout.flush()

            # Handle tool call streaming
            if delta.tool_calls:
                buffering_tool = True
                for tool_call in delta.tool_calls:
                    arguments = tool_call.function.arguments or ""
                    tool_call_buffer += arguments

    print()  # Newline after stream ends

    # Dispatch if tool call is complete
    if buffering_tool and finish_reason == "tool_calls":
        try:
            tool_args = json.loads(tool_call_buffer)
            print(f"\n{UNDERLINE}{BOLD}Calling Tool...{RESET}")
            result = get_weather(**tool_args)
            print(f"\n{UNDERLINE}{BOLD}Tool Result:{RESET} {result}")
        except json.JSONDecodeError:
            print(f"{BOLD}Warning:{RESET} Failed to decode tool call arguments.")


def main():
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Retrieves current weather for the given location.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "units": {
                            "type": "string",
                            "enum": ["metric", "uscs"],
                            "description": "The unit system. Default is 'metric'.",
                        },
                    },
                    "required": ["location", "units"],
                    "additionalProperties": False,
                },
                "strict": True,
            },
        }
    ]

    # Sample chat sequence
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like in Paris today?"},
    ]

    try:
        client = create_client()
        response = client.chat.completions.create(
            model="qwen3",  # Use "gpt-4" for OpenAI, "qwen3" for local
            messages=messages,
            stream=True,
            temperature=0.8,
            tools=tools,
        )
        stream_response(response)
    except Exception as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
Llama Server Request
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 746 | processing task
slot update_slots: id  0 | task 746 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 223
slot update_slots: id  0 | task 746 | need to evaluate at least 1 token to generate logits, n_past = 223, n_prompt_tokens = 223
slot update_slots: id  0 | task 746 | kv cache rm [222, end)
slot update_slots: id  0 | task 746 | prompt processing progress, n_past = 223, n_tokens = 1, progress = 0.004484
slot update_slots: id  0 | task 746 | prompt done, n_past = 223, n_tokens = 1
slot      release: id  0 | task 746 | stop processing: n_past = 345, truncated = 0
slot print_timing: id  0 | task 746 | 
prompt eval time =      29.95 ms /     1 tokens (   29.95 ms per token,    33.38 tokens per second)
       eval time =    2268.35 ms /   123 tokens (   18.44 ms per token,    54.22 tokens per second)
      total time =    2298.30 ms /   124 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
Screencast.From.2025-05-20.00-44-34.webm

I love this. This is awesome. I've been jonesing for local memory management and this absolutely opens the door for doing exactly that. I cannot express enough gratitude for all the work that's gone into all of this. Awesome work!

I'll keep an eye out for edge cases if I think it's relevant to this PR.

One minor bug I think I already spotted is that the initial tokens are coupled.

print(f"::{content}", end="")

I just prepended a pair of colons to try to reveal why I couldn't select <think>, which I can do in the master branch.

::<think>Okay::,:: the:: user:: is:: asking:: 

<think> and Okay should be separate tokens?

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<think>Okay"}}],"created":1747719891,"id":"chatcmpl-WDzWFkqkzx7pLHN3wBk9odyvdB1szfBn","mo
del":"gpt-3.5-turbo","system_fingerprint":"b5510-810c4c32","object":"chat.completion.chunk"}

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented May 20, 2025

@teleprint-me Nice example. If there isn't already something equivalent, it'd maybe be worth adding to the examples once this is merged.

The closest I see is the curl command at the bottom of /docs/function-calling.md, but yours is a fairly complete python example.

@khimaros
Contributor

tool call support also works quite well in Cherry Studio, and streaming can be disabled for those with an unpatched server

@unclemusclez

unclemusclez commented May 20, 2025

As I understand it, for this update to work correctly with Qwen 3, #13189 / #13196 also need to be merged?

Thank you!

@pwilkin
Contributor

pwilkin commented May 20, 2025

I mean, the thing is, tool calling should work with thinking. The models lose quite a bit of quality without thinking enabled.

@teleprint-me
Contributor

teleprint-me commented May 20, 2025

@strawberrymelonpanda The README.md links to the OpenAI REST API docs and specification, which explain how to use it.

It makes sense that it's not included in there because it's already described by those docs.

If collaborators are inclined to include it, I'm okay with that. This is just boilerplate.

I think it would be more useful to have a simple example for this in C/C++ instead, one that exposes the core Llama and GGML API functions and shows how to use them. That would be more relevant.

@unclemusclez That's interesting. According to the official Qwen3 README.md, you can use a shorthand by appending /no_think to the user input to disable it. I tested this last night and it worked fine. The template remains the same. Though, I haven't looked into it in depth.
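For example, something like this should do it (a minimal sketch; the suffix is just appended to the user message):

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # Qwen3 soft switch: "/no_think" at the end of the user turn disables thinking.
    {"role": "user", "content": "What is the weather like in Paris today? /no_think"},
]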

I've been MIA because I'm learning the fundamentals of neural networks, transformers, Vulkan compute, etc., which is challenging and time consuming. I'm filling in the gaps in my understanding of how this tech works at a basic level.

@unclemusclez

unclemusclez commented May 20, 2025

(quoting @teleprint-me's comment above)

@teleprint-me so this means just adding /no_think to every single prompt? There is no environment flag? I don't have to run llama-server differently? Yes.

So this is just a codex issue. I'm not sure if N8N has a problem with thinking, but /no_think will work.

I was under the impression that some of the magic is lost without thinking, but at least this works.

@taha-yassine

(quoting @teleprint-me's comment above)

@teleprint-me so this means just adding /no_think to every single prompt? There is no environment flag? I don't have to run llama-server differently?

/no_think is just a soft switch meaning it's up to the model to honor it or not. Proper support for enable_thinking would be more robust.

@teleprint-me
Contributor

It was an experiment and a lighthearted observation. I would recommend reading the documentation and making your own interpretation; that's what I did.

As for how it operates internally, there's usually a paper, or if you're daring you can dive deep into the transformers and torch code to see how it works.

I've done this and I hated every moment of it, but I learned a lot along the way. I even figured out how to manipulate the internals of the interfaces to build my own custom diffusion pipelines, but I find text and chat completions to be highly satisfying and rewarding in comparison.

Probably why I like and prefer llama.cpp compared to other options.

I have nothing else to add at the moment as this is now off-topic. RTFM.

@teleprint-me
Contributor

teleprint-me commented May 21, 2025

So, I've been hacking together a simple agentic system to experiment with this PR and ran into something interesting.

I eventually realized (after reviewing the backtrace) that the server was comparing inputs to previous outputs, and these comparisons seem fairly strict. If the input is not exactly the same as the output, the server throws an error, raises an exception, and then crashes.

Basically, I was omitting the model's previous thought process from the aggregated message sequence, and it crashes after performing compute_diffs around line 86.

In this PR, we can see common_chat_msg_diff defines this set of rules. Lines 72 to 84 in chat.h to be specific.

Modifying the code by making a few experimental tweaks to reconstruct the original message resolved the initial server crash.

I'm not sure if it was my code, this PR, or something that was baked into the server over the past year, but I'm wondering if it is absolutely necessary and why such a strict rule set is in place; I've been out of the loop for some time.

If this is out of scope for this PR, feel free to let me know and suggest where this might be more appropriate. I posted this here because I'm currently experimenting with this PR specifically.


I've provided a detailed summary including the backtrace at the time the server crashed below.

GDB Runtime Output
03:36:35 | ~/.bin/cpp/llama.cpp
 git:(tool-diffs | θ) λ gdb --quiet -ex='break main' -ex=run --args llama-server --port 8080 --n-gpu-layers 32 --ctx-size 16384 --pooling mean --slots --jinja -fa -m /mnt/valerie/models/Qwen/Qwen3-1.7B/ggml-model-f16.gguf
Reading symbols from llama-server...
Breakpoint 1 at 0xa8584: file /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp, line 3582.
Starting program: /mnt/paper/.bin/llama-server --port 8080 --n-gpu-layers 32 --ctx-size 16384 --pooling mean --slots --jinja -fa -m /mnt/valerie/models/Qwen/Qwen3-1.7B/ggml-model-f16.gguf

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
Downloading separate debug info for /lib64/ld-linux-x86-64.so.2
Downloading 5.05 M separate debug info for /usr/lib/libcurl.so.4                                                                                                                                                                                                                                                                
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.                                                                                                                                                                                                                                         
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
Downloading 3.07 M separate debug info for /usr/lib/libm.so.6
Downloading 9.66 M separate debug info for /usr/lib/libc.so.6                                                                                                                                                                                                                                                                   
[Thread debugging using libthread_db enabled]                                                                                                                                                                                                                                                                                   
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Downloading 476.11 K separate debug info for /usr/lib/libnghttp3.so.9
Downloading 550.52 K separate debug info for /usr/lib/libnghttp2.so.14                                                                                                                                                                                                                                                          
Downloading 108.19 K separate debug info for /usr/lib/libidn2.so.0                                                                                                                                                                                                                                                              
Downloading 1.22 M separate debug info for /usr/lib/libssh2.so.1                                                                                                                                                                                                                                                                
Downloading 53.44 K separate debug info for /usr/lib/libpsl.so.5                                                                                                                                                                                                                                                                
Downloading 4.02 M separate debug info for /usr/lib/libssl.so.3                                                                                                                                                                                                                                                                 
Downloading 14.15 M separate debug info for /usr/lib/libcrypto.so.3                                                                                                                                                                                                                                                             
Downloading 1.75 M separate debug info for /usr/lib/libgssapi_krb5.so.2                                                                                                                                                                                                                                                         
Downloading 5.77 M separate debug info for /usr/lib/libzstd.so.1                                                                                                                                                                                                                                                                
Downloading 214.80 K separate debug info for /usr/lib/libbrotlidec.so.1                                                                                                                                                                                                                                                         
Downloading 194.53 K separate debug info for /usr/lib/libz.so.1                                                                                                                                                                                                                                                                 
Downloading 3.40 M separate debug info for /usr/lib/libvulkan.so.1                                                                                                                                                                                                                                                              
Downloading 1.62 M separate debug info for /usr/lib/libunistring.so.5                                                                                                                                                                                                                                                           
Downloading 2.70 M separate debug info for /usr/lib/libkrb5.so.3                                                                                                                                                                                                                                                                
Downloading 685.64 K separate debug info for /usr/lib/libk5crypto.so.3                                                                                                                                                                                                                                                          
Downloading 24.72 K separate debug info for /usr/lib/libcom_err.so.2                                                                                                                                                                                                                                                            
Downloading 140.23 K separate debug info for /usr/lib/libkrb5support.so.0                                                                                                                                                                                                                                                       
Downloading 34.54 K separate debug info for /usr/lib/libkeyutils.so.1                                                                                                                                                                                                                                                           
Downloading 207.47 K separate debug info for /usr/lib/libresolv.so.2                                                                                                                                                                                                                                                            
Downloading 22.55 K separate debug info for /usr/lib/libbrotlicommon.so.1                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                
Breakpoint 1, main (argc=14, argv=0x7fffffffd918) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:3582
3582	int main(int argc, char ** argv) {
(gdb) c
Continuing.
Downloading 12.21 M separate debug info for /usr/lib/libvulkan_radeon.so
Downloading 401.89 M separate debug info for /usr/lib/libLLVM.so.19.1                                                                                                                                                                                                                                                           
Downloading 922.55 K separate debug info for /usr/lib/libelf.so.1                                                                                                                                                                                                                                                               
Downloading 39.01 K separate debug info for /usr/lib/libxcb-dri3.so.0                                                                                                                                                                                                                                                           
Downloading 146.31 K separate debug info for /usr/lib/libwayland-client.so.0                                                                                                                                                                                                                                                    
Downloading 500.05 K separate debug info for /usr/lib/libxcb.so.1                                                                                                                                                                                                                                                               
Downloading 34.59 K separate debug info for /usr/lib/libX11-xcb.so.1                                                                                                                                                                                                                                                            
Downloading 25.20 K separate debug info for /usr/lib/libxcb-present.so.0                                                                                                                                                                                                                                                        
Downloading 72.90 K separate debug info for /usr/lib/libxcb-xfixes.so.0                                                                                                                                                                                                                                                         
Downloading 57.62 K separate debug info for /usr/lib/libxcb-sync.so.1                                                                                                                                                                                                                                                           
Downloading 159.27 K separate debug info for /usr/lib/libxcb-randr.so.0                                                                                                                                                                                                                                                         
Downloading 24.12 K separate debug info for /usr/lib/libxcb-shm.so.0                                                                                                                                                                                                                                                            
Downloading 11.27 K separate debug info for /usr/lib/libxshmfence.so.1                                                                                                                                                                                                                                                          
Downloading 17.48 K separate debug info for /usr/lib/libxcb-keysyms.so.1                                                                                                                                                                                                                                                        
Downloading 242.70 K separate debug info for /usr/lib/libdrm.so.2                                                                                                                                                                                                                                                               
Downloading 1.49 M separate debug info for /usr/lib/libudev.so.1                                                                                                                                                                                                                                                                
Downloading 370.80 K separate debug info for /usr/lib/libexpat.so.1                                                                                                                                                                                                                                                             
Downloading 139.80 K separate debug info for /usr/lib/libdrm_amdgpu.so.1                                                                                                                                                                                                                                                        
Downloading 36.17 M separate debug info for /usr/lib/libSPIRV-Tools.so                                                                                                                                                                                                                                                          
Downloading 121.48 K separate debug info for /usr/lib/libffi.so.8                                                                                                                                                                                                                                                               
Downloading 610.80 K separate debug info for /usr/lib/libedit.so.0                                                                                                                                                                                                                                                              
Downloading 3.53 M separate debug info for /usr/lib/libxml2.so.16                                                                                                                                                                                                                                                               
Downloading 29.59 K separate debug info for /usr/lib/libXau.so.6                                                                                                                                                                                                                                                                
Downloading 52.94 K separate debug info for /usr/lib/libXdmcp.so.6                                                                                                                                                                                                                                                              
Downloading 98.30 K separate debug info for /usr/lib/libcap.so.2                                                                                                                                                                                                                                                                
Downloading 1.70 M separate debug info for /usr/lib/libncursesw.so.6                                                                                                                                                                                                                                                            
Downloading 889.26 K separate debug info for /usr/lib/liblzma.so.5                                                                                                                                                                                                                                                              
Downloading 10.53 M separate debug info for /usr/lib/libicuuc.so.76                                                                                                                                                                                                                                                             
Downloading 30.38 M separate debug info for /usr/lib/libicudata.so.76                                                                                                                                                                                                                                                           
[New Thread 0x7fffe8bff6c0 (LWP 70357)]                                                                                                                                                                                                                                                                                         
[New Thread 0x7fffe3fff6c0 (LWP 70358)]
[New Thread 0x7fffe37fe6c0 (LWP 70359)]
[New Thread 0x7fffe2ffd6c0 (LWP 70360)]
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon RX 7600 XT (RADV NAVI33))
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 7 7700X 8-Core Processor)
load_backend: failed to find ggml_backend_init in /mnt/paper/.bin/cpp/llama.cpp/build/bin/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /mnt/paper/.bin/cpp/llama.cpp/build/bin/libggml-cpu.so
[New Thread 0x7fffe27fc6c0 (LWP 70361)]
build: 5510 (810c4c32) with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
[New Thread 0x7fffe1ffb6c0 (LWP 70362)]
[New Thread 0x7fffe17fa6c0 (LWP 70363)]
[New Thread 0x7fffe0ff96c0 (LWP 70364)]
[New Thread 0x7fffcbfff6c0 (LWP 70365)]
[New Thread 0x7fffcb7fe6c0 (LWP 70366)]
[New Thread 0x7fffcaffd6c0 (LWP 70367)]
[New Thread 0x7fffca7fc6c0 (LWP 70368)]
[New Thread 0x7fffc9ffb6c0 (LWP 70369)]
[New Thread 0x7fffc97fa6c0 (LWP 70370)]
[New Thread 0x7fffc8ff96c0 (LWP 70371)]
[New Thread 0x7fffc87f86c0 (LWP 70372)]
[New Thread 0x7fffc7ff76c0 (LWP 70373)]
[New Thread 0x7fffc77f66c0 (LWP 70374)]
[New Thread 0x7fffc6ff56c0 (LWP 70375)]
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15
main: loading model
srv    load_model: loading model '/mnt/valerie/models/Qwen/Qwen3-1.7B/ggml-model-f16.gguf'
[New Thread 0x7fffc67f46c0 (LWP 70376)]
[New Thread 0x7fffc5ff36c0 (LWP 70377)]
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 7600 XT (RADV NAVI33)) - 16368 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 311 tensors from /mnt/valerie/models/Qwen/Qwen3-1.7B/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 1.7B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 1.7B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-1.7...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 1.7B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-1.7...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  12:                          qwen3.block_count u32              = 28
llama_model_loader: - kv  13:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv  14:                     qwen3.embedding_length u32              = 2048
llama_model_loader: - kv  15:                  qwen3.feed_forward_length u32              = 6144
llama_model_loader: - kv  16:                 qwen3.attention.head_count u32              = 16
llama_model_loader: - kv  17:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  21:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  22:                          general.file_type u32              = 1
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type  f16:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 3.78 GiB (16.00 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2048
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1.7B
print_info: model params     = 2.03 B
print_info: general.name     = Qwen3 1.7B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:      Vulkan0 model buffer size =  3281.97 MiB
load_tensors:   CPU_Mapped model buffer size =   593.50 MiB
.......................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1, padding = 256
llama_kv_cache_unified:    Vulkan0 KV buffer size =  1792.00 MiB
llama_kv_cache_unified: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_context:    Vulkan0 compute buffer size =   300.75 MiB
llama_context: Vulkan_Host compute buffer size =    36.01 MiB
llama_context: graph nodes  = 959
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[New Thread 0x7fff22d116c0 (LWP 70379)]
[New Thread 0x7fff225106c0 (LWP 70380)]
[New Thread 0x7fff21d0f6c0 (LWP 70381)]
[New Thread 0x7fff2150e6c0 (LWP 70382)]
[New Thread 0x7fff20d0d6c0 (LWP 70383)]
[New Thread 0x7fff13fff6c0 (LWP 70384)]
[New Thread 0x7fff137fe6c0 (LWP 70385)]
[New Thread 0x7fff12ffd6c0 (LWP 70386)]
[New Thread 0x7fff127fc6c0 (LWP 70387)]
[New Thread 0x7fff11ffb6c0 (LWP 70388)]
[New Thread 0x7fff117fa6c0 (LWP 70389)]
[Thread 0x7fff225106c0 (LWP 70380) exited]
[Thread 0x7fff22d116c0 (LWP 70379) exited]
[Thread 0x7fff13fff6c0 (LWP 70384) exited]
[Thread 0x7fff11ffb6c0 (LWP 70388) exited]
[Thread 0x7fff21d0f6c0 (LWP 70381) exited]
[Thread 0x7fff117fa6c0 (LWP 70389) exited]
[Thread 0x7fff137fe6c0 (LWP 70385) exited]
[Thread 0x7fff12ffd6c0 (LWP 70386) exited]
[Thread 0x7fff127fc6c0 (LWP 70387) exited]
[Thread 0x7fff2150e6c0 (LWP 70382) exited]
[Thread 0x7fff20d0d6c0 (LWP 70383) exited]
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 16384
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set content = message.content %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in message.content %}
                {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
                {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- if loop.index0 > ns.last_query_index %}
            {%- if loop.last or (not loop.last and reasoning_content) %}
                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
            {%- else %}
                {{- '<|im_start|>' + message.role + '\n' + content }}
            {%- endif %}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}
{%- endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 232
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 232, n_tokens = 232, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 232, n_tokens = 232
[New Thread 0x7fff117fa6c0 (LWP 70434)]
[Thread 0x7fff117fa6c0 (LWP 70434) exited]
[New Thread 0x7fff117fa6c0 (LWP 70435)]
[Thread 0x7fff117fa6c0 (LWP 70435) exited]
slot      release: id  0 | task 0 | stop processing: n_past = 379, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     191.82 ms /   232 tokens (    0.83 ms per token,  1209.49 tokens per second)
       eval time =    2716.36 ms /   148 tokens (   18.35 ms per token,    54.48 tokens per second)
      total time =    2908.18 ms /   380 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 149 | processing task
slot update_slots: id  0 | task 149 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 232
slot update_slots: id  0 | task 149 | need to evaluate at least 1 token to generate logits, n_past = 232, n_prompt_tokens = 232
slot update_slots: id  0 | task 149 | kv cache rm [231, end)
slot update_slots: id  0 | task 149 | prompt processing progress, n_past = 232, n_tokens = 1, progress = 0.004310
slot update_slots: id  0 | task 149 | prompt done, n_past = 232, n_tokens = 1
slot      release: id  0 | task 149 | stop processing: n_past = 399, truncated = 0
slot print_timing: id  0 | task 149 | 
prompt eval time =      30.24 ms /     1 tokens (   30.24 ms per token,    33.07 tokens per second)
       eval time =    3104.30 ms /   168 tokens (   18.48 ms per token,    54.12 tokens per second)
      total time =    3134.54 ms /   169 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 318 | processing task
slot update_slots: id  0 | task 318 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 273
slot update_slots: id  0 | task 318 | kv cache rm [232, end)
slot update_slots: id  0 | task 318 | prompt processing progress, n_past = 273, n_tokens = 41, progress = 0.150183
slot update_slots: id  0 | task 318 | prompt done, n_past = 273, n_tokens = 41
slot      release: id  0 | task 318 | stop processing: n_past = 428, truncated = 0
slot print_timing: id  0 | task 318 | 
prompt eval time =      96.86 ms /    41 tokens (    2.36 ms per token,   423.31 tokens per second)
       eval time =    2860.68 ms /   156 tokens (   18.34 ms per token,    54.53 tokens per second)
      total time =    2957.54 ms /   197 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 475 | processing task
slot update_slots: id  0 | task 475 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 397
slot update_slots: id  0 | task 475 | kv cache rm [273, end)
slot update_slots: id  0 | task 475 | prompt processing progress, n_past = 397, n_tokens = 124, progress = 0.312343
slot update_slots: id  0 | task 475 | prompt done, n_past = 397, n_tokens = 124
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '<think>Okay, the user asked for the weather in Paris, France. I called the get_weather function with the location set to "Paris, France" and units as "metric". The tool response came back with the weather data. Now I need to present this information in a clear and friendly way.

First, I'll check the response string to see the details. The date and time are "09:47:38+0200 06:01:51 21:33:35" which might be the time zone and other timestamps. The weather is "Overcast ↓9km/h +14°C". 

I should break this down. The temperature is 14°C, with a wind speed of 9 km/h. The weather condition is overcast. The time zone is +2 hours, so the local time is 2 hours ahead of UTC. 

I need to mention the temperature, wind speed, weather condition, and time zone. Also, make sure to mention that it's overcast and the wind is coming from the south. The user might want to know if they should carry an umbrella, but since it's overcast, maybe just note the conditions. 

I should structure the response with a friendly greeting, the current weather, and the time zone. Keep it concise but informative. Let me put that all together in a natural way.</think>The current weather in Paris, France is **Overcast** with a wind speed of **9 km/h** and a temperature of **+14°C**. The local time is' not found at start of '<think>Okay, the user asked for the weather in Paris, France. I called the get_weather function with the location set to "Paris, France" and units as "metric". The tool response came back with the weather data. Now I need to present this information in a clear and friendly way.

First, I'll check the response string to see the details. The date and time are "09:47:38+0200 06:01:51 21:33:35" which might be the time zone and other timestamps. The weather is "Overcast ↓9km/h +14°C". 

I should break this down. The temperature is 14°C, with a wind speed of 9 km/h. The weather condition is overcast. The time zone is +2 hours, so the local time is 2 hours ahead of UTC. 

I need to mention the temperature, wind speed, weather condition, and time zone. Also, make sure to mention that it's overcast and the wind is coming from the south. The user might want to know if they should carry an umbrella, but since it's overcast, maybe just note the conditions. 

I should structure the response with a friendly greeting, the current weather, and the time zone. Keep it concise but informative. Let me put that all together in a natural way.</think>'

Thread 1 "llama-server" received signal SIGABRT, Aborted.
Downloading 4.48 K source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44                                                                                                                                                                                                       
44	     return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff54a7813 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:89
#2  0x00007ffff544ddc0 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff543557a in __GI_abort () at abort.c:73
#4  0x00007ffff5697bf8 in __gnu_cxx::__verbose_terminate_handler () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#5  0x00007ffff56b1c1a in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#6  0x00007ffff56975db in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#7  0x00007ffff56b1ed6 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x555555a5e2c0 <typeinfo for std::runtime_error@GLIBCXX_3.4>, dest=0x7ffff56c99b0 <std::runtime_error::~runtime_error()>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
#8  0x00005555557e1917 in string_diff (last="<think>Okay, the user asked for the weather in Paris, France. I called the get_weather function with the location set to \"Paris, France\" and units as \"metric\". The tool response came back with the wea"..., 
    current="<think>Okay, the user asked for the weather in Paris, France. I called the get_weather function with the location set to \"Paris, France\" and units as \"metric\". The tool response came back with the wea"...) at /home/austin/.bin/cpp/llama.cpp/common/chat.cpp:34
#9  0x00005555557e2834 in common_chat_msg_diff::compute_diffs (previous_msg=..., new_msg=...) at /home/austin/.bin/cpp/llama.cpp/common/chat.cpp:86
#10 0x0000555555643e54 in server_slot::update_chat_msg (this=0x5555565291c0, diffs=std::vector of length 0, capacity 0) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:1423
#11 0x000055555564ea70 in server_context::send_partial_response (this=0x7fffffffc0c0, slot=..., tkn=...) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:2457
#12 0x000055555564d700 in server_context::process_token (this=0x7fffffffc0c0, result=..., slot=...) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:2255
#13 0x0000555555654c17 in server_context::update_slots (this=0x7fffffffc0c0) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:3426
#14 0x00005555555fc539 in operator() (__closure=0x7fffffffd680) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:4834
#15 0x000055555560a1c4 in std::__invoke_impl<void, main(int, char**)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/15.1.1/bits/invoke.h:63
#16 0x000055555560840e in std::__invoke_r<void, main(int, char**)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/15.1.1/bits/invoke.h:113
#17 0x00005555556044de in std::_Function_handler<void(), main(int, char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/15.1.1/bits/std_function.h:292
#18 0x000055555565b60c in std::function<void()>::operator() (this=0x7fffffffd680) at /usr/include/c++/15.1.1/bits/std_function.h:593
#19 0x000055555564675f in server_queue::start_loop (this=0x7fffffffd560) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:1690
#20 0x00005555555fed7e in main (argc=14, argv=0x7fffffffd918) at /home/austin/.bin/cpp/llama.cpp/tools/server/server.cpp:4859
(gdb) Quit
(gdb) quit
A debugging session is active.

	Inferior 1 [process 70237] will be killed.

Quit anyway? (y or n) y
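For context on the abort: if I'm reading the backtrace right, string_diff in common/chat.cpp assumes the previously streamed content is a strict prefix of the newly parsed content, and throws when it isn't. A rough Python sketch of that assumption (my paraphrase from the error message and backtrace, not the actual C++ code):

```python
def string_diff(last: str, current: str) -> str:
    """Return the newly generated suffix, assuming `current` strictly extends `last`."""
    if not current.startswith(last):
        # This is the condition that aborts the server here: after the tool
        # response, the re-parsed message content no longer starts with the
        # content that was already streamed out, so no incremental diff exists.
        raise RuntimeError(f"Invalid diff: {last!r} not found at start of {current!r}")
    return current[len(last):]
```

In my crash above, `last` still contains the text after `</think>` ("The current weather in Paris ... The local time is") while `current` stops right at `</think>`, so the prefix check fails.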

Below is the output when the response is strictly reconstructed from the streamed deltas during SSE streaming.
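By "reconstruction" I just mean accumulating the streamed deltas client-side, roughly like the sketch below (assuming the OpenAI-compatible /v1/chat/completions endpoint and the official openai Python client; my actual agent also handles tool calls, so this is only the gist):

```python
from openai import OpenAI  # official openai Python client

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

reasoning_parts: list[str] = []
content_parts: list[str] = []

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello! My name is Austin. What is your name?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is a llama.cpp extension to the delta; guard for absence
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        reasoning_parts.append(reasoning)
    if delta.content:
        content_parts.append(delta.content)

print("Thinking:\n" + "".join(reasoning_parts))
print("Completion:\n" + "".join(content_parts))
```

The Thinking:/Completion: sections in the transcript below are printed from those two accumulators.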

Successful execution
16:21:39 | /mnt/valerie/public/agent
(.venv) git:(main | Δ) λ python -m agent
<user> Hello! My name is Austin. What is your name?
Thinking:
Okay, the user said, "Hello! My name is Austin. What is your name?" Let me think about how to respond.

First, the user introduced themselves as Austin and is asking my name. I need to provide a friendly response. Since my name is Qwen, I should mention that. But I should make it sound natural and not too robotic.

Maybe start with a greeting, then state my name. Also, offer assistance. Let me check if there's any function I need to call here. The tools provided include get_weather, but the user isn't asking about weather. They're just asking for my name. So no function call is needed here. Just a simple response.
Completion:
I'm Qwen! 👋 I'm here to help with any questions or tasks you have. How can I assist you today?
<user> What is the weather like in Paris, France today?
Thinking:
Okay, the user is asking about the weather in Paris, France today. Let me check the tools available. There's a function called get_weather that retrieves current weather for a location. The parameters required are location and units, with units having an enum of metric or uscs, defaulting to metric.

So, I need to call get_weather with location set to "Paris, France" and units as "metric" since the user didn't specify. The function will return the current weather data. Then I can present that information to the user in a friendly manner. Let me make sure the location is correctly formatted as "Paris, France" and the units are set to metric. No other parameters are needed. Alright, time to generate the tool call.
Completion:
Tool Call:
get_weather({'location': 'Paris, France', 'units': 'metric'}): Paris, France 22:22:31+0200 06:01:51 21:33:35 Light rain ↓11km/h +11°C
Thinking:
Okay, the user asked about the weather in Paris, France. I called the get_weather function with the location and units set to metric. The tool response came back with the weather data. Now I need to present this information in a user-friendly way.

Let me parse the tool_response. The time is 22:22:31+0200, which is 10:22 PM local time. The temperature is +11°C, and there's light rain with a wind speed of 11 km/h. The unit system is metric, so that's consistent. 

I should mention the current time, temperature, weather condition, and wind speed. Also, note the light rain and the direction (↓11km/h). Make sure to keep it concise but informative. Avoid technical jargon. Maybe start with the current time, then temperature, then weather description, and wind details. End with a friendly note asking if they need more info. 

Check for any errors in the data. The temperature is +11°C, which is a bit cool, but that's the data provided. The light rain is mentioned, so mention that. Wind speed is 11 km/h, which is a gentle breeze. 

Alright, putting it all together in a clear sentence structure.
Completion:
The current weather in Paris, France is **11°C** with **light rain**. The wind is blowing at **11 km/h** from the south (↓). It's a mild day with comfortable temperatures. Would you like to check another location? 😊
<user> 

NOTE: The trailing <user> indicates the program is waiting for user input. If I continue the session, the server still crashes with similar results and output.

<user> What should I wear for weather like this?
Thinking:
Okay, the user just asked, "What should I wear for weather like this?" After I provided the weather details for Paris, they're now asking about appropriate clothing based on the current conditions.

First, I need to recall the weather data from the previous response. The temperature is 11°C, there's light rain, and the wind is 11 km/h from the south. So, the weather is cool with light precipitation.

Now, the user wants to know what to wear. Let me think about typical clothing for such conditions. Since it's 11°C, which is a bit cold, they might need a light jacket or a sweater. Light rain means they should have waterproof gear. The wind speed is 11 km/h, so a windbreaker or a jacket with a hood could be useful.

I should mention layers, like a long-sleeve shirt and a jacket. Also, waterproof clothing like a raincoat or umbrella. Maybe suggest breathable fabrics to stay comfortable. Also, a hat and gloves if the wind is a factor.

I should structure the answer clearly, starting with the main items, then accessories. Make sure to mention the temperature and weather conditions. Avoid any technical terms. Keep it friendly and helpful.

Check if there's any specific advice for the wind direction. The wind is from the south, so maybe suggest a wind barrier. But since the user didn't specify, keep it general.

Also, maybe add a note about checking the weather again if they're going out. But since they already asked about the current weather, maybe not necessary.

Alright, time to put it all together in a coherent response.
Completion:
For the current weather in Paris (11°C with light rain andTraceback (most recent call last):
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 127, in __iter__
    for part in self._httpcore_stream:
                ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpcore/_sync/connection_pool.py", line 407, in __iter__
    raise exc from None
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpcore/_sync/connection_pool.py", line 403, in __iter__
    for part in self._stream:
                ^^^^^^^^^^^^
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 342, in __iter__
    raise exc
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 334, in __iter__
    for chunk in self._connection._receive_response_body(**kwargs):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 203, in _receive_response_body
    event = self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 213, in _receive_event
    with map_exceptions({h11.RemoteProtocolError: RemoteProtocolError}):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/austin/.local/share/python/3.12.8/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/mnt/valerie/public/agent/.venv/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

The above exception was the direct cause of the following exception:
# ...omitting remaining output for brevity

I'm still digging into this, but I felt it was worth sharing. Maybe someone else has run into similar issues and can tell whether the problem is something I'm doing, how the server is set up at the moment, or just the current instability of this PR.

Note that normally I would just roll my own hand-written API, but it's convenient to have everything behind the OpenAI-compatible REST API: less mental overhead, less maintenance, etc. I also just wanted to experiment with this; it's nothing serious, just a toy project.

I'll be looking into this in more detail.

@pwilkin
Contributor

pwilkin commented May 21, 2025

@teleprint-me It seems like that's exactly the same error that I got above.

@ochafik
Collaborator Author

ochafik commented May 24, 2025

@teleprint-me, @pwilkin, thanks for the feedback! Would you be able to share a curl command similar to the one in the docs + --verbose output log of llama-server? Trying to repro :-)

@ochafik ochafik marked this pull request as ready for review May 24, 2025 09:04
@ericcurtin ericcurtin requested a review from Copilot May 24, 2025 15:18

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces streaming of tool calls and reasoning content when using the --jinja flag while also improving partial JSON parsing and proper handling of grammar triggers. Key changes include:

  • Addition of streaming request utilities and extensive updates to unit tests for tool call and chat completion behavior.
  • Integration of a unified chat syntax structure in server.cpp and changes in sampling.cpp to better handle grammar trigger patterns.
  • Major enhancements in the JSON partial parsing/healing logic and chat parser to recover truncated model outputs.

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tools/server/tests/utils.py Added a new make_any_request method to support both streaming and non-streaming calls.
tools/server/tests/unit/test_tool_call.py Updated test functions to support streaming and adjusted assertions for tool calls.
tools/server/tests/unit/test_chat_completion.py Modified stream assertion checks (using is None rather than empty string).
tools/server/server.cpp Integrated common_chat_syntax and refined grammar trigger handling and logging.
tests/test-json-partial.cpp Added comprehensive tests for partial JSON healing with various fragment examples.
tests/test-chat-parser.cpp Expanded test coverage for chat parsing, including reasoning and regex consumption.
src/llama-grammar.cpp Updated grammar trigger handling to capture the first non-empty capturing group.
scripts/tool_bench.py Enhanced command-line benchmarking parameters by adding chat_template_file support.
models/templates/README.md & Qwen-QwQ-32B.jinja Added documentation and a new Jinja template for Qwen-QwQ-32B.
docs/function-calling.md Included a caution regarding extreme KV quantizations affecting tool calling.
common/sampling.cpp Refactored grammar trigger pattern concatenation by removing patterns_at_start logic.
common/json-partial.* Introduced and refined JSON healing logic with detailed parsing and recovery steps.
common/chat*.{h,cpp} Updated chat structures and the parser to leverage the new chat syntax and reasoning.
Comments suppressed due to low confidence (2)

tools/server/tests/unit/test_tool_call.py:98

  • Consider removing or re-enabling the commented assertion for a non-empty tool call id if it is no longer applicable. This cleanup can help avoid confusion in test outputs and maintain code clarity.
    # assert len(tool_call.get("id", "") > 0, f'Expected non empty tool call id in {tool_call}')

common/sampling.cpp:164

  • [nitpick] The variable name 'trigger_patterns' (previously replacing 'patterns_at_start') could be made more descriptive. Consider renaming it (e.g. to 'grammar_trigger_patterns') for clarity and consistency with its usage later in the code.
    std::vector<std::string> trigger_patterns;

it = temptative_end;
return true;
} catch (const std::exception & ex) {
// No, needs healing.

Copilot AI May 24, 2025


The healing logic for partial JSON parsing is quite complex. Adding detailed inline comments explaining the steps and decisions—especially the conditions under which different closing characters are appended—would improve maintainability and clarity.
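For readers of this thread, the gist of the healing approach (as I understand it) is to append whichever closing characters are still open so that a truncated fragment parses again. A rough, self-contained Python sketch of that idea (illustrative only, not the actual common/json-partial.cpp implementation, which also handles healing markers, escapes cut mid-sequence, and other cases):

```python
import json

def heal_partial_json(fragment: str) -> dict:
    """Best-effort parse of a truncated JSON object by appending closing characters."""
    stack: list[str] = []   # pending closers for open objects/arrays
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    closers = ('"' if in_string else "") + "".join(reversed(stack))
    return json.loads(fragment + closers)

print(heal_partial_json('{"name": "python", "arguments": {"code": "print(1)'))
# -> {'name': 'python', 'arguments': {'code': 'print(1)'}}
```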


Collaborator

@ericcurtin ericcurtin left a comment


I'd be a fan of merging this, even if it's imperfect, so the community can build upon it.

Labels
documentation, examples, python, script, server, testing, tool calling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Eval bug: llama-cpp-deepseek-r1.jinja template will miss the <think> tag