[Bug]: handle tool calls in analysis channel #28139
base: main
Conversation
Code Review
This pull request addresses a bug where tool calls in the analysis channel were not being handled by the gpt-oss streaming parser. The fix correctly extends the tool call handling logic to include the analysis channel, in addition to the commentary channel. This is a robust solution that aligns with the OpenAI Harmony documentation. The addition of a complex streaming test case that reproduces the issue is a great way to ensure the fix is effective. However, I have one concern regarding the resource requirements of the new test, which I've detailed in a specific comment.
| "--enforce-eager", | ||
| "--max-model-len", | ||
| "4096", | ||
| "40960", |
Increasing max-model-len to 40960 is a significant 10x increase. As you noted in the PR description, this large context window might not be acceptable for CI environments as it could lead to out-of-memory errors or significantly slow down test execution, potentially impacting CI stability. Have you considered mocking the model response or creating a more minimal, targeted unit test to verify this logic without requiring such a large context? This would make the test more robust and less resource-intensive.
Code Review
This pull request addresses a bug where tool calls in the analysis channel for gpt-oss models were not being handled correctly by the streaming parser. The proposed solution correctly extends the existing tool call parsing logic for the commentary channel to also include the analysis channel. The change in vllm/entrypoints/openai/serving_chat.py is well-targeted and correctly re-prioritizes the logic to check for tool calls in the analysis channel before treating it as reasoning content. A new complex streaming test case has been added in tests/entrypoints/openai/test_serving_chat.py, which effectively validates the fix. The changes are sound and I have not identified any issues of high or critical severity.
💡 Codex Review
Here are some automated review suggestions for this pull request.
    @pytest.mark.asyncio
    async def test_gpt_oss_chat_tool_call_streaming_complex(gptoss_client: OpenAI):
        file_path = os.path.join(os.path.dirname(__file__), "tools_gpt-oss.json")
        with open(file_path) as f:
            request_body = json.load(f)

        request_body["model"] = GPT_OSS_MODEL_NAME
        request_body["extra_body"] = {"reasoning_effort": "low"}

        stream = await gptoss_client.chat.completions.create(**request_body)

        name = None
        args_buf = ""
        content_buf = ""
        async for chunk in stream:
            delta = chunk.choices[0].delta
            if delta.tool_calls:
                tc = delta.tool_calls[0]
                if tc.function and tc.function.name:
                    name = tc.function.name
                if tc.function and tc.function.arguments:
                    args_buf += tc.function.arguments
            if getattr(delta, "content", None):
                content_buf += delta.content

        assert name is not None
        assert len(args_buf) > 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complex streaming test ignores tool-parser parametrization
The newly added test_gpt_oss_chat_tool_call_streaming_complex uses the gptoss_client fixture, which is parametrized over with_tool_parser and therefore runs both with and without the --tool-call-parser flag. The test, however, always sends a request containing tools and unconditionally asserts that a tool name and arguments are returned. When the fixture runs with with_tool_parser=False, the server is started without tool parsing support, so the request will either fail or stream plain text instead of tool deltas, causing this test to fail. The test should either accept the with_tool_parser fixture and skip/assert appropriately when it is False, or drop the parametrization for this case.
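One possible shape of the first option (a sketch, assuming the existing with_tool_parser parametrization is exposed to the test as a boolean fixture; the actual fixture wiring in the test module may differ):

```python
import pytest
from openai import OpenAI


@pytest.mark.asyncio
async def test_gpt_oss_chat_tool_call_streaming_complex(
    gptoss_client: OpenAI, with_tool_parser: bool
):
    # Skip the parametrized run whose server was started without --tool-call-parser,
    # since no tool-call deltas can be expected from it.
    if not with_tool_parser:
        pytest.skip("server started without --tool-call-parser")
    ...  # rest of the test body unchanged
```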
Hrm I feel like the issue here is actually that we are putting these tool calls on the analysis channel in the first place. Do you know why that happens, vs. it always being on the commentary channel? I couldn't figure it out from a quick look at the code.
See the OpenAI docs and the limited docs and comments in the harmony repo. The problem is that you cannot force where the model puts the recipient, unless you use some guided decoding. For that reason, harmony allows for both. While it usually appears on the commentary channel, rendering previous tool calls to the analysis channel (which is the choice the harmony lib made; see the comments) causes the model to switch as well. But it wouldn't be a solution for every case anyway, as internal tools seem to be emitted on the analysis channel (see the OpenAI docs). As long as you don't get a different answer from OpenAI on the respective Harmony PR, this fix seems to be the right solution to me. And I guess other frameworks are doing it the same way, but I didn't check.
BTW I completely agree this is a problem in general, I just want to make sure we solve it the right way.
Hrm I'm a bit confused: on the input path, the channel a tool call is rendered onto is decided by our code, which should always be commentary ATM. The more nefarious thing here is that any message to the …

I think for now the most reasonable way to do it is that on the input path, if the tool name starts with …

I think if recipient is set on the harmony message on the output path, it is fair to treat it as a tool call no matter what; it doesn't matter which channel it is on, for sure.
You are right, and my description was not really correct in mixing the rendering with the output. The rendering "issue" (not sure if it's really an issue) is that harmony renders the recipient before the channel token. So when we use a commentary message, it will first render the recipient and then the channel token. This seems to be what leads to the model switching to outputting tool calls in the analysis channel (well, I am speculating here, so maybe that's not right) and what this PR is trying to solve.

However, even if we could change that, tools may still appear in any channel and we would want to handle them in any case. So I would say it doesn't affect this PR; I am only attempting to explain what might be happening here.

Also, I don't know if such a change to the rendering is a good idea, as it may affect model performance (assuming harmony was used during training; the impact may or may not be negligible) and prefix caching (if the model constantly outputs x and harmony renders y, then the cache is gone from that location). So a decision on whether to change it is more complex and should take that into account.
Didn't know that, but I think
Sounds reasonable to me.
Makes sense! Do you mean changing the solution here or keeping it as is? Also see the code for non-streaming in OpenAIToolParser, which is what you describe.
Based on my observations from various tests (although I'm not certain), I believe that built-in tool calls and function tool calls are distinguished by whether the tool invocation is written before or after the channel token. For example, the structure '<|start|>assistant to=' calls built-in tools, while the structure '<|channel|>commentary to=' proceeds with function tool calls. It appears that confusion occurs when function tools are provided to the gpt-oss model using the built-in tool call structure.
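To make the two structures easier to compare, here is a rough side-by-side; the token text is paraphrased from the observations in this thread (not produced by the harmony library) and the tool names are invented:

```python
# Illustrative only: the recipient placement is what distinguishes the two layouts.
builtin_style = (  # recipient appears before the channel token
    "<|start|>assistant to=browser.search<|channel|>analysis"
    '<|message|>{"query": "weather Berlin"}<|call|>'
)
function_style = (  # recipient appears after the channel token
    "<|start|>assistant<|channel|>commentary to=functions.get_weather"
    '<|message|>{"city": "Berlin"}<|call|>'
)
```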
Ah, thanks @levunet, that might explain things more: the built-in tools are in the analysis channel (as per the docs) and usually invoked with the recipient before the channel token, as you found out. If harmony now renders all tool calls that way without distinguishing, then the model may also switch to placing the recipient before the channel and, because of that, also switch to emitting tool calls in the analysis channel.

That would mean the ideal tool rendering (for the current model) would also distinguish between the channels and place the recipient accordingly, and that way keep the output from the model as is (avoiding the prefix-cache cut-off for that message) and avoid the model confusion. Probably not a very important fix, but it could be useful. Given that it only affects tool calls, the impact on model performance might be very small or nonexistent. In vLLM we might then also make this distinction and generate analysis messages for the internal tools (which are all commentary atm, I assume).
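If vLLM were to make that input-path distinction, a minimal sketch could look like this; the built-in namespaces listed are assumptions based on the gpt-oss built-in tools (browser, python), and the function name is invented:

```python
# Hypothetical sketch: pick the channel for rendering a historical tool call from its
# recipient, so built-in tools land on analysis and function tools stay on commentary.
BUILTIN_RECIPIENT_PREFIXES = ("browser.", "python")  # assumed built-in namespaces


def channel_for_tool_call(recipient: str) -> str:
    if recipient.startswith(BUILTIN_RECIPIENT_PREFIXES):
        return "analysis"
    return "commentary"


assert channel_for_tool_call("browser.search") == "analysis"
assert channel_for_tool_call("functions.get_weather") == "commentary"
```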
Purpose
The gpt-oss streaming parser does not handle tool calls in the analysis channel. However, those calls can happen as explained here.
This especially happens when harmony renders the recipient in previous tool calls before the channel token, "confusing" the model, which then returns tool calls in the analysis channel. However, the harmony parser is built to handle both cases, such that the vLLM parser should also support both. An earlier version of the fix proposed in this PR aimed at changing the message rendering in harmony (for converting the history to tokens) to avoid confusing the model; while that fixes the issue, it seems to not address the root cause: even with such a change, the model may emit tool calls in the analysis channel, as described in the OpenAI docs.
To solve this, I propose to also handle tool calls in the analysis channel.
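As a rough illustration of that rule (a minimal sketch, not vLLM's actual serving code; the message type and field names are stand-ins): a parsed message that carries a recipient is treated as a tool call regardless of whether it arrived on the commentary or the analysis channel, and only recipient-less analysis messages remain reasoning content.

```python
from dataclasses import dataclass


@dataclass
class ParsedMessage:        # stand-in for the harmony parser's message object
    channel: str            # "analysis", "commentary", or "final"
    recipient: str | None   # e.g. "functions.get_weather" when the model calls a tool
    content: str


def classify(msg: ParsedMessage) -> str:
    # A recipient means a tool call, no matter which channel it was emitted on.
    if msg.recipient is not None and msg.channel in ("commentary", "analysis"):
        return "tool_call"
    if msg.channel == "analysis":
        return "reasoning"  # plain analysis text stays reasoning content
    return "content"        # final-channel text goes to the user


assert classify(ParsedMessage("analysis", "functions.get_weather", "{}")) == "tool_call"
assert classify(ParsedMessage("analysis", None, "thinking...")) == "reasoning"
```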
Test Plan
TODO: I added an automated test using an actual request that reliably fails without the fix, but it requires a large context window, and I assume that might not be acceptable for running these tests. Not sure how to handle this; reducing the request size doesn't work. Ideally this would be unit tested, but that seems like quite some effort given the current structure of the code.