Conversation

@levunet
Contributor

@levunet levunet commented Sep 16, 2025

Purpose

This PR requires the following PRs to be merged first: #24768 and the harmony lib PR (openai/harmony#76).

The changes add `with_recipient` and the assistant's 'analysis' content.
Without this content, the gpt-oss model had a higher probability of outputting abnormal tokens when calling tools.

Test Plan

gpt-oss_test.py
messages.txt

Run python3 gpt-oss_test.py about 10 times.

Test Result

(before)
[screenshot]

(after applying the harmony lib changes)
[screenshot]

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a bug in gpt-oss model tool calls by adding 'analysis' content and with_recipient. The changes are logical and align with the stated purpose. However, I've identified a high-severity issue where the handling of 'analysis' content is not robust for multimodal inputs, which could lead to a runtime crash. I have provided a code suggestion to address this.

Severity: high

The logic to extract content is not robust. The content of an assistant message can be a list of parts (e.g., for multimodal inputs), not just a string. The current implementation `content = chat_msg.get("content") or ""` will cause a runtime error if `content` is a non-empty list, because `Message.from_role_and_content` expects a string. The code should handle the case where `content` is a list by extracting the text parts, similar to how this is handled for other message types in this file.

        content = chat_msg.get("content")
        if isinstance(content, list):
            # Extract text from multimodal content
            content = "\n".join(
                p.get("text", "") for p in content
                if isinstance(p, dict) and p.get("type") == "text")
        elif not isinstance(content, str):
            content = ""

        analysis_msg = Message.from_role_and_content(Role.ASSISTANT, content)
        analysis_msg = analysis_msg.with_channel("analysis")
        msgs.append(analysis_msg)
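For illustration, the suggested extraction logic can be exercised standalone (the helper name below is mine, not part of the PR):

```python
def extract_text(content):
    """Collapse an OpenAI-style message `content` field to plain text.

    `content` may be a plain string or a list of typed parts, e.g.
    [{"type": "text", "text": "..."}, {"type": "image_url", ...}].
    Anything else (None, unexpected types) becomes an empty string.
    """
    if isinstance(content, list):
        # Keep only the text parts and join them with newlines.
        return "\n".join(
            p.get("text", "") for p in content
            if isinstance(p, dict) and p.get("type") == "text")
    return content if isinstance(content, str) else ""


print(extract_text("hello"))  # -> hello
print(extract_text([
    {"type": "text", "text": "first"},
    {"type": "image_url", "image_url": {"url": "http://example.com/x.png"}},
    {"type": "text", "text": "second"},
]))  # -> first\nsecond (text parts joined, image part skipped)
```

With the plain `content or ""` approach, the multimodal list would be passed through unchanged and crash inside `Message.from_role_and_content`.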

@levunet levunet force-pushed the feat/gptoss-tool-fix branch 2 times, most recently from c3780bf to eed8815 on September 16, 2025 09:19
@levunet
Contributor Author

levunet commented Sep 16, 2025

@alecsolder
In the harmony library, the string ' to' should have been encoded as token id 316, but occasionally it was encoded as the id pair 220+935 instead, which caused the model to output incorrect tokens.

@alecsolder
Contributor

alecsolder commented Sep 18, 2025

In your test script, I see you're using the streaming completions endpoint, which I don't think uses the harmony_utils method you modified? I just want to double check I'm reading it right

@alecsolder
Contributor

Also, thanks for mentioning that huggingface tokenizer change, I hadn't seen it and my snapshot was out of date!

@levunet
Contributor Author

levunet commented Sep 19, 2025

Thank you for checking. I've double-checked the part you mentioned.
I can confirm that the `parse_chat_input` I modified in `harmony_utils.py` is used without any issues to parse `request.messages` through the flow `/v1/chat/completions` -> `create_chat_completion` -> `_make_request_with_harmony` -> `parse_chat_input`.
The streaming endpoint also goes through this modified method.

@fouadbakkour

When will this PR be merged, guys?
We need this fix, please.

@youkaichao
Member

cc @chaunceyjiang @aarnphm

@levunet levunet force-pushed the feat/gptoss-tool-fix branch from 39d6e8e to e8e7579 on October 10, 2025 13:12
The changes include adding 'with_recipient' and the Assistant's 'analysis' content.
Without adding this content, there was an issue where the gpt-oss model had a higher probability of outputting abnormal tokens when calling tools.

Signed-off-by: kyt <eluban4532@gmail.com>
@levunet levunet force-pushed the feat/gptoss-tool-fix branch from e8e7579 to f13f45b on October 10, 2025 13:19
@levunet
Contributor Author

levunet commented Oct 10, 2025

I was delayed today by an issue I hit while resolving git conflicts: FlashInfer 0.4.0, added in recent commits, causes responses to repeat infinitely with the gpt-oss model. It took quite some time to find a workaround.

VLLM_USE_FLASHINFER_SAMPLER=0

The problem occurred in the sampling process, and it works normally when the FlashInfer sampler is disabled.
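
A minimal sketch of the workaround (only the environment variable comes from this thread; the serve command is illustrative):

```shell
# Disable the FlashInfer sampler (sampling-side workaround for the
# infinite-repetition issue mentioned above).
export VLLM_USE_FLASHINFER_SAMPLER=0

# Illustrative launch; substitute your own model and flags.
# vllm serve openai/gpt-oss-20b

echo "VLLM_USE_FLASHINFER_SAMPLER=$VLLM_USE_FLASHINFER_SAMPLER"
```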

@epark001

> I was delayed today due to some issues while resolving git conflicts. The issue was caused by FlashInfer 0.4.0, which was added in recent commits, causing responses to repeat infinitely in the gpt-oss model. It took quite some time to find a solution.
>
> VLLM_USE_FLASHINFER_SAMPLER=0
>
> The problem occurred in the sampling process, and it works normally when the FlashInfer sampler is disabled.

Will the fix need both `VLLM_USE_FLASHINFER_SAMPLER=0` and this PR, or just the FlashInfer setting?

@levunet
Contributor Author

levunet commented Oct 10, 2025

> VLLM_USE_FLASHINFER_SAMPLER=0

This setting resolves an issue that occurs in a completely different place from the current PR. After testing multiple times, the same problem appears even without this PR, so the setting can be applied independently.

@bbrowning bbrowning mentioned this pull request Oct 10, 2025
@ashgold

ashgold commented Oct 21, 2025

Any progress here?
Many people in my department want to use the tool calling feature with gpt-oss, but it seems that feature isn't fully implemented in vLLM yet.

@levunet
Contributor Author

levunet commented Oct 21, 2025

Unfortunately, the current changes alone are insufficient to fully resolve the bug; the changes from the harmony repo are essential. Since OpenAI hasn't reviewed that PR yet, should I propose pip-installing from my fork branch in the meantime?

@levunet
Contributor Author

levunet commented Oct 21, 2025

@aarnphm @chaunceyjiang

This PR fixes a bug that occurred when using tool calling with gpt-oss models. However, it requires additional changes from the openai/harmony repo, which unfortunately hasn't been reviewed for a month now. Would it be acceptable to temporarily modify the requirements to install from my forked harmony branch until the upstream PR is merged?

@ashgold

ashgold commented Oct 23, 2025

> @aarnphm @chaunceyjiang
>
> This PR fixes a bug that occurred when using tool calling with gpt-oss models. However, it requires additional changes from the openai/harmony repo, which unfortunately hasn't been reviewed for a month now. Would it be acceptable to temporarily modify the requirements to install from my forked harmony branch until the upstream PR is merged?

@levunet
Could you provide guidance on how to build using the modified version of the openai-harmony module?
I'd like to build the vLLM image with this PR and the modified openai-harmony module, then test the tool calling functionality.
I have experience building the vLLM v0.11.0 image in my environment.
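
One possible approach, sketched with placeholder fork and branch names (not the actual ones; harmony is a Rust project with Python bindings, so installing from source needs a Rust toolchain):

```
# requirements override (placeholder fork/branch — substitute the real ones)
openai-harmony @ git+https://github.com/<fork>/harmony.git@<branch>
```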

@dr75
Contributor

dr75 commented Oct 31, 2025

@levunet, could you give an example where tool calling does not work in v0.11.1? I couldn't find any so I wonder if this fix and the harmony fix is actually needed.

Basically #24768 makes tool calling work for me for chat completions in streaming mode.

However, there are two remaining issues:

  • using --tool-call-parser openai breaks normal non-streaming chat completions; so this flag must not be used right now (tool calling works without it)
  • tool parsing for non-streaming requests doesn't work in v0.11.0. This also doesn't work in v0.11.1 nor with the fix here and the harmony fix you mentioned.

@dr75
Contributor

dr75 commented Oct 31, 2025

OK, I could reproduce the cases with your script above and can confirm that the issue still happens on v0.11.1 and is fixed with this PR plus the harmony change.


Labels

frontend gpt-oss Related to GPT-OSS models

Projects

Status: To Triage
