[Structured Output][Reasoning] Improves decoding throughput for models using single-token reasoning endings. #30056
base: main
Conversation
Documentation preview: https://vllm--30056.org.readthedocs.build/en/30056/
Force-pushed 49dcbe7 to cfd844c (compare)
Code Review
This pull request introduces a significant performance improvement for structured output with single-token reasoning endings by adding the is_reasoning_end_on_decode_step method. The changes are well-structured, with updates to the core interface, implementations for various parsers, corresponding tests, and documentation. The benchmark results clearly demonstrate the effectiveness of the optimization. My main feedback is a critical performance issue in the olmo3_reasoning_parser where a major optimization opportunity was missed, which I've detailed in a specific comment.
        return self.think_end in text

    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
        return self.is_reasoning_end(input_ids)
Calling self.is_reasoning_end() here is highly inefficient as it decodes the entire input_ids sequence on every step. This negates the performance benefits of this PR for olmo3 models.
A more performant implementation would avoid decoding the full sequence by checking only a suffix of the input_ids. Since this method is called at each step to detect the first appearance of the end marker, checking a suffix is a safe and efficient heuristic.
Suggested change:

-        return self.is_reasoning_end(input_ids)
+        # Avoid decoding the whole sequence by checking only a suffix.
+        suffix_len = 32
+        if len(input_ids) < suffix_len:
+            text_to_check = self.model_tokenizer.decode(input_ids)
+        else:
+            text_to_check = self.model_tokenizer.decode(input_ids[-suffix_len:])
+        return self.think_end in text_to_check
💡 Codex Review
Here are some automated review suggestions for this pull request.
Amazing!
docs/features/reasoning_outputs.md (Outdated)
    def is_reasoning_end(self, input_ids: list[int]) -> bool:
        return self.end_token_id in input_ids

    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
I suggest adding a **kwargs parameter to is_reasoning_end. What do you think?
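For illustration only, a backward-compatible version of that signature might look like the sketch below; the step keyword and its "decode" value are hypothetical, not an agreed-upon API, and end_token_id comes from the base-parser snippet quoted above.

```python
def is_reasoning_end(self, input_ids: list[int], **kwargs) -> bool:
    # Existing callers that pass only input_ids keep the old behaviour;
    # extra keyword hints (e.g. a hypothetical step="decode") can simply be
    # ignored by parsers that have no cheaper per-step check.
    return self.end_token_id in input_ids
```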
We know there are some reasoning_parser plugins maintained outside of vLLM. I'm concerned that this change might be too aggressive.
         # Check if reasoning ends in *this* step
-        if self.reasoner.is_reasoning_end(request.all_token_ids):
+        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
Suggested change:

-        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
+        if self.reasoner.is_reasoning_end(request.all_token_ids, step="decode"):
WDYT?
Thanks for the suggestion! If plugins use this interface outside of vLLM, I agree that adding a new method would break them.
I initially hesitated between your approach and mine.
Either we pass more information to the is_reasoning_end function to stay backward-compatible. However, looking at the code, the same class is used for two different purposes: at the frontend level (extract_... + is_reasoning_end) but also in the EngineCore process. Maybe another approach could be to have a dedicated class for structured output (instantiated per request) that would let us apply the same optimization to multi-token end reasoning too. WDYT?
On the engine-core side, we would initiate the ReasoningParser -> get_structured_output_reasoning_checker() with a backward-compatibility check if the function is not implemented by external plugins.
I will update my PR with your suggested changes.
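A rough sketch of that engine-core fallback, using the get_structured_output_reasoning_checker name proposed above (an idea from this discussion, not merged code):

```python
def get_structured_output_reasoning_checker(reasoner):
    # Prefer the new per-step method if the parser (or an external plugin)
    # implements it; otherwise fall back to the existing full-sequence check.
    checker = getattr(reasoner, "is_reasoning_end_on_decode_step", None)
    return checker if checker is not None else reasoner.is_reasoning_end
```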
Either we pass more information to the is_reasoning_end function to stay backward-compatible. However, looking at the code, the same class is used for two different purposes: at the frontend level (extract_... + is_reasoning_end) but also in the EngineCore process. Maybe another approach could be to have a dedicated class for structured output (instantiated per request) that would let us apply the same optimization to multi-token end reasoning too. WDYT?
Yes, I completely agree with your point. In fact, I have previously tried to optimize is_reasoning_end, but I encountered a difficult issue: as you mentioned, is_reasoning_end is used in multiple places, especially in the frontend + streaming scenario. See #25735 (comment).
Given that is_reasoning_end now has such a significant impact on performance, I believe it's time to address this issue.
As I just mentioned, I'm concerned that overly aggressive changes could break downstream dependencies. I hope we can find a more backward-compatible way to implement the modifications.
Thanks for your answer. This design suggestion is more for a future version, but indeed it would be breaking.
However, if I add a kwargs argument to is_reasoning_end, external plugins will still be broken, no?
Otherwise, we could provide a default implementation:
def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
    """
    Check if the reasoning content ends in the input_ids on a
    decode step.

    It is used in structured engines like `xgrammar` to check if the
    reasoning content ends in the model output before applying the
    structured output.

    Notes:
    - The first time the reasoning content ends during a decode step, this
      method returns True. StructuredOutputManager then caches the result.
    - Subsequent decode steps for the same reasoning segment can return
      False or True.

    Parameters:
    input_ids: list[int]
        The input_ids of the model output at the current decode step.

    Returns:
    bool
        True if the reasoning content ends in the input_ids on a
        decode step.
    """
    # Default implementation: fall back to the existing full-sequence check
    # so parsers (including external plugins) that do not override this
    # method keep working.
    return self.is_reasoning_end(input_ids)
WDYT?
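With such a default in place, a single-token parser like BaseThinkingReasoningParser could override it with a constant-time check; a minimal sketch, assuming the parser exposes end_token_id as in the snippet quoted earlier:

```python
def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
    # On a regular decode step only one new token was appended, so checking
    # the last token is enough to detect the end-of-reasoning marker.
    return bool(input_ids) and input_ids[-1] == self.end_token_id
```

As discussed further down, this relies on the one-generated-token-per-step assumption, which speculative decoding breaks.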
go ahead
         # Check if reasoning ends in *this* step
-        if self.reasoner.is_reasoning_end(request.all_token_ids):
+        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
Suggested change:

-        if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
+        if self.reasoner.is_reasoning_end(request.all_token_ids[request.num_computed_tokens:]):
@hdlj-h 🤔 Perhaps this should be handled this way. WDYT?
We only need to check the newly generated tokens, not the tokens that have already been computed.
Indeed, that is also an option, but some reasoning parsers rely on multi-token end markers, so the assumption that all parsers are single-token–based does not always hold.
For example:
• OLMo3 operates in string space rather than token space. It may not be compatible with structured output for this reason.
• Granite also works in string space and does not implement is_reasoning_end, meaning it could crash if used with structured output.
• Hunyuan maintains an internal state machine for reasoning, which likely makes it incompatible with structured output + reasoning.
Can num_computed_tokens grow by more than one token per step in V1? Or is it the output-token counter? If so, my PR is not bulletproof, because I assume that a decoding step == 1 generated token.
Can num_computed_tokens grow by more than one token per step in V1?
Yes.
OLMo3, Granite, and Hunyuan — I think we can set these aside for now. They are inherently incompatible with structured output to begin with.
Therefore, we should ignore them for the time being and optimize them separately later.
Can num_computed_tokens grow by more than one token per step in V1?
Yes.
Ah, I didn't take the speculative decoding case into account, so my PR is not correct when more than one token is generated per step.
I will revert my change and apply your approach instead, since it is a smaller change.
@chaunceyjiang Regarding num_computed_tokens, I can't find the related docs: is it the number of tokens computed before the step, or the number of tokens computed during the step?
chaunceyjiang left a comment
Thanks~
Can you help test the benchmark again?
Force-pushed 5a6d588 to 7192b99 (compare)
-        if self.reasoner.is_reasoning_end(request.all_token_ids):
+        if self.reasoner.is_reasoning_end(
+            request.all_token_ids[request.num_computed_tokens :]
+        ):
The e2e test failed because gpt-oss uses multiple tokens in reasoning_end_token_ids for judgment.
This approach doesn't seem to work well either.
Currently, there are two solutions. Which one do you prefer?
- Special handling for gpt-oss?
- Use your initial approach: add an is_reasoning_end_streaming method?
Thanks for catching the gpt-oss issue. If we special-case gpt-oss, we could hit the same situation with other parsers that rely on multi-token markers.
For the second approach, I see two possible interfaces with a default implementation for backward compatibility:
is_reasoning_end_streaming(input_ids: list[int], diff: list[int])
is_reasoning_end_streaming(input_ids: list[int], num_computed_tokens: int)
I think the first signature is better because it could be reused for frontend purposes, while the second one avoids a list copy at each decode step, but that is probably negligible. WDYT?
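As a sketch, the first signature with a backward-compatible default could look like this (names follow the comment above; not merged code):

```python
def is_reasoning_end_streaming(self, input_ids: list[int], diff: list[int]) -> bool:
    # Default: reuse the existing full-sequence check so external parser
    # plugins keep working unchanged. Single-token parsers can override this
    # to scan only `diff`, the tokens produced in the current step.
    return self.is_reasoning_end(input_ids)
```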
…endings. Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
…oding step Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
Force-pushed bc56849 to c7d58f4 (compare)
@chaunceyjiang I have benchmarked the new approach, but I observed a 10% slowdown in decoding throughput if we support interleaved reasoning/content in the model output. Since that is not really supported by the structured output manager, I changed the implementation to only check end_token_id in the delta. I am building the new image and will keep you posted with the new benchmark results.
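A minimal sketch of that delta-only check, assuming num_computed_tokens marks the tokens already processed before the current step and that the parser exposes a single end_token_id:

```python
# Only the tokens generated in the current step need to be scanned for the
# end-of-reasoning token.
delta = request.all_token_ids[request.num_computed_tokens:]
reasoning_ended = self.reasoner.end_token_id in delta
```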
Force-pushed 264a4dc to d876ed3 (compare)
Hi @hdlj-h, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
Signed-off-by: hdlj-h <hubert@hcompany.ai>
Force-pushed d876ed3 to ae851f4 (compare)
Purpose

This PR introduces is_reasoning_end_on_decode_step to the ReasoningParser interface, implements it in all reasoning parsers, and uses it in the StructuredOutputManager.

Previously, is_reasoning_end checked for the reasoning end token by scanning all input_ids, even for models that rely on a single token to end reasoning. This was called at every decoding step in the StructuredOutputManager and could be inefficient for models that end reasoning with a single token (long reasoning + structured output).

The new method checks only the last token of the current decoding step (BaseThinkingReasoningParser).

For multi-token end reasoning (like gpt-oss), this PR preserves the previous behaviour by calling is_reasoning_end and does not bring improvements.

After this PR: #30048. I will add the required changes to Holo2ReasoningParser.

Test Plan

- tests/reasoning/test_base_thinking_reasoning_parser.py
- Benchmark with extra-body to activate the structured output manager and a logit bias to force the model to only think.
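As an illustration, such a benchmark request could look like the sketch below. The model name, schema, and end-of-thinking token id are placeholders, and the structured-output field in extra_body depends on the vLLM version; this is not the exact script used for the numbers below.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
END_THINK_TOKEN_ID = 123456  # placeholder: model-specific end-of-thinking token

response = client.chat.completions.create(
    model="my-reasoning-model",  # placeholder
    messages=[{"role": "user", "content": "Answer as JSON."}],
    max_tokens=4096,
    # A strong negative bias on the end-of-thinking token keeps the model in
    # the reasoning phase, stressing the per-step reasoning-end check.
    logit_bias={str(END_THINK_TOKEN_ID): -100},
    extra_body={"guided_json": schema},  # field name varies by vLLM version
)
```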
Test Result

main:
This PR: