
Conversation

@hdlj-h
Contributor

@hdlj-h hdlj-h commented Dec 4, 2025

Purpose

This PR introduces is_reasoning_end_on_decode_step to the ReasoningParser interface, implements it in all reasoning parsers, and uses it in the StructuredOutputManager.

Previously, is_reasoning_end checked for the reasoning end token at every decode step by scanning
all input_ids, even for models that rely on a single token to end reasoning.

def is_reasoning_end(self, input_ids: list[int]) -> bool:
    end_token_id = self.end_token_id
    return any(input_id == end_token_id for input_id in reversed(input_ids))

This was called at every decoding step in the StructuredOutputManager and could be inefficient for models that end reasoning with a single token (long reasoning + structured output).

The new method checks only the last token of the current decoding step (BaseThinkingReasoningParser):

def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
    return len(input_ids) > 0 and self.end_token_id == input_ids[-1]
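
For context, here is a rough sketch of how the manager-side check can use this (simplified; reasoning_ended and the grammar step are placeholders, not the exact vLLM code):

# Simplified sketch of the per-request check in the StructuredOutputManager.
# Once reasoning ends, the result is cached so later steps skip the scan.
if not request.reasoning_ended:
    request.reasoning_ended = self.reasoner.is_reasoning_end_on_decode_step(
        request.all_token_ids
    )
if request.reasoning_ended:
    ...  # from this point on, the structured output grammar is applied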

For multi-token end reasoning (like gpt-oss), this PR preserves the previous behaviour by calling is_reasoning_end and does not bring improvements.

After this PR and #30048, I will add the required changes to Holo2ReasoningParser.

Test Plan

  1. Add a test for this new interface method in tests/reasoning/test_base_thinking_reasoning_parser.py
pytest tests/reasoning/test_base_thinking_reasoning_parser.py 
  2. Benchmark vLLM on Holo2 8B with and without the change in a scenario with long reasoning and structured output. We use extra-body to activate the structured output manager and add a logit bias to force the model to only think.
MODEL=holo2-8b vllm bench serve \
    --dataset-name 'random-mm' \
    --backend 'openai-chat' \
    --tokenizer 'Qwen/Qwen2-7B-Instruct' \
    --endpoint '/v1/chat/completions' \
    --ignore-eos \
    --metric-percentiles '25,50,75,90,95,99' \
    --percentile-metrics 'ttft,tpot,itl,e2el' \
    --goodput 'e2el:10000' \
    --save-detailed \
    --save-result \
    --extra-body '{ "guided_json": { "properties": {"city": {"title": "City", "type": "string"}}, "required": ["city"], "title": "Output", "type": "object" }, "logit_bias": { "151668": -100 } }' \
    --model $MODEL \
    --base-url http://$MODEL:8000 \
    --seed '1764348242' \
    --request-rate 'inf' \
    --random-input-len '7000' \
    --random-output-len '3000' \
    --max-concurrency '20' \
    --num-prompts '100'

Test Result

  1. BaseThinkingReasoningParser tests are passing
================================================================================== test session starts ===================================================================================
platform darwin -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/hubert/code/hdlj-vllm
configfile: pyproject.toml
plugins: anyio-4.11.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 22 items                                                                                                                                                                       

tests/reasoning/test_base_thinking_reasoning_parser.py ......................      
  2. Bench results
    main
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             20
Benchmark duration (s):                  521.35
Total input tokens:                      700000
Total generated tokens:                  300000
Request throughput (req/s):              0.19
Request goodput (req/s):                 0.00
Output token throughput (tok/s):         575.43
Peak output token throughput (tok/s):    700.00
Peak concurrent requests:                26.00
Total Token throughput (tok/s):          1918.09
---------------Time to First Token----------------
Mean TTFT (ms):                          1562.85
Median TTFT (ms):                        1099.16
P25 TTFT (ms):                           1062.06
P50 TTFT (ms):                           1099.16
P75 TTFT (ms):                           1379.54
P90 TTFT (ms):                           3577.44
P95 TTFT (ms):                           4618.21
P99 TTFT (ms):                           5682.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.23
Median TPOT (ms):                        34.34
P25 TPOT (ms):                           34.04
P50 TPOT (ms):                           34.34
P75 TPOT (ms):                           34.45
P90 TPOT (ms):                           34.50
P95 TPOT (ms):                           34.59
P99 TPOT (ms):                           34.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.25
Median ITL (ms):                         33.13
P25 ITL (ms):                            30.91
P50 ITL (ms):                            33.13
P75 ITL (ms):                            35.33
P90 ITL (ms):                            36.58
P95 ITL (ms):                            37.27
P99 ITL (ms):                            42.25
----------------End-to-end Latency----------------
Mean E2EL (ms):                          104217.78
Median E2EL (ms):                        104375.43
P25 E2EL (ms):                           104054.62
P50 E2EL (ms):                           104375.43
P75 E2EL (ms):                           104696.93
P90 E2EL (ms):                           105622.35
P95 E2EL (ms):                           106714.51
P99 E2EL (ms):                           107775.61
==================================================

This PR:

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             20
Benchmark duration (s):                  291.45
Total input tokens:                      700000
Total generated tokens:                  300000
Request throughput (req/s):              0.34
Request goodput (req/s):                 0.00
Output token throughput (tok/s):         1029.32
Peak output token throughput (tok/s):    1240.00
Peak concurrent requests:                27.00
Total Token throughput (tok/s):          3431.06
---------------Time to First Token----------------
Mean TTFT (ms):                          1613.74
Median TTFT (ms):                        1290.23
P25 TTFT (ms):                           1032.16
P50 TTFT (ms):                           1290.23
P75 TTFT (ms):                           1540.27
P90 TTFT (ms):                           3590.01
P95 TTFT (ms):                           4826.16
P99 TTFT (ms):                           5603.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.88
Median TPOT (ms):                        18.89
P25 TPOT (ms):                           18.80
P50 TPOT (ms):                           18.89
P75 TPOT (ms):                           19.08
P90 TPOT (ms):                           19.18
P95 TPOT (ms):                           19.24
P99 TPOT (ms):                           19.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.96
Median ITL (ms):                         17.82
P25 ITL (ms):                            16.99
P50 ITL (ms):                            17.82
P75 ITL (ms):                            18.66
P90 ITL (ms):                            19.18
P95 ITL (ms):                            19.42
P99 ITL (ms):                            23.65
----------------End-to-end Latency----------------
Mean E2EL (ms):                          58247.72
Median E2EL (ms):                        58171.11
P25 E2EL (ms):                           57882.49
P50 E2EL (ms):                           58171.11
P75 E2EL (ms):                           58495.73
P90 E2EL (ms):                           60122.58
P95 E2EL (ms):                           61367.37
P99 E2EL (ms):                           62141.97
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify

mergify bot commented Dec 4, 2025

Documentation preview: https://vllm--30056.org.readthedocs.build/en/30056/

@mergify mergify bot added documentation Improvements or additions to documentation deepseek Related to DeepSeek models structured-output labels Dec 4, 2025
@mergify mergify bot added the v1 label Dec 4, 2025
@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from 49dcbe7 to cfd844c Compare December 4, 2025 11:29
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance improvement for structured output with single-token reasoning endings by adding the is_reasoning_end_on_decode_step method. The changes are well-structured, with updates to the core interface, implementations for various parsers, corresponding tests, and documentation. The benchmark results clearly demonstrate the effectiveness of the optimization. My main feedback is a critical performance issue in the olmo3_reasoning_parser where a major optimization opportunity was missed, which I've detailed in a specific comment.

        return self.think_end in text

    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
        return self.is_reasoning_end(input_ids)
Contributor

high

Calling self.is_reasoning_end() here is highly inefficient as it decodes the entire input_ids sequence on every step. This negates the performance benefits of this PR for olmo3 models.

A much more performant implementation should be used that avoids decoding the full sequence by checking only a suffix of the input_ids. Since this method is called at each step to detect the first appearance of the end marker, checking a suffix is a safe and efficient heuristic.

Suggested change
        return self.is_reasoning_end(input_ids)
        # Avoid decoding the whole sequence by checking only a suffix.
        suffix_len = 32
        if len(input_ids) < suffix_len:
            text_to_check = self.model_tokenizer.decode(input_ids)
        else:
            text_to_check = self.model_tokenizer.decode(input_ids[-suffix_len:])
        return self.think_end in text_to_check

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@chaunceyjiang chaunceyjiang self-assigned this Dec 4, 2025
@chaunceyjiang
Collaborator

Total Token throughput (tok/s): 1918.09
Total Token throughput (tok/s): 3431.06

Amazing! is_reasoning_end is a known issue, but I didn't expect it to have such a huge impact!!!

def is_reasoning_end(self, input_ids: list[int]) -> bool:
    return self.end_token_id in input_ids

def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
Collaborator

I suggest adding a **kwargs parameter to is_reasoning_end. What do you think?
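
For illustration, the kwargs-based variant being suggested could look roughly like this (a sketch of the idea only, not merged code):

def is_reasoning_end(self, input_ids: list[int], **kwargs) -> bool:
    # Hypothetical: a "step" kwarg switches to the cheap last-token check
    # during decode, while the default keeps the existing full scan.
    if kwargs.get("step") == "decode":
        return len(input_ids) > 0 and input_ids[-1] == self.end_token_id
    return self.end_token_id in input_ids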

Collaborator

It's currently known that there are some reasoning_parser plugins outside of vLLM. I'm concerned that this change might be too aggressive.


# Check if reasoning ends in *this* step
if self.reasoner.is_reasoning_end(request.all_token_ids):
if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
Collaborator

@chaunceyjiang chaunceyjiang Dec 4, 2025

Suggested change
if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
if self.reasoner.is_reasoning_end(request.all_token_ids, step="decode"):

WDYT?

Contributor Author

Thanks for the suggestion! If plugins use this interface externally, I agree that adding a new method could break them.
I was initially hesitating between your approach and mine.

One option is to provide more information to the is_reasoning_end function so it stays backward-compatible. However, looking at the code, the same class is used for two different purposes: at the frontend level (extract_... + is_reasoning_end) but also in the EngineCore process. Maybe another approach could be a dedicated class for structured output (instantiated per request), which would let us apply the same approach to multi-token end reasoning too. WDYT?

On the engine core side, we would call ReasoningParser -> get_structured_output_reasoning_checker(), with a backward-compatibility check if the function is not implemented by external plugins.

I will update my PR with your suggested changes.

Collaborator

One option is to provide more information to the is_reasoning_end function so it stays backward-compatible. However, looking at the code, the same class is used for two different purposes: at the frontend level (extract_... + is_reasoning_end) but also in the EngineCore process. Maybe another approach could be a dedicated class for structured output (instantiated per request), which would let us apply the same approach to multi-token end reasoning too. WDYT?

Yes, I completely agree with your point. In fact, I have previously tried to optimize is_reasoning_end, but I encountered a difficult issue: as you mentioned, is_reasoning_end is used in multiple places, especially in the frontend + streaming scenario. See #25735 (comment).

Given that is_reasoning_end now has such a significant impact on performance, I believe it's time to address this issue.

As I just mentioned, I'm concerned that overly aggressive changes could break downstream dependencies. I hope we can find a more backward-compatible way to implement the modifications.

Contributor Author

Thanks for your answer; this design suggestion is more for a future version, but indeed it would be breaking.

However, if I add a kwargs argument to is_reasoning_end, external plugins will still be broken, no?

Otherwise, we could provide a default implementation:

    @abstractmethod
    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
        """
        Check if the reasoning content ends in the input_ids on a
        decode step.

        It is used in structured engines like `xgrammar` to check if the
        reasoning content ends in the model output before applying the
        structured output.

        Notes:
            - The first time the reasoning content ends during a decode step, this
            method returns True. StructuredOutputManager then caches the result.
            - Subsequent decode steps for the same reasoning segment can return
            False or True.

        Parameters:
        input_ids: list[int]
            The input_ids of the model output at the current decode step.

        Returns:
        bool
            True if the reasoning content ends in the input_ids on a
            decode step.
        """
        return self.is_reasoning_end(input_ids)

WDYT ?

Collaborator

go ahead


# Check if reasoning ends in *this* step
if self.reasoner.is_reasoning_end(request.all_token_ids):
if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
Collaborator

Suggested change
if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
if self.reasoner.is_reasoning_end(request.all_token_ids[request.num_computed_tokens:]):

Collaborator

@hdlj-h 🤔 Perhaps this should be handled this way. WDYT?

Collaborator

We only need to compute the newly generated tokens, not the tokens that have already been computed.

Contributor Author

@hdlj-h hdlj-h Dec 4, 2025

Indeed, that is also an option, but some reasoning parsers rely on multi-token end markers, so the assumption that all parsers are single-token–based does not always hold.

For example:
• OLMo3 operates in string space rather than token space. It may not be compatible with structured output for this reason.
• Granite also works in string space and does not implement is_reasoning_end, meaning it could crash if used with structured output.
• Hunyuan maintains an internal state machine for reasoning, which likely makes it incompatible with structured output + reasoning.
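
For the multi-token case mentioned above, the check would have to match a whole marker sequence rather than a single id, roughly along these lines (a hypothetical sketch, not the actual gpt-oss parser code):

def contains_end_sequence(input_ids: list[int], end_token_ids: list[int]) -> bool:
    # Hypothetical helper: True if the multi-token end marker appears as a
    # contiguous run anywhere in the generated ids.
    n = len(end_token_ids)
    return n > 0 and any(
        input_ids[i : i + n] == end_token_ids
        for i in range(len(input_ids) - n + 1)
    )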

num_computed_tokens can be greater than 1 in V1? Or is it the output token counter? If that is the case, my PR is not bulletproof, because I assume that one decoding step == 1 generated token.

Collaborator

num_computed_tokens can be greater than 1 in the V1 ?

Yes.

Collaborator

OLMo3, Granite, and Hunyuan — I think we can set these aside for now. They are inherently incompatible with structured output to begin with.

Therefore, we should ignore them for the time being and optimize them separately later.

Contributor Author

@hdlj-h hdlj-h Dec 4, 2025

num_computed_tokens can be greater than 1 in the V1 ?

Yes.

Ah, I didn't take into account the speculative decoding case. So my PR is not compatible with situations where num_computed_tokens > 1.

So I will revert my change and apply your approach instead, which is a smaller change.

Contributor Author

@chaunceyjiang Regarding num_computed_tokens, I can't find the related docs: is it the number of computed tokens before the step, or the number of computed tokens during the step?

Collaborator

@chaunceyjiang chaunceyjiang left a comment

Thanks~

Can you help test the benchmark again?

@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 4, 2025
@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from 5a6d588 to 7192b99 Compare December 4, 2025 16:31
if self.reasoner.is_reasoning_end(request.all_token_ids):
if self.reasoner.is_reasoning_end(
    request.all_token_ids[request.num_computed_tokens :]
):
Collaborator

The e2e test failed because gpt-oss uses multiple tokens in reasoning_end_token_ids for judgment.

This approach doesn't seem to work well either.

Currently, there are two solutions. Which one do you prefer?

  1. Special handling for gpt-oss?
  2. Use your initial approach—add an is_reasoning_end_streaming method?

Contributor Author

Thanks for detecting the gpt-oss issue. If we add special handling for gpt-oss, we could end up in the same situation with other parsers that are multi-token based.

For the 2nd approach, I see two possible interfaces with a default implementation for backward compatibility:

is_reasoning_end_streaming(input_ids: list[int], diff: list[int])
is_reasoning_end_streaming(input_ids: list[int], num_computed_tokens: int)

I think the first signature is better because it could be reusable for frontend purposes, but the second one avoids a list copy at each decode step, though that is maybe negligible. WDYT?
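
To make the first option concrete, a rough sketch with a backward-compatible default could look like this (the method name follows the proposal above; none of this is merged code):

# On the ReasoningParser base class: backward-compatible default that
# ignores the delta and keeps the existing full scan.
def is_reasoning_end_streaming(self, input_ids: list[int], diff: list[int]) -> bool:
    return self.is_reasoning_end(input_ids)

# On BaseThinkingReasoningParser (single end token): only the tokens
# generated in this step are scanned.
def is_reasoning_end_streaming(self, input_ids: list[int], diff: list[int]) -> bool:
    return self.end_token_id in diff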

@mergify

mergify bot commented Dec 5, 2025

Documentation preview: https://vllm--30056.org.readthedocs.build/en/30056/

…endings.

Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
…oding step

Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from bc56849 to c7d58f4 Compare December 5, 2025 13:45
@hdlj-h
Contributor Author

hdlj-h commented Dec 5, 2025

@chaunceyjiang I have benchmarked the new approach, but I observed a 10% slowdown in decoding throughput if we support interleaved reasoning/content in the model output. Since this is not really supported by the structured output manager, I changed the implementation to only check the end_token_id in the delta. I am building the new image and will keep you posted with the new benchmark results.
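
For reference, the "check the end_token_id in the delta" approach described here could look roughly like this (a sketch under assumptions; the exact method name and call site may differ in the final change), where the delta would be request.all_token_ids[request.num_computed_tokens:] as in the earlier diff:

def is_reasoning_end_in_delta(self, delta: list[int]) -> bool:
    # Hypothetical name: only the tokens generated in this decode step are
    # scanned for the single reasoning end token.
    return self.end_token_id in delta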

@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from 264a4dc to d876ed3 Compare December 5, 2025 16:52
@mergify

mergify bot commented Dec 5, 2025

Hi @hdlj-h, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Signed-off-by: hdlj-h <hubert@hcompany.ai>
@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from d876ed3 to ae851f4 Compare December 5, 2025 17:24