
Conversation

@hdlj-h
Contributor

@hdlj-h hdlj-h commented Dec 4, 2025

Purpose

This PR introduces is_reasoning_end_on_decode_step to the ReasoningParser interface, implements it in all reasoning parsers, and uses it in the StructuredOutputManager.

Previously, is_reasoning_end checked for the reasoning end token at every decode step by scanning
all input_ids, even for models that rely on a single token to end reasoning.

def is_reasoning_end(self, input_ids: list[int]) -> bool:
    end_token_id = self.end_token_id
    return any(input_id == end_token_id for input_id in reversed(input_ids))

This was called at every decoding step in the StructuredOutputManager and could be inefficient for models that end reasoning with a single token (long reasoning + structured output).

The new method checks only the last token of the current decoding step (BaseThinkingReasoningParser):

def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
    return len(input_ids) > 0 and self.end_token_id == input_ids[-1]
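
For context, here is a rough sketch of how the manager-side check can use this (simplified; reasoning_ended and the grammar step are placeholders, not the exact vLLM code):

# Simplified sketch of the per-request check in the StructuredOutputManager.
# Once reasoning ends, the result is cached so later steps skip the scan.
if not request.reasoning_ended:
    request.reasoning_ended = self.reasoner.is_reasoning_end_on_decode_step(
        request.all_token_ids
    )
if request.reasoning_ended:
    ...  # from this point on, the structured output grammar is applied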

For multi-token end reasoning (like gpt-oss), this PR preserves the previous behaviour by calling is_reasoning_end and does not bring improvements.

After this PR and #30048, I will add the required changes to Holo2ReasoningParser.

Test Plan

  1. Add a test for this new interface method in tests/reasoning/test_base_thinking_reasoning_parser.py
pytest tests/reasoning/test_base_thinking_reasoning_parser.py 
  2. Benchmark vLLM on Holo2 8B with and without the change in a scenario with long reasoning and structured output. We use extra-body to activate the structured output manager and add a logit bias to force the model to only think.
MODEL=holo2-8b vllm bench serve \
    --dataset-name 'random-mm' \
    --backend 'openai-chat' \
    --tokenizer 'Qwen/Qwen2-7B-Instruct' \
    --endpoint '/v1/chat/completions' \
    --ignore-eos \
    --metric-percentiles '25,50,75,90,95,99' \
    --percentile-metrics 'ttft,tpot,itl,e2el' \
    --goodput 'e2el:10000' \
    --save-detailed \
    --save-result \
    --extra-body '{ "guided_json": { "properties": {"city": {"title": "City", "type": "string"}}, "required": ["city"], "title": "Output", "type": "object" }, "logit_bias": { "151668": -100 } }' \
    --model $MODEL \
    --base-url http://$MODEL:8000 \
    --seed '1764348242' \
    --request-rate 'inf' \
    --random-input-len '7000' \
    --random-output-len '3000' \
    --max-concurrency '20' \
    --num-prompts '100'

Test Result

  1. BaseThinkingReasoningParser tests are passing
================================================================================== test session starts ===================================================================================
platform darwin -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/hubert/code/hdlj-vllm
configfile: pyproject.toml
plugins: anyio-4.11.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 22 items                                                                                                                                                                       

tests/reasoning/test_base_thinking_reasoning_parser.py ......................      
  2. Bench results
    main
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             20
Benchmark duration (s):                  521.35
Total input tokens:                      700000
Total generated tokens:                  300000
Request throughput (req/s):              0.19
Request goodput (req/s):                 0.00
Output token throughput (tok/s):         575.43
Peak output token throughput (tok/s):    700.00
Peak concurrent requests:                26.00
Total Token throughput (tok/s):          1918.09
---------------Time to First Token----------------
Mean TTFT (ms):                          1562.85
Median TTFT (ms):                        1099.16
P25 TTFT (ms):                           1062.06
P50 TTFT (ms):                           1099.16
P75 TTFT (ms):                           1379.54
P90 TTFT (ms):                           3577.44
P95 TTFT (ms):                           4618.21
P99 TTFT (ms):                           5682.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.23
Median TPOT (ms):                        34.34
P25 TPOT (ms):                           34.04
P50 TPOT (ms):                           34.34
P75 TPOT (ms):                           34.45
P90 TPOT (ms):                           34.50
P95 TPOT (ms):                           34.59
P99 TPOT (ms):                           34.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.25
Median ITL (ms):                         33.13
P25 ITL (ms):                            30.91
P50 ITL (ms):                            33.13
P75 ITL (ms):                            35.33
P90 ITL (ms):                            36.58
P95 ITL (ms):                            37.27
P99 ITL (ms):                            42.25
----------------End-to-end Latency----------------
Mean E2EL (ms):                          104217.78
Median E2EL (ms):                        104375.43
P25 E2EL (ms):                           104054.62
P50 E2EL (ms):                           104375.43
P75 E2EL (ms):                           104696.93
P90 E2EL (ms):                           105622.35
P95 E2EL (ms):                           106714.51
P99 E2EL (ms):                           107775.61
==================================================

This PR:

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             20
Benchmark duration (s):                  291.45
Total input tokens:                      700000
Total generated tokens:                  300000
Request throughput (req/s):              0.34
Request goodput (req/s):                 0.00
Output token throughput (tok/s):         1029.32
Peak output token throughput (tok/s):    1240.00
Peak concurrent requests:                27.00
Total Token throughput (tok/s):          3431.06
---------------Time to First Token----------------
Mean TTFT (ms):                          1613.74
Median TTFT (ms):                        1290.23
P25 TTFT (ms):                           1032.16
P50 TTFT (ms):                           1290.23
P75 TTFT (ms):                           1540.27
P90 TTFT (ms):                           3590.01
P95 TTFT (ms):                           4826.16
P99 TTFT (ms):                           5603.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.88
Median TPOT (ms):                        18.89
P25 TPOT (ms):                           18.80
P50 TPOT (ms):                           18.89
P75 TPOT (ms):                           19.08
P90 TPOT (ms):                           19.18
P95 TPOT (ms):                           19.24
P99 TPOT (ms):                           19.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.96
Median ITL (ms):                         17.82
P25 ITL (ms):                            16.99
P50 ITL (ms):                            17.82
P75 ITL (ms):                            18.66
P90 ITL (ms):                            19.18
P95 ITL (ms):                            19.42
P99 ITL (ms):                            23.65
----------------End-to-end Latency----------------
Mean E2EL (ms):                          58247.72
Median E2EL (ms):                        58171.11
P25 E2EL (ms):                           57882.49
P50 E2EL (ms):                           58171.11
P75 E2EL (ms):                           58495.73
P90 E2EL (ms):                           60122.58
P95 E2EL (ms):                           61367.37
P99 E2EL (ms):                           62141.97
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify

mergify bot commented Dec 4, 2025

Documentation preview: https://vllm--30056.org.readthedocs.build/en/30056/

@mergify mergify bot added documentation Improvements or additions to documentation deepseek Related to DeepSeek models structured-output labels Dec 4, 2025
@mergify mergify bot added the v1 label Dec 4, 2025
@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from 49dcbe7 to cfd844c Compare December 4, 2025 11:29
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance improvement for structured output with single-token reasoning endings by adding the is_reasoning_end_on_decode_step method. The changes are well-structured, with updates to the core interface, implementations for various parsers, corresponding tests, and documentation. The benchmark results clearly demonstrate the effectiveness of the optimization. My main feedback is a critical performance issue in the olmo3_reasoning_parser where a major optimization opportunity was missed, which I've detailed in a specific comment.

        return self.think_end in text

    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
        return self.is_reasoning_end(input_ids)
Contributor

high

Calling self.is_reasoning_end() here is highly inefficient as it decodes the entire input_ids sequence on every step. This negates the performance benefits of this PR for olmo3 models.

A much more performant implementation should be used that avoids decoding the full sequence by checking only a suffix of the input_ids. Since this method is called at each step to detect the first appearance of the end marker, checking a suffix is a safe and efficient heuristic.

Suggested change
        return self.is_reasoning_end(input_ids)
        # Avoid decoding the whole sequence by checking only a suffix.
        suffix_len = 32
        if len(input_ids) < suffix_len:
            text_to_check = self.model_tokenizer.decode(input_ids)
        else:
            text_to_check = self.model_tokenizer.decode(input_ids[-suffix_len:])
        return self.think_end in text_to_check

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@chaunceyjiang chaunceyjiang self-assigned this Dec 4, 2025
@chaunceyjiang
Collaborator

Total Token throughput (tok/s): 1918.09
Total Token throughput (tok/s): 3431.06

Amazing! is_reasoning_end is a known issue, but I didn't expect it to have such a huge impact!!!

def is_reasoning_end(self, input_ids: list[int]) -> bool:
    return self.end_token_id in input_ids

def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
Collaborator

I suggest adding a **kwargs parameter to is_reasoning_end. What do you think?
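
For illustration, the kwargs-based variant being suggested could look roughly like this (a sketch of the idea only, not merged code):

def is_reasoning_end(self, input_ids: list[int], **kwargs) -> bool:
    # Hypothetical: a "step" kwarg switches to the cheap last-token check
    # during decode, while the default keeps the existing full scan.
    if kwargs.get("step") == "decode":
        return len(input_ids) > 0 and input_ids[-1] == self.end_token_id
    return self.end_token_id in input_ids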

Collaborator

It's currently known that there are some reasoning_parser plugins outside of vLLM. I'm concerned that this change might be too aggressive.


# Check if reasoning ends in *this* step
if self.reasoner.is_reasoning_end(request.all_token_ids):
if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
Collaborator

@chaunceyjiang chaunceyjiang Dec 4, 2025

Suggested change
if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
if self.reasoner.is_reasoning_end(request.all_token_ids, step="decode"):

WDYT?

Contributor Author

Thanks for the suggestion! If plugins use this interface externally, I agree that adding a new method could break them.
I was initially hesitating between your approach and mine.

One option is to provide more information to the is_reasoning_end function so it stays backward-compatible. However, looking at the code, the same class is used for two different purposes: at the frontend level (extract_... + is_reasoning_end) but also in the EngineCore process. Maybe another approach could be a dedicated class for structured output (instantiated per request), which would let us apply the same approach to multi-token end reasoning too. WDYT?

On the engine core side, we would call ReasoningParser -> get_structured_output_reasoning_checker(), with a backward-compatibility check if the function is not implemented by external plugins.

I will update my PR with your suggested changes.

Collaborator

One option is to provide more information to the is_reasoning_end function so it stays backward-compatible. However, looking at the code, the same class is used for two different purposes: at the frontend level (extract_... + is_reasoning_end) but also in the EngineCore process. Maybe another approach could be a dedicated class for structured output (instantiated per request), which would let us apply the same approach to multi-token end reasoning too. WDYT?

Yes, I completely agree with your point. In fact, I have previously tried to optimize is_reasoning_end, but I encountered a difficult issue: as you mentioned, is_reasoning_end is used in multiple places, especially in the frontend + streaming scenario. See #25735 (comment).

Given that is_reasoning_end now has such a significant impact on performance, I believe it's time to address this issue.

As I just mentioned, I'm concerned that overly aggressive changes could break downstream dependencies. I hope we can find a more backward-compatible way to implement the modifications.

Contributor Author

Thanks for your answer; this design suggestion is more for a future version, but indeed it would be breaking.

However, if I add a kwargs argument to is_reasoning_end, external plugins will still be broken, no?

Otherwise, we could provide a default implementation:

    @abstractmethod
    def is_reasoning_end_on_decode_step(self, input_ids: list[int]) -> bool:
        """
        Check if the reasoning content ends in the input_ids on a
        decode step.

        It is used in structured engines like `xgrammar` to check if the
        reasoning content ends in the model output before applying the
        structured output.

        Notes:
            - The first time the reasoning content ends during a decode step, this
            method returns True. StructuredOutputManager then caches the result.
            - Subsequent decode steps for the same reasoning segment can return
            False or True.

        Parameters:
        input_ids: list[int]
            The input_ids of the model output at the current decode step.

        Returns:
        bool
            True if the reasoning content ends in the input_ids on a
            decode step.
        """
        return self.is_reasoning_end(input_ids)

WDYT ?

Collaborator

go ahead


# Check if reasoning ends in *this* step
if self.reasoner.is_reasoning_end(request.all_token_ids):
if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
Collaborator

Suggested change
if self.reasoner.is_reasoning_end_on_decode_step(request.all_token_ids):
if self.reasoner.is_reasoning_end(request.all_token_ids[request.num_computed_tokens:]):

Collaborator

@hdlj-h 🤔 Perhaps this should be handled this way. WDYT?

Collaborator

We only need to compute the newly generated tokens, not the tokens that have already been computed.

Contributor Author

@hdlj-h hdlj-h Dec 4, 2025

Indeed, that is also an option, but some reasoning parsers rely on multi-token end markers, so the assumption that all parsers are single-token–based does not always hold.

For example:
• OLMo3 operates in string space rather than token space. It may not be compatible with structured output for this reason.
• Granite also works in string space and does not implement is_reasoning_end, meaning it could crash if used with structured output.
• Hunyuan maintains an internal state machine for reasoning, which likely makes it incompatible with structured output + reasoning.
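
For the multi-token case mentioned above, the check would have to match a whole marker sequence rather than a single id, roughly along these lines (a hypothetical sketch, not the actual gpt-oss parser code):

def contains_end_sequence(input_ids: list[int], end_token_ids: list[int]) -> bool:
    # Hypothetical helper: True if the multi-token end marker appears as a
    # contiguous run anywhere in the generated ids.
    n = len(end_token_ids)
    return n > 0 and any(
        input_ids[i : i + n] == end_token_ids
        for i in range(len(input_ids) - n + 1)
    )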

num_computed_tokens can be greater than 1 in V1? Or is it the output token counter? If that is the case, my PR is not bulletproof, because I assume that one decoding step == 1 generated token.

Collaborator

num_computed_tokens can be greater than 1 in the V1 ?

Yes.

Collaborator

OLMo3, Granite, and Hunyuan — I think we can set these aside for now. They are inherently incompatible with structured output to begin with.

Therefore, we should ignore them for the time being and optimize them separately later.

Contributor Author

@hdlj-h hdlj-h Dec 4, 2025

num_computed_tokens can be greater than 1 in the V1 ?

Yes.

Ah, I didn't take into account the speculative decoding case. So my PR is not compatible with situations where num_computed_tokens > 1.

So I will revert my change and apply your approach instead, which is a smaller change.

Contributor Author

@chaunceyjiang Regarding num_computed_tokens, I can't find the related docs: is it the number of computed tokens before the step, or the number of computed tokens during the step?

Collaborator

@chaunceyjiang chaunceyjiang left a comment

Thanks~

Can you help test the benchmark again?

@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 4, 2025
@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from 5a6d588 to 7192b99 Compare December 4, 2025 16:31
if self.reasoner.is_reasoning_end(request.all_token_ids):
if self.reasoner.is_reasoning_end(
    request.all_token_ids[request.num_computed_tokens :]
):
Collaborator

The e2e test failed because gpt-oss uses multiple tokens in reasoning_end_token_ids for judgment.

This approach doesn't seem to work well either.

Currently, there are two solutions. Which one do you prefer?

  1. Special handling for gpt-oss?
  2. Use your initial approach—add an is_reasoning_end_streaming method?

Contributor Author

Thanks for detecting the gpt-oss issue. If we add special handling for gpt-oss, we could end up in the same situation with other parsers that are multi-token based.

For the 2nd approach, I see two possible interfaces with a default implementation for backward compatibility:

is_reasoning_end_streaming(input_ids: list[int], diff: list[int])
is_reasoning_end_streaming(input_ids: list[int], num_computed_tokens: int)

I think the first signature is better because it could be reusable for frontend purposes, but the second one avoids a list copy at each decode step, though that is maybe negligible. WDYT?
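
To make the first option concrete, a rough sketch with a backward-compatible default could look like this (the method name follows the proposal above; none of this is merged code):

# On the ReasoningParser base class: backward-compatible default that
# ignores the delta and keeps the existing full scan.
def is_reasoning_end_streaming(self, input_ids: list[int], diff: list[int]) -> bool:
    return self.is_reasoning_end(input_ids)

# On BaseThinkingReasoningParser (single end token): only the tokens
# generated in this step are scanned.
def is_reasoning_end_streaming(self, input_ids: list[int], diff: list[int]) -> bool:
    return self.end_token_id in diff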

@mergify

mergify bot commented Dec 5, 2025

Documentation preview: https://vllm--30056.org.readthedocs.build/en/30056/

…endings.

Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
…oding step

Signed-off-by: hdlj-h <hubert@hcompany.ai>
Signed-off-by: hdlj-h <hubert@hcompany.ai>
@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from bc56849 to c7d58f4 Compare December 5, 2025 13:45
@hdlj-h
Contributor Author

hdlj-h commented Dec 5, 2025

@chaunceyjiang I have benchmarked the new approach, but I observed a 10% slowdown in decoding throughput if we support interleaved reasoning/content in the model output. Since this is not really supported by the structured output manager, I changed the implementation to only check the end_token_id in the delta. I am building the new image and will keep you posted with the new benchmark results.
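
For reference, the "check the end_token_id in the delta" approach described here could look roughly like this (a sketch under assumptions; the exact method name and call site may differ in the final change), where the delta would be request.all_token_ids[request.num_computed_tokens:] as in the earlier diff:

def is_reasoning_end_in_delta(self, delta: list[int]) -> bool:
    # Hypothetical name: only the tokens generated in this decode step are
    # scanned for the single reasoning end token.
    return self.end_token_id in delta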

@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from 264a4dc to d876ed3 Compare December 5, 2025 16:52
@mergify

mergify bot commented Dec 5, 2025

Hi @hdlj-h, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Signed-off-by: hdlj-h <hubert@hcompany.ai>
@hdlj-h hdlj-h force-pushed the add-decode-step-end-reasoning branch from d876ed3 to ae851f4 Compare December 5, 2025 17:24