
[Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled #18879


Merged
merged 1 commit into vllm-project:main from think_struct_output
May 31, 2025

Conversation

chaunceyjiang
Contributor

@chaunceyjiang chaunceyjiang commented May 29, 2025

FIX #18821 (comment)

The regression was introduced by PR #16577.
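
For context, a minimal sketch of the likely failure mode, using hypothetical names and an assumed token id rather than vLLM's actual internals: when a reasoning parser is configured, structured-output grammar enforcement is deferred until the end-of-thinking token has been generated. With enable_thinking=False, the Qwen3 chat template closes the think block inside the prompt itself, so that token never appears among the generated tokens and the guided-decoding grammar never engages:

# Illustrative sketch only; the names and the token id are assumptions,
# not vLLM's actual code.
THINK_END_TOKEN_ID = 151668  # assumed id of "</think>" in the Qwen3 tokenizer

def reasoning_ended_buggy(prompt_token_ids: list[int],
                          output_token_ids: list[int]) -> bool:
    # Buggy check: only generated tokens are inspected. When thinking is
    # disabled, "</think>" is already part of the prompt, so this never
    # returns True and the JSON grammar is never applied.
    return THINK_END_TOKEN_ID in output_token_ids

def reasoning_ended_fixed(prompt_token_ids: list[int],
                          output_token_ids: list[int]) -> bool:
    # Fixed check (conceptually what this PR does): treat reasoning as
    # already finished if the end-of-thinking token appears in the prompt.
    return (THINK_END_TOKEN_ID in prompt_token_ids
            or THINK_END_TOKEN_ID in output_token_ids)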


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, they run only fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@chaunceyjiang chaunceyjiang force-pushed the think_struct_output branch from f4267d7 to 4f20677 Compare May 29, 2025 10:22
[Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang
Contributor Author

Tested with both Qwen3 and DeepSeek-R1:

# vllm serve /home/jovyan/public-models/Deepseek-R1-Distill-Qwen-14B  --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser deepseek_r1


# vllm serve /home/jovyan/qwen3-32b-awq  --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3

python test.py

from openai import OpenAI
from pydantic import BaseModel


client = OpenAI(api_key="xxx", base_url="http://127.0.0.1:8000/v1")


class OutputModel(BaseModel):
    result: int


prompt = """\
123+456等于多少?
结果以JSON格式给出:
{{
    "result": "结果"
}}
"""


rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0,
)
print('---')
print(rsp.choices[0].message.content)
print('---')

rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}, "guided_json": OutputModel.model_json_schema()},
    temperature=0.7,
)
print('---')
print(rsp.choices[0].message.content)
print('---')

# Alternatively:
rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0,
)
print('---')
print(rsp.choices[0].message.content)
print('---')
# Before this fix, the following request would block the service:
rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0.7,
)
print('---')
print(rsp.choices[0].message.content)
print('---')
class Step(BaseModel):
    ground_truth_key_ideas: str 
    system_response_key_ideas: str
    discussion: str
    recall: float
    precision: float



# The plain client.chat.completions.create endpoint accepts the same
# extra_body; the beta parse endpoint is used below.
json_schema = Step.model_json_schema()

chat_response = client.beta.chat.completions.parse(
    model="",
    messages=[
        {'role': 'system',
        'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
        {'role': 'user',
        'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
    ],
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": json_schema},
)
print("-----")
print(chat_response.choices[0].message.content)
print("-----")

chat_response = client.beta.chat.completions.parse(
    model="",
    messages=[
        {'role': 'system',
        'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
        {'role': 'user',
        'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
    ],
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}, "guided_json": json_schema},
)
print("-----")
print(chat_response.choices[0].message.content)
print("-----")


Output:

---
{  
    "result": 579
}
---
---
{ "result": 579 }
---
---
{  
    "result": 579
}
---
---
{  
    "result": 579  
}
---
-----
{
  "ground_truth_key_ideas": "1. The action space for language modeling is equal to the vocabulary set of language models. 2. The vocabulary size is very large (tens of thousands of possible tokens). 3. Real-world locomotion can be condensed to three quantities (X, Y, or Z axes).",
  "system_response_key_ideas": "1. Action space in language modeling is large due to vocabulary size (e.g., 50k+ tokens). 2. Combinatorial explosion in sequence generation. 3. Discrete, high-cardinality decisions vs. real-world continuous control. 4. Techniques like actor-critic methods and action space reduction address challenges.",
  "discussion": "The system response accurately captures the key ideas from the ground truth, including the relationship between vocabulary size and action space, and the comparison to real-world action spaces. Additionally, the system response provides more detailed explanations, such as the combinatorial explosion in sequence generation and the distinction between discrete and continuous action spaces. It also mentions specific techniques for addressing these challenges, which are not covered in the ground truth.",
  "recall": 1.0,
  "precision": 0.6666666666666666
}
-----
-----
{
  "ground_truth_key_ideas": "1. The action space for language modeling is equal to the vocabulary set of language models. 2. The vocabulary size is very large (tens of thousands of possible tokens). 3. Real-world locomotion can be condensed to three axes (X, Y, Z) or linear combinations thereof.",
  "system_response_key_ideas": "1. Action space in language modeling is large due to vocabulary size (e.g., 50k+ tokens). 2. The action space involves discrete, high-cardinality decisions with combinatorial complexity. 3. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 4. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) manage variance and exploration in discrete, large vocabularies. 5. Action space reduction techniques (e.g., GALAD) are used for handling large vocabularies.",
  "discussion": "The system response fully covers all key ideas from the ground truth and adds additional details. It expands on the challenges of discrete, high-cardinality decisions in language modeling and contrasts them with continuous control in real-world actions. The system also mentions specific techniques to address these challenges, which were not discussed in the ground truth.",
  "recall": 1.0,
  "precision": 1.0
}
-----
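
As an optional sanity check (a hedged addition, not part of the original test script), the structured outputs above can be validated against the pydantic models whose schemas drove guided decoding:

# Hedged addition, not in the original script: confirm the structured
# outputs actually conform to the schemas used for guided decoding.
parsed = OutputModel.model_validate_json(rsp.choices[0].message.content)
assert parsed.result == 579

step = Step.model_validate_json(chat_response.choices[0].message.content)
assert 0.0 <= step.recall <= 1.0
assert 0.0 <= step.precision <= 1.0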

@chaunceyjiang
Contributor Author

/cc @aarnphm PTAL.

Collaborator

@aarnphm aarnphm left a comment


Thanks.

@chaunceyjiang
Contributor Author

@DarkLight1337 PTAL.

Member

@DarkLight1337 DarkLight1337 left a comment


Stamp

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) May 31, 2025 06:51
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 31, 2025
@DarkLight1337 DarkLight1337 merged commit ba5111f into vllm-project:main May 31, 2025
65 of 67 checks passed
Labels
ready, structured-output, v1
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: In version v0.9.0, Qwen3-32B-AWQ errors when thinking is turned off and guided_json is used simultaneously.
3 participants