Description
Your current environment
Hardware / Python dependencies are not relevant to this issue.
🐛 Describe the bug
Since vllm==0.9.0, streaming chat completion outputs from the OpenAI-compatible server no longer include the `"finish_reason": null` field in the HTTP response stream, e.g.:
data: {"id":"chatcmpl-c4eefdc442064e16be757120a2ec7703","object":"chat.completion.chunk","created":1749904375,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c4eefdc442064e16be757120a2ec7703","object":"chat.completion.chunk","created":1749904375,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"delta":{"content":"안","tool_calls":[]}}]}
data: {"id":"chatcmpl-c4eefdc442064e16be757120a2ec7703","object":"chat.completion.chunk","created":1749904375,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"delta":{"content":"녕","tool_calls":[]}}]}
...
Expected chat completion output (i.e., the output from vllm==0.8.5.post1) for comparison:
data: {"id":"chatcmpl-0c0dd0fa4fae4544998222ce6edde854","object":"chat.completion.chunk","created":1749904312,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-0c0dd0fa4fae4544998222ce6edde854","object":"chat.completion.chunk","created":1749904312,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"delta":{"content":"안"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-0c0dd0fa4fae4544998222ce6edde854","object":"chat.completion.chunk","created":1749904312,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"delta":{"content":"녕"},"logprobs":null,"finish_reason":null}]}
...
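For reference, a minimal reproduction sketch (it assumes a vLLM OpenAI-compatible server is already running at http://localhost:8000 and serving the model above) that prints whether each streamed chunk carries the `finish_reason` key:

```python
# Minimal reproduction sketch (assumption: a vLLM OpenAI-compatible server is
# already running at http://localhost:8000 serving the model below). It prints
# whether each streamed chunk carries the "finish_reason" key.
import json

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/llama-4-scout-17b-16e-instruct",
        "messages": [{"role": "user", "content": "안녕하세요"}],
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: ") or line == "data: [DONE]":
        continue
    choice = json.loads(line[len("data: "):])["choices"][0]
    # vllm==0.8.5.post1 prints None for every intermediate chunk;
    # vllm==0.9.0 prints <missing> for the content chunks instead.
    print(choice["finish_reason"] if "finish_reason" in choice else "<missing>")
```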
According to the official OpenAI API reference, the `finish_reason` field should be present in every chat completion chunk; see the streaming response example there. The current behavior is also inconsistent with vLLM's own text completion chunks:
data: {"id":"cmpl-fcf5cc0826e8476d962e9c5f4130b368","object":"text_completion","created":1749906268,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"text":" 난","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-fcf5cc0826e8476d962e9c5f4130b368","object":"text_completion","created":1749906268,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"text":" 누","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-fcf5cc0826e8476d962e9c5f4130b368","object":"text_completion","created":1749906268,"model":"meta-llama/llama-4-scout-17b-16e-instruct","choices":[{"index":0,"text":"군","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
As shown above, text completion chunks still preserve the `"finish_reason": null` field.
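This also matters for downstream consumers: a strict client-side parser that models each streamed choice with a required (but nullable) `finish_reason` field, like the hypothetical one below, accepts the 0.8.5.post1 chunks but rejects the 0.9.0 ones.

```python
# Hypothetical strict client-side parser (for illustration only, not a real
# client library): finish_reason must be present in every choice, but may be null.
from typing import Optional

from pydantic import BaseModel


class StreamChoice(BaseModel):
    index: int
    finish_reason: Optional[str]  # required key, nullable value


# Chunk shape from vllm==0.8.5.post1: validates fine.
StreamChoice.model_validate(
    {"index": 0, "delta": {"content": "안"}, "logprobs": None, "finish_reason": None}
)

# Chunk shape from vllm==0.9.0: raises ValidationError because the
# finish_reason key is missing entirely.
StreamChoice.model_validate({"index": 0, "delta": {"content": "안", "tool_calls": []}})
```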
This behavioral change was introduced in #17340, where the JSON-dumping call was changed from `ChatCompletionStreamResponse.model_dump_json(exclude_unset=True)` to `ChatCompletionStreamResponse.model_dump_json(exclude_none=True)`. It seems we need a better way to handle the `DeltaToolCall` serialization than a blanket `exclude_none`.
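To illustrate the difference, here is a minimal Pydantic sketch (simplified stand-ins for vLLM's response models, not the actual classes): `exclude_unset` only drops fields that were never assigned, so an explicit `finish_reason=None` survives, whereas `exclude_none` drops every `None`-valued field.

```python
# Simplified stand-ins for the streaming response models (not vLLM's actual
# classes), showing the effect of exclude_unset vs. exclude_none.
from typing import Optional

from pydantic import BaseModel


class DeltaMessage(BaseModel):
    content: Optional[str] = None
    tool_calls: list = []


class StreamChoice(BaseModel):
    index: int
    delta: DeltaMessage
    finish_reason: Optional[str] = None


# finish_reason is explicitly assigned None; tool_calls is left at its default.
choice = StreamChoice(index=0, delta=DeltaMessage(content="안"), finish_reason=None)

# exclude_unset keeps the explicitly assigned finish_reason=None
# (the vllm==0.8.5.post1 chunk shape):
print(choice.model_dump_json(exclude_unset=True))
# {"index":0,"delta":{"content":"안"},"finish_reason":null}

# exclude_none drops every None-valued field, so finish_reason disappears
# (the vllm==0.9.0 chunk shape):
print(choice.model_dump_json(exclude_none=True))
# {"index":0,"delta":{"content":"안","tool_calls":[]}}
```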
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.