TLDR: We would like an option we can enable to continuously stream the UsageInfo when using the streaming completions API. This solves a number of "accounting" problems encountered while trying to do accurate performance evaluation.
Motivation:
We are working on performance evaluation for vLLM's more advanced features (chunked prefill, speculative decoding) and have run into a few problems that we feel would be solved by adding a simple new feature. Our benchmarking framework fmperf computes ITL by measuring the latency between consecutive streaming responses; in addition, it computes throughput by inspecting each streaming response and counting the number of tokens in it. Currently, vLLM provides no information about how many tokens are contained within each response. In most situations it is just a single token. However, there are a few scenarios where this is not the case (a short sketch of the resulting accounting problem follows the two scenarios below):
When chunked prefill is enabled and a prompt gets chunked across multiple iterations, the first few responses may contain zero tokens. There is no special indication that this has happened beyond an empty string "" being returned. We have found scenarios where "" is actually a valid token, so simply ignoring such responses may lead to us discarding responses that are actually valid. It would be nice to have some explicit indication that the response is truly empty.
When speculative decoding is enabled, each streaming response may contain more than one token. Right now we just get the text rather than the actual tokens, so we can't tell exactly how many tokens were generated without either (a) running it through a tokenizer again or (b) enabling logprobs or similar, which may have performance implications for speculative decoding.
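To make the problem concrete, here is a minimal sketch (not the actual fmperf code) of the kind of per-response accounting described above. It assumes a vLLM OpenAI-compatible server on localhost and the `openai` Python client; the model name is a placeholder.

```python
# Minimal sketch of the client-side accounting problem (not the actual fmperf code).
# Assumes a vLLM OpenAI-compatible server at localhost:8000; model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one sentence.",
    max_tokens=64,
    stream=True,
)

itls, token_chunks = [], 0
prev = time.perf_counter()
for chunk in stream:
    now = time.perf_counter()
    itls.append(now - prev)  # inter-token latency, assuming one token per response
    prev = now
    token_chunks += 1        # also assumes one token per response -- not always true
    text = chunk.choices[0].text
    # Problem: `text` may correspond to zero tokens (chunked prefill) or to several
    # tokens (speculative decoding), so neither `itls` nor `token_chunks` is reliable
    # without knowing the per-response token count.

print(f"responses: {token_chunks}, mean ITL: {sum(itls) / len(itls):.4f}s")
```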
We have considered an architectural change to fmperf where, instead of counting the tokens for each streaming response, we simply wait until the final response has been received. vLLM provides the OpenAI-compliant include_usage option that gives us all the stats we need at the very end. This is helpful, but when we are benchmarking we often want to run an experiment for a specific duration, and requests that run over that duration get cancelled on the client side. Currently, we have no way to account for the tokens that were generated in such a partially-executed request. Similarly, if a request fails for whatever reason partway through its execution, we have no way to get the stats out. There were actually a bunch of comments from people with similar concerns (e.g. here and here) when OpenAI announced this new feature.
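For reference, this is roughly how the end-of-stream usage reporting is requested today (a sketch; the `stream_options` / `include_usage` parameters follow the OpenAI streaming spec that vLLM implements):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Existing behaviour: with include_usage, usage stats arrive once, in the final chunk.
stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one sentence.",
    max_tokens=64,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.usage is not None:  # only populated on the final chunk
        print(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)
# If the request is cancelled or fails mid-stream, that final chunk never arrives,
# and the tokens generated so far are unaccounted for.
```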
We would like to propose adding a simple new option to vLLM that enables continuous streaming of the usage stats with every single streaming response. It is not a big change, but it enables more accurate accounting of the number of tokens generated. The default behaviour can remain as-is (usage stats at the end of the request).
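To sketch what we have in mind from the client's perspective (the `continuous_usage_stats` option name below is purely illustrative; the actual name and shape will be settled in the PR):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical option name -- for illustration only, not an existing vLLM flag.
stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one sentence.",
    max_tokens=64,
    stream=True,
    stream_options={"include_usage": True, "continuous_usage_stats": True},
)

seen = 0
for chunk in stream:
    if chunk.usage is not None:
        # Tokens contributed by this particular response: 0 for a chunked-prefill
        # iteration, >1 when speculative decoding accepts several tokens at once.
        delta = chunk.usage.completion_tokens - seen
        seen = chunk.usage.completion_tokens
        print(f"this response carried {delta} token(s)")
```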
Alternatives
see above
Additional context
We are preparing a PR with this feature which we will post shortly.