TLDR: We would like an option we can enable to continuously stream the UsageInfo when using the streaming completions API. This solves a number of "accounting" problems encountered while trying to do accurate performance evaluation.
Motivation:
We are working on performance evaluation for vLLM's more advanced features (chunked prefill, speculative decoding) and have run into a few problems that we feel would be solved by adding a simple new feature. Our benchmarking framework fmperf computes ITL by measuring the latency between consecutive streaming responses; in addition, it computes throughput by inspecting each streaming response and counting the number of tokens in it. Currently, vLLM provides no information about how many tokens are contained within each response. In most situations it is just a single token. However, there are a few scenarios where this is not the case (a short sketch of the resulting accounting problem follows the two scenarios below):
When chunked prefill is enabled and a prompt gets chunked across multiple iterations, the first few responses may contain zero tokens. There is no special indication that this has happened beyond an empty string "" being returned. We have found scenarios where "" is actually a valid token, so simply ignoring such responses may lead to us discarding responses that are actually valid. It would be nice to have some explicit indication that the response is truly empty.
When speculative decoding is enabled, each streaming response may contain more than one token. Right now we just get the text rather than the actual tokens, so we can't tell exactly how many tokens were generated without either (a) running it through a tokenizer again or (b) enabling logprobs or similar, which may have performance implications for speculative decoding.
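To make the problem concrete, here is a minimal sketch (not the actual fmperf code) of the kind of per-response accounting described above. It assumes a vLLM OpenAI-compatible server on localhost and the `openai` Python client; the model name is a placeholder.

```python
# Minimal sketch of the client-side accounting problem (not the actual fmperf code).
# Assumes a vLLM OpenAI-compatible server at localhost:8000; model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one sentence.",
    max_tokens=64,
    stream=True,
)

itls, token_chunks = [], 0
prev = time.perf_counter()
for chunk in stream:
    now = time.perf_counter()
    itls.append(now - prev)  # inter-token latency, assuming one token per response
    prev = now
    token_chunks += 1        # also assumes one token per response -- not always true
    text = chunk.choices[0].text
    # Problem: `text` may correspond to zero tokens (chunked prefill) or to several
    # tokens (speculative decoding), so neither `itls` nor `token_chunks` is reliable
    # without knowing the per-response token count.

print(f"responses: {token_chunks}, mean ITL: {sum(itls) / len(itls):.4f}s")
```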
We have considered an architectural change to fmperf where, instead of counting the tokens for each streaming response, we simply wait until the final response has been received. vLLM provides the OpenAI-compliant include_usage option that gives us all the stats we need at the very end. This is helpful, but when we are benchmarking we often want to run an experiment for a specific duration, and requests that run over that duration get cancelled on the client side. Currently, we have no way to account for the tokens that were generated in such a partially-executed request. Similarly, if a request fails for whatever reason partway through its execution, we have no way to get the stats out. There were actually a bunch of comments from people with similar concerns (e.g. here and here) when OpenAI announced this new feature.
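For reference, this is roughly how the end-of-stream usage reporting is requested today (a sketch; the `stream_options` / `include_usage` parameters follow the OpenAI streaming spec that vLLM implements):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Existing behaviour: with include_usage, usage stats arrive once, in the final chunk.
stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one sentence.",
    max_tokens=64,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.usage is not None:  # only populated on the final chunk
        print(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)
# If the request is cancelled or fails mid-stream, that final chunk never arrives,
# and the tokens generated so far are unaccounted for.
```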
We would like to propose adding a simple new option to vLLM that enables continuous streaming of the usage stats with every single streaming response. It is not a big change, but it enables more accurate accounting of the number of tokens generated. The default behaviour can remain as-is (usage stats at the end of the request).
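To sketch what we have in mind from the client's perspective (the `continuous_usage_stats` option name below is purely illustrative; the actual name and shape will be settled in the PR):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical option name -- for illustration only, not an existing vLLM flag.
stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one sentence.",
    max_tokens=64,
    stream=True,
    stream_options={"include_usage": True, "continuous_usage_stats": True},
)

seen = 0
for chunk in stream:
    if chunk.usage is not None:
        # Tokens contributed by this particular response: 0 for a chunked-prefill
        # iteration, >1 when speculative decoding accepts several tokens at once.
        delta = chunk.usage.completion_tokens - seen
        seen = chunk.usage.completion_tokens
        print(f"this response carried {delta} token(s)")
```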
Alternatives
see above
Additional context
We are preparing a PR with this feature which we will post shortly.