Add spec infer related metrics into Prometheus #4582
base: main
Conversation
cc @cadedaniel
will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens + 1) * system efficiency?)
Please update the Grafana Dashboard
```python
@@ -59,7 +59,19 @@ def __init__(self, labelnames: List[str], max_model_len: int):
            name="vllm:cpu_cache_usage_perc",
            documentation="CPU KV-cache usage. 1 means 100 percent usage.",
            labelnames=labelnames)

        # Speculative infer Status in %
```
Please add better descriptions.
Please name these:
- vllm:spec_decode_system_efficiency
- vllm:spec_decode_boost_ratio
- vllm:spec_decode_draft_acceptance_rate

What is the difference between these?
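A minimal sketch of what that renaming could look like, following the Gauge pattern in the diff above; the label set and documentation strings below are illustrative guesses, not the PR's final text:

```python
from prometheus_client import Gauge

labelnames = ["model_name"]  # assumed label set, mirroring the surrounding StatLogger code

gauge_spec_decode_draft_acceptance_rate = Gauge(
    name="vllm:spec_decode_draft_acceptance_rate",
    documentation="Fraction of proposed draft tokens accepted by the target model.",
    labelnames=labelnames)
gauge_spec_decode_system_efficiency = Gauge(
    name="vllm:spec_decode_system_efficiency",
    documentation="Emitted tokens out of the per-step maximum (k accepted + 1 bonus). "
                  "1.0 means every proposal and the bonus token were emitted.",
    labelnames=labelnames)
gauge_spec_decode_boost_ratio = Gauge(
    name="vllm:spec_decode_boost_ratio",
    documentation="Average tokens emitted per decoding step; values above 1 "
                  "mean speculation is saving steps.",
    labelnames=labelnames)
```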
We prefer to use Counters >> Gauges.
Is there a way these metrics could be expressed as Counters, with the rate() function in PromQL used to compute the rates?
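For illustration, a Counter-based variant might look like the sketch below; the metric and label names are assumptions, not from the PR, and the ratios are then derived in PromQL rather than computed in-process:

```python
from prometheus_client import Counter

labelnames = ["model_name"]  # assumed label set

# Export raw totals only; users derive rates in PromQL, e.g.
#   rate(vllm:spec_decode_num_emitted_tokens_total[1m])
#     / rate(vllm:spec_decode_num_steps_total[1m])
# gives the average tokens emitted per step over the window.
counter_emitted_tokens = Counter(
    name="vllm:spec_decode_num_emitted_tokens",  # the client appends "_total"
    documentation="Total tokens emitted by speculative decoding.",
    labelnames=labelnames)
counter_steps = Counter(
    name="vllm:spec_decode_num_steps",
    documentation="Total speculative decoding steps executed.",
    labelnames=labelnames)

# Per-iteration update (illustrative values):
counter_emitted_tokens.labels(model_name="llama").inc(37)
counter_steps.labels(model_name="llama").inc(10)
```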
maybe we could directly express the total emitted tokens, along with the number of steps? so that users could do whatever calculation they want with those counters?
+1
Thanks for the contribution! It would be great to have these metrics flowing through Prometheus!
the new boost_ratio would express more accurately how much the system benefits from spec infer, as there are cases where spec gives no proposal, e.g. no match in ngram, or seq_len + spec exceeding the model length. Furthermore, with the new dynamic spec coming in #4565, k would not be a constant, so we may need to accumulate the actual tokens emitted and compare against the number of steps.
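As a rough worked example of that definition (numbers invented for illustration):

```python
num_steps = 1000           # decoding iterations with spec decode enabled
num_emitted_tokens = 3200  # tokens actually emitted across those steps

# boost_ratio: tokens per step relative to the 1-token-per-step baseline.
boost_ratio = num_emitted_tokens / num_steps  # 3.2, i.e. ~3.2x fewer steps

# Because it only needs two accumulated counters, this definition stays
# well defined even when k varies per step (the dynamic spec case, #4565).
print(boost_ratio)
```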
Force-pushed from bf96a47 to a6bf575
@cadedaniel @robertgshaw2-neuralmagic
asking @LiuXiaoxuanPKU if she has bandwidth to review the PR. the approach looks good to me; concerns are (1) we should make sure the top-level metrics make sense to users (not just to us as developers), (2) the naming of the metrics collection seems weird
reviewed; cade + I are discussing a path fwd
Hi @robertgshaw2-neuralmagic @cadedaniel, how is it going with the spec-related metrics? Have we reached a conclusion on how to make it happen? ;)
thanks & sorry this slipped. I might have time tomorrow to finish review. cc @LiuXiaoxuanPKU and @comaniac who might have bandwidth.
I have bandwidth to review this now. The Prometheus stuff looks good to me; I have concerns about the metric definitions and how we collect them.
```python
# batch_size here may be 0, as there are cases
# where no proposal is generated
self.num_specs += output_with_bonus_tokens.size()[0]
```
Is this right? Even if there are no specs this can be nonzero.
For the ngram case, if we don't match anything, then there would be no proposal.
OK. These metrics should work for draft model as well.
```python
output_with_bonus_tokens = torch.cat(
    [output_with_bonus_tokens, non_spec_token_ids])
```
num_emitted_tokens will include non-spec tokens, making the metric useless for capturing spec system efficiency (100% efficiency is defined as every proposed token accepted plus the bonus token).
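For reference, a small sketch of the efficiency definition being cited here (names and values illustrative, not the PR's code):

```python
k = 4                           # fixed proposal length per step
num_steps = 1000
num_emitted_spec_tokens = 3200  # accepted draft tokens + bonus tokens only

# Each step can emit at most k accepted tokens plus 1 bonus token.
max_emitted = num_steps * (k + 1)                          # 5000
system_efficiency = num_emitted_spec_tokens / max_emitted  # 0.64

# Folding non-spec tokens into the numerator inflates it while the
# denominator stays fixed, which is the distortion flagged above.
print(system_efficiency)
```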
```python
accepted_token_ids = torch.cat(
    [accepted_token_ids, non_spec_token_ids])
```
btw this cat should stay here; the format required for the cat is determined here: original_indices = spec_indices + non_spec_indices
```python
@@ -162,6 +168,7 @@ def _collect_rejsample_metrics(

        return SpecDecodeWorkerMetrics(
            num_spec_tokens=k,
            num_specs=self._aggregate_num_specs,
```
we can calculate this with draft_tokens // k; we don't need to record it.
see get_max_num_emitted_tokens
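A sketch of that derivation, which only holds while k is constant per step (names and values illustrative):

```python
k = 4                              # fixed proposal length
aggregate_num_draft_tokens = 4000  # illustrative running total

# Every step that produced a proposal contributed exactly k draft tokens,
# so the speculative step count can be recovered without a new field:
num_specs = aggregate_num_draft_tokens // k  # 1000

# Caveat: once k becomes dynamic (#4565) this no longer holds and an
# explicit counter would be needed after all.
print(num_specs)
```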
And add a new boost_ratio metric used to directly show how much spec infer would help in saving decoding steps. Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Force-pushed from a6bf575 to 5306db8
@cadedaniel
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.