
Add spec infer related metrics into Prometheus #4582

Open · leiwen83 wants to merge 2 commits into main from spec_infer_prometheus_metric

Conversation

@leiwen83 (Contributor) commented May 3, 2024

This PR adds speculative-inference stats to the Prometheus metrics, including a new boost_ratio metric that directly shows how much speculative inference helps in saving decoding steps.
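As a hedged illustration of what "saving decoding steps" means here (the numbers are invented, and the exact definition lives in the diff, not quoted in this thread):

```math
\text{boost\_ratio} = \frac{\text{num\_emitted\_tokens}}{\text{num\_decode\_steps}} = \frac{1000}{400} = 2.5
```

With those invented numbers, plain autoregressive decoding would have needed 1000 steps for the same tokens, so speculation saved 60% of them.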

@leiwen83 (Contributor, Author) commented May 3, 2024

cc @cadedaniel

@cadedaniel (Collaborator) commented:

will take a look Monday. btw, how is this different from the system efficiency metric? (boost_ratio == (num_spec_tokens + 1) * system_efficiency?)

@robertgshaw2-neuralmagic (Collaborator) left a comment:

Please update the Grafana Dashboard

@@ -59,7 +59,19 @@ def __init__(self, labelnames: List[str], max_model_len: int):
            name="vllm:cpu_cache_usage_perc",
            documentation="CPU KV-cache usage. 1 means 100 percent usage.",
            labelnames=labelnames)

        # Speculative infer Status in %
Collaborator:

Please add better descriptions.

Please name these:

  • vllm:spec_decode_system_efficiency
  • vllm:spec_decode_boost_ratio
  • vllm:spec_decode_draft_acceptance_rate

What is the difference between these?
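A minimal sketch of how the suggested names could look, mirroring the Gauge pattern from the hunk above (the documentation strings are placeholders, not taken from the PR):

```python
from prometheus_client import Gauge

# Stand-in for the labelnames parameter passed to __init__ in the diff above.
labelnames = ["model_name"]

gauge_spec_decode_draft_acceptance_rate = Gauge(
    name="vllm:spec_decode_draft_acceptance_rate",
    documentation="Fraction of proposed draft tokens accepted by the target model.",
    labelnames=labelnames)
gauge_spec_decode_efficiency = Gauge(
    name="vllm:spec_decode_system_efficiency",
    documentation="Emitted tokens divided by the maximum possible, (k + 1) per step.",
    labelnames=labelnames)
gauge_spec_decode_boost_ratio = Gauge(
    name="vllm:spec_decode_boost_ratio",
    documentation="Average number of tokens emitted per decode step.",
    labelnames=labelnames)
```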

Collaborator:

We prefer to use Counters >> Gauges

Is there a way these metrics could be expressed as Counters with the rate function used in PromQL to compute the rates?
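A minimal sketch of that Counter-based shape with prometheus_client (metric names are assumptions for illustration, not taken from the PR):

```python
from prometheus_client import Counter

# Monotonic counters; rates and ratios are derived at query time
# instead of being pre-computed inside the server.
counter_spec_decode_draft_tokens = Counter(
    "vllm:spec_decode_num_draft_tokens_total",
    "Total number of tokens proposed by the speculative method.")
counter_spec_decode_accepted_tokens = Counter(
    "vllm:spec_decode_num_accepted_tokens_total",
    "Total number of draft tokens accepted by the target model.")
counter_spec_decode_emitted_tokens = Counter(
    "vllm:spec_decode_num_emitted_tokens_total",
    "Total number of tokens emitted: accepted tokens plus bonus tokens.")
```

rate() or increase() over such counters then gives windowed acceptance rates with no extra server-side state.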

Contributor (Author):

maybe we could directly express the total emitted tokens, along with the number of steps, so that users can do whatever calculation they want with those counters?
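For instance, a step counter alongside the token counters sketched above (names again hypothetical) would let users derive both ratios purely in PromQL:

```python
from prometheus_client import Counter

# Incremented once per decode step while speculative decoding is active.
counter_spec_decode_steps = Counter(
    "vllm:spec_decode_num_steps_total",
    "Total number of decode steps executed with speculative decoding.")

# Example PromQL over these counters (queries are illustrative):
#   boost ratio over the last 5 minutes:
#     increase(vllm:spec_decode_num_emitted_tokens_total[5m])
#       / increase(vllm:spec_decode_num_steps_total[5m])
#   draft acceptance rate over the last 5 minutes:
#     increase(vllm:spec_decode_num_accepted_tokens_total[5m])
#       / increase(vllm:spec_decode_num_draft_tokens_total[5m])
```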

@robertgshaw2-neuralmagic (Collaborator) commented:

> will take a look Monday. btw, how is this different from the system efficiency metric? (boost_ratio == (num_spec_tokens + 1) * system_efficiency?)

+1

@robertgshaw2-neuralmagic (Collaborator) commented:

Thanks for the contribution! It would be great to have these metrics flowing through prometheus!

@leiwen83 (Contributor, Author) commented May 4, 2024

> will take a look Monday. btw, how is this different from the system efficiency metric? (boost_ratio == (num_spec_tokens + 1) * system_efficiency?)

the new boost_ratio gives a more accurate picture of how much the system benefits from speculative inference, since there are cases where spec infer yields no proposal, e.g. when ngram finds no match or when seqlen + spec would exceed the model length.

Furthermore, with the new dynamic spec decoding coming in #4565, k would no longer be constant, so we may need to accumulate the actual tokens emitted and compare against the number of steps.
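To make the distinction concrete, a sketch using the quantities discussed in this thread (these are readings of the discussion, not formulas copied from the PR):

```math
\text{boost\_ratio} = \frac{\text{num\_emitted\_tokens}}{\text{num\_decode\_steps}},
\qquad
\text{system\_efficiency} = \frac{\text{num\_emitted\_tokens}}{(k + 1) \cdot \text{num\_decode\_steps}}
```

So boost_ratio == (num_spec_tokens + 1) * system_efficiency only holds when every step actually proposes k tokens; steps with no proposal, or a dynamic k, break the identity.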

@leiwen83 force-pushed the spec_infer_prometheus_metric branch from bf96a47 to a6bf575 on May 4, 2024
@leiwen83 (Contributor, Author) commented May 8, 2024

@cadedaniel @robertgshaw2-neuralmagic
Any comments on the latest PR change? :)

@cadedaniel (Collaborator) commented:

asking @LiuXiaoxuanPKU if she has bandwidth to review the PR. the approach looks good to me; concerns are 1) we should make sure the top-level metrics make sense to users (not just to us as developers), 2) the naming of the metrics collection seems weird

@robertgshaw2-neuralmagic (Collaborator) commented May 9, 2024

Reviewed. Cade and I are discussing a path forward.

@leiwen83 (Contributor, Author) commented:

Hi @robertgshaw2-neuralmagic @cadedaniel,

How is it going with the spec-related metrics? Have we reached a conclusion on how to make this happen? ;)
The metrics are critical to us as direct feedback on how well the current spec system is doing.

@cadedaniel (Collaborator) commented:

thanks & sorry this slipped. I might have time tomorrow to finish review. cc @LiuXiaoxuanPKU and @comaniac who might have bandwidth.

@cadedaniel (Collaborator) left a comment:

I have bandwidth to review this now. The Prometheus stuff looks good to me; I have concerns about the metric definitions and how we collect them.

Comment on lines 339 to 341:

        # batch_size here may be 0, since there are cases
        # where no proposal is generated
        self.num_specs += output_with_bonus_tokens.size()[0]
Collaborator:

Is this right? even if there are no specs this can be nonzero

Contributor (Author):

For the ngram case, if we don't match anything, then there would be no proposal.

Collaborator:

OK. These metrics should work for the draft model as well.

Comment on lines 332 to 333:

        output_with_bonus_tokens = torch.cat(
            [output_with_bonus_tokens, non_spec_token_ids])
Collaborator:

num_emitted_tokens will include non-spec tokens, making the metric useless for capturing spec system efficiency (100% efficiency is defined as every token accepted, plus the bonus token)

Comment on lines 326 to 462:

        accepted_token_ids = torch.cat(
            [accepted_token_ids, non_spec_token_ids])
Collaborator:

btw, this cat should stay here; the format required for the cat is determined by `original_indices = spec_indices + non_spec_indices`.

@@ -162,6 +168,7 @@ def _collect_rejsample_metrics(

        return SpecDecodeWorkerMetrics(
            num_spec_tokens=k,
            num_specs=self._aggregate_num_specs,
Collaborator:

we can calculate this as draft_tokens // k; no need to record it separately

Collaborator:

see get_max_num_emitted_tokens

And add a new boost_ratio metric used to directly show how much spec
infer would help in saving decoding steps.

Signed-off-by: Lei Wen <wenlei03@qiyi.com>
@leiwen83 force-pushed the spec_infer_prometheus_metric branch from a6bf575 to 5306db8 on June 7, 2024
@leiwen83 (Contributor, Author) commented June 7, 2024

@cadedaniel
I submitted a rebased PR, which keeps the concat logic as before. num_specs is now made to aggregate the "k" values.
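A rough sketch of what that aggregation could look like so the metrics survive a non-constant k (class and attribute names here are hypothetical, not read from the diff):

```python
class SpecDecodeStatAggregator:
    """Accumulates per-step speculative-decoding stats for metric export."""

    def __init__(self) -> None:
        self._aggregate_num_specs = 0    # sum of k across all spec steps
        self._aggregate_num_steps = 0    # number of decode steps observed
        self._aggregate_num_emitted = 0  # accepted tokens plus bonus tokens

    def record_step(self, k: int, num_emitted_tokens: int) -> None:
        # With dynamic k (#4565), k can differ per step, so it is summed
        # rather than assumed constant.
        self._aggregate_num_specs += k
        self._aggregate_num_steps += 1
        self._aggregate_num_emitted += num_emitted_tokens

    @property
    def boost_ratio(self) -> float:
        # Tokens emitted per decode step; 1.0 means speculation saved nothing.
        if self._aggregate_num_steps == 0:
            return 0.0
        return self._aggregate_num_emitted / self._aggregate_num_steps
```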

github-actions bot commented:

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 28, 2024
@github-actions github-actions bot added unstale and removed stale labels Nov 28, 2024
mergify bot commented Nov 28, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @leiwen83.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 28, 2024