Add spec infer related metrics into Prometheus #4582
base: main
Conversation
cc @cadedaniel
will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens + 1) * system efficiency?)
Please update the Grafana Dashboard
```python
@@ -59,7 +59,19 @@ def __init__(self, labelnames: List[str], max_model_len: int):
            name="vllm:cpu_cache_usage_perc",
            documentation="CPU KV-cache usage. 1 means 100 percent usage.",
            labelnames=labelnames)

        # Speculative infer Status in %
```
Please add better descriptions.
Please name these:
- vllm:spec_decode_system_efficiency
- vllm:spec_decode_boost_ratio
- vllm:spec_decode_draft_acceptance_rate

What is the difference between these?
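A minimal sketch of what that renaming could look like, following the Gauge pattern in the diff above; the label set and documentation strings below are illustrative guesses, not the PR's final text:

```python
from prometheus_client import Gauge

labelnames = ["model_name"]  # assumed label set, mirroring the surrounding StatLogger code

gauge_spec_decode_draft_acceptance_rate = Gauge(
    name="vllm:spec_decode_draft_acceptance_rate",
    documentation="Fraction of proposed draft tokens accepted by the target model.",
    labelnames=labelnames)
gauge_spec_decode_system_efficiency = Gauge(
    name="vllm:spec_decode_system_efficiency",
    documentation="Emitted tokens out of the per-step maximum (k accepted + 1 bonus). "
                  "1.0 means every proposal and the bonus token were emitted.",
    labelnames=labelnames)
gauge_spec_decode_boost_ratio = Gauge(
    name="vllm:spec_decode_boost_ratio",
    documentation="Average tokens emitted per decoding step; values above 1 "
                  "mean speculation is saving steps.",
    labelnames=labelnames)
```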
We prefer to use Counters >> Gauges.
Is there a way these metrics could be expressed as Counters, with the rate() function in PromQL used to compute the rates?
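For illustration, a Counter-based variant might look like the sketch below; the metric and label names are assumptions, not from the PR, and the ratios are then derived in PromQL rather than computed in-process:

```python
from prometheus_client import Counter

labelnames = ["model_name"]  # assumed label set

# Export raw totals only; users derive rates in PromQL, e.g.
#   rate(vllm:spec_decode_num_emitted_tokens_total[1m])
#     / rate(vllm:spec_decode_num_steps_total[1m])
# gives the average tokens emitted per step over the window.
counter_emitted_tokens = Counter(
    name="vllm:spec_decode_num_emitted_tokens",  # the client appends "_total"
    documentation="Total tokens emitted by speculative decoding.",
    labelnames=labelnames)
counter_steps = Counter(
    name="vllm:spec_decode_num_steps",
    documentation="Total speculative decoding steps executed.",
    labelnames=labelnames)

# Per-iteration update (illustrative values):
counter_emitted_tokens.labels(model_name="llama").inc(37)
counter_steps.labels(model_name="llama").inc(10)
```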
maybe we could directly express the total emitted tokens, along with the number of steps? so that users could do whatever calculation they want with those counters?
+1
Thanks for the contribution! It would be great to have these metrics flowing through Prometheus!
the new boost_ratio would express more accurately how much the system benefits from spec infer, as there are cases where spec gives no proposal, e.g. no match in ngram, or seq_len + spec exceeding the model length. Furthermore, with the new dynamic spec coming in #4565, k would not be a constant, so we may need to accumulate the actual tokens emitted and compare against the number of steps.
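As a rough worked example of that definition (numbers invented for illustration):

```python
num_steps = 1000           # decoding iterations with spec decode enabled
num_emitted_tokens = 3200  # tokens actually emitted across those steps

# boost_ratio: tokens per step relative to the 1-token-per-step baseline.
boost_ratio = num_emitted_tokens / num_steps  # 3.2, i.e. ~3.2x fewer steps

# Because it only needs two accumulated counters, this definition stays
# well defined even when k varies per step (the dynamic spec case, #4565).
print(boost_ratio)
```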
Force-pushed from bf96a47 to a6bf575
@cadedaniel @robertgshaw2-neuralmagic
asking @LiuXiaoxuanPKU if she has bandwidth to review the PR. the approach looks good to me; concerns are (1) we should make sure the top-level metrics make sense to users (not just to us as developers), (2) the naming of the metrics collection seems weird
reviewed; cade + I are discussing a path fwd
Hi @robertgshaw2-neuralmagic @cadedaniel, how is it going with the spec-related metrics? Have we reached a conclusion on how to make it happen? ;)
thanks & sorry this slipped. I might have time tomorrow to finish review. cc @LiuXiaoxuanPKU and @comaniac who might have bandwidth.
I have bandwidth to review this now. The Prometheus stuff looks good to me; I have concerns about the metric definitions and how we collect them.
```python
# batch_size here may be 0, as there are cases
# where no proposal is generated
self.num_specs += output_with_bonus_tokens.size()[0]
```
Is this right? Even if there are no specs this can be nonzero.
For the ngram case, if we don't match anything, then there would be no proposal.
OK. These metrics should work for draft model as well.
```python
output_with_bonus_tokens = torch.cat(
    [output_with_bonus_tokens, non_spec_token_ids])
```
num_emitted_tokens will include non-spec tokens, making the metric useless for capturing spec system efficiency (100% efficiency is defined as every proposed token accepted plus the bonus token).
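For reference, a small sketch of the efficiency definition being cited here (names and values illustrative, not the PR's code):

```python
k = 4                           # fixed proposal length per step
num_steps = 1000
num_emitted_spec_tokens = 3200  # accepted draft tokens + bonus tokens only

# Each step can emit at most k accepted tokens plus 1 bonus token.
max_emitted = num_steps * (k + 1)                          # 5000
system_efficiency = num_emitted_spec_tokens / max_emitted  # 0.64

# Folding non-spec tokens into the numerator inflates it while the
# denominator stays fixed, which is the distortion flagged above.
print(system_efficiency)
```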
```python
accepted_token_ids = torch.cat(
    [accepted_token_ids, non_spec_token_ids])
```
btw this cat should stay here; the format required for the cat is determined here: original_indices = spec_indices + non_spec_indices
```python
@@ -162,6 +168,7 @@ def _collect_rejsample_metrics(

        return SpecDecodeWorkerMetrics(
            num_spec_tokens=k,
            num_specs=self._aggregate_num_specs,
```
we can calculate this with draft_tokens // k; we don't need to record it.
see get_max_num_emitted_tokens
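A sketch of that derivation, which only holds while k is constant per step (names and values illustrative):

```python
k = 4                              # fixed proposal length
aggregate_num_draft_tokens = 4000  # illustrative running total

# Every step that produced a proposal contributed exactly k draft tokens,
# so the speculative step count can be recovered without a new field:
num_specs = aggregate_num_draft_tokens // k  # 1000

# Caveat: once k becomes dynamic (#4565) this no longer holds and an
# explicit counter would be needed after all.
print(num_specs)
```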
And add a new boost_ratio metric used to directly show how much spec infer would help in saving decoding steps. Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Force-pushed from a6bf575 to 5306db8
@cadedaniel
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.