[V1][Metrics] Initial speculative decoding metrics #15151

Merged
merged 3 commits into vllm-project:main on Apr 1, 2025

Conversation

markmc
Member

@markmc markmc commented Mar 19, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 19, 2025
@markmc markmc force-pushed the metrics-v1-spec-decoding branch from cd3ecad to 4cac140 Compare March 20, 2025 18:10
@@ -557,6 +559,7 @@ def update_from_output(
# Otherwise, we ignore the sampler output for the request.
request.num_computed_tokens += num_tokens_scheduled
assert request.num_computed_tokens <= request.num_tokens
spec_decoding_stats.num_emitted_tokens += num_tokens_scheduled
Member Author

I'm a bit uncertain whether to include this in the count ... it matches what I was getting at the rejection sampler level, but ... I guess it depends on what we want the system efficiency metric to mean.

Will dig into it, but thoughts welcome!

Member Author

See commit 586f123

    [V1][Speculative Decoding] Properly account for ngram drafter
    
    When the ngram drafter proposes zero tokens, we should treat
    this as a proposal of num_spec_tokens which were all rejected.
    This will allow us to compare fairly across ngram vs draft
    models.
    
    Encode a zero-token proposal in ModelRunnerOutput as an
    empty list, whereas None means no speculating was done. This
    also requires Request.spec_token_ids empty list to be
    interpreted this way by the scheduler.
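
The accounting this commit message describes could be sketched as follows. This is an illustrative sketch only, not the actual vLLM scheduler code, and the helper name `count_draft_tokens` is hypothetical:

```python
from typing import Optional

def count_draft_tokens(spec_token_ids: Optional[list],
                       num_spec_tokens: int) -> int:
    """Sketch of the accounting described in commit 586f123.

    None means no speculation was attempted for this request, so
    nothing is counted. An empty list means the drafter proposed
    zero tokens, which is treated as num_spec_tokens proposed and
    all rejected, so acceptance rates compare fairly across the
    ngram drafter and draft models.
    """
    if spec_token_ids is None:
        return 0  # no speculation was done at all
    if not spec_token_ids:
        return num_spec_tokens  # zero-token proposal, all rejected
    return len(spec_token_ids)
```

Note that a later comment on this PR records that the simpler counting (only tokens actually proposed) was ultimately retained, so the empty-list branch above reflects this commit's intent rather than the merged behavior.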

@markmc markmc force-pushed the metrics-v1-spec-decoding branch 2 times, most recently from 7f4ecd7 to 4cac140 Compare March 21, 2025 20:25

mergify bot commented Mar 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 21, 2025
@markmc markmc force-pushed the metrics-v1-spec-decoding branch from 4cac140 to 5bb7953 Compare March 24, 2025 18:32
@mergify mergify bot added tpu Related to Google TPUs and removed needs-rebase labels Mar 24, 2025

mergify bot commented Mar 27, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added needs-rebase and removed tpu Related to Google TPUs labels Mar 27, 2025
@markmc markmc force-pushed the metrics-v1-spec-decoding branch from 5bb7953 to 478238d Compare March 28, 2025 18:12
@mergify mergify bot removed the needs-rebase label Mar 28, 2025
@markmc
Member Author

markmc commented Mar 28, 2025

I've rebased and pulled out a bunch of stuff from this PR to try and simplify the discussion:

  • I've removed the system efficiency calculation, along with the num_proposals tracking
  • I think we should remove num_emitted_tokens also since it's mainly there for the system efficiency calculation

On the topic of properly tracking num_accepted_tokens, using Lily's example:

When the draft model proposes tokens, it also assigns a probability to each token. For example, suppose the draft model proposes three tokens [t1, t2, t3] with associated probabilities [0.6, 0.7, 0.6]. The target model then evaluates these tokens and provides its own probabilities, say [0.7, 0.2, 0.8]. To give some intuition, the draft model is quite confident about t2 (0.7), but the target model assigns a much lower probability (0.2), suggesting it likely considers t2 incorrect.
After rejection sampling, V0 produces an accepted mask—for this example, [1, 0, 1]. This mask is computed purely based on the raw probabilities, without accounting for whether previous tokens were accepted. From a decoding perspective, only t1 should be considered accepted—we stop at the first rejected token (t2), and don’t proceed to t3. So the actual number of accepted tokens is 1.

This PR calculates num_accepted_tokens = 1 correctly, and it's clear why: because we're doing this in the scheduler, we have no insight into the probability assigned to t3 by the target model.
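
The distinction in the example above can be sketched with a small hypothetical helper (not vLLM code): the number of accepted tokens is the length of the accepted *prefix* of the mask, since decoding stops at the first rejection regardless of what the mask says afterwards:

```python
def accepted_prefix_len(accepted_mask: list) -> int:
    """Length of the leading run of accepted draft tokens.

    Tokens after the first rejection are never emitted, so they
    do not count as accepted even if the mask marks them 1.
    """
    n = 0
    for accepted in accepted_mask:
        if not accepted:
            break
        n += 1
    return n

# Lily's example: mask [1, 0, 1] -> only t1 is actually accepted.
```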

@markmc markmc changed the title [WIP][V1][Metrics] Speculative decoding metrics [V1][Metrics] Initial speculative decoding metrics Mar 31, 2025
@markmc markmc marked this pull request as ready for review March 31, 2025 13:37
@markmc
Member Author

markmc commented Mar 31, 2025

I've removed num_emitted_tokens too now, and added a scheduler test case

@markmc
Member Author

markmc commented Mar 31, 2025

Note also that, at @WoosukKwon's request, we retain the simplest behavior for ngram prompt lookup: for num_draft_tokens, we only count tokens that the drafter actually proposed, rather than, for example, counting num_speculative_tokens for each step. This means an acceptance rate of 0.5 would be common with ngram, but that acceptance rate probably yields nowhere near the speedup achieved by a draft model with an acceptance rate of 0.5.
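
As a rough sketch of how the two counters mentioned in this PR combine into the acceptance rate (the class and method names here are illustrative, not the exact vLLM implementation):

```python
class SpecDecodingStatsSketch:
    """Minimal accumulator for the two speculative decoding counters."""

    def __init__(self) -> None:
        self.num_draft_tokens = 0
        self.num_accepted_tokens = 0

    def observe(self, num_draft: int, num_accepted: int) -> None:
        # Called once per engine step with that step's counts.
        self.num_draft_tokens += num_draft
        self.num_accepted_tokens += num_accepted

    @property
    def acceptance_rate(self) -> float:
        # Guard against division by zero before any speculation runs.
        if self.num_draft_tokens == 0:
            return 0.0
        return self.num_accepted_tokens / self.num_draft_tokens
```

With the ngram counting policy above, a step where the drafter proposes nothing contributes nothing to either counter, which is why the reported rate can look high relative to the speedup actually achieved.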

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 31, 2025
@markmc
Member Author

markmc commented Mar 31, 2025

For reference, here is a mini design doc we've been using to discuss the above

markmc added 3 commits March 31, 2025 12:23
Fixes vllm-project#13990, part of vllm-project#10582

Omitting system efficiency for now.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Now just num_accepted_tokens, num_draft_tokens, and acceptance rate.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the metrics-v1-spec-decoding branch from 19b10b8 to 840f4ce Compare March 31, 2025 16:23
@markmc
Member Author

markmc commented Mar 31, 2025

Rebased to pull in CI fix from #15757

@WoosukKwon
Collaborator

Looks generally good to me. However, as we discussed offline, I think the draft & accepted number of tokens per position (0, 1, ..., num_speculative_tokens-1) would be very much useful to understand how well the draft model is doing. WDYT?

@markmc
Member Author

markmc commented Apr 1, 2025

Looks generally good to me. However, as we discussed offline, I think the draft & accepted number of tokens per position (0, 1, ..., num_speculative_tokens-1) would be very much useful to understand how well the draft model is doing. WDYT?

Absolutely. I think there are many more ideas to explore. I'd like to take this incrementally, though, and start by adding back the acceptance rate metric that we had in V0. This also lays the groundwork for adding future metrics.

Unless we think the acceptance rate metric (i.e. num_draft and num_accepted) itself is useless and we want to deprecate those too! I don't believe so, though. I think we will continue to want a single metric to compare performance over time and between models.

Collaborator

@WoosukKwon WoosukKwon left a comment

@markmc Got it. Looks good to me as the first step.

@WoosukKwon WoosukKwon merged commit a79cc68 into vllm-project:main Apr 1, 2025
32 checks passed
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Apr 2, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Labels
ready ONLY add when PR is ready to merge/full CI is needed v1
Development

Successfully merging this pull request may close these issues.

[V1] [Spec Decode] Add metrics (acceptance rate, number of accepted tokens per step) for V1 rejection sampler
2 participants