[V1][Metrics] Initial speculative decoding metrics #15151

Merged
merged 3 commits into vllm-project:main on Apr 1, 2025

Conversation

markmc
Member

@markmc markmc commented Mar 19, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 19, 2025
@markmc markmc force-pushed the metrics-v1-spec-decoding branch from cd3ecad to 4cac140 Compare March 20, 2025 18:10
@@ -557,6 +559,7 @@ def update_from_output(
# Otherwise, we ignore the sampler output for the request.
request.num_computed_tokens += num_tokens_scheduled
assert request.num_computed_tokens <= request.num_tokens
spec_decoding_stats.num_emitted_tokens += num_tokens_scheduled
Member Author

I'm a bit uncertain whether to include this in the count ... it matches what I was getting at the rejection sampler level, but ... I guess it depends on what we want the system efficiency metric to mean.

Will dig into it, but thoughts welcome!

Member Author

See commit 586f123

    [V1][Speculative Decoding] Properly account for ngram drafter
    
    When the ngram drafter proposes zero tokens, we should treat
    this as a proposal of num_spec_tokens which were all rejected.
    This will allow us to compare fairly across ngram vs draft
    models.
    
    Encode a zero-token proposal in ModelRunnerOutput as an
    empty list, whereas None means no speculating was done. This
    also requires Request.spec_token_ids empty list to be
    interpreted this way by the scheduler.
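
The accounting this commit message describes could be sketched as follows. This is an illustrative sketch only, not the actual vLLM scheduler code, and the helper name `count_draft_tokens` is hypothetical:

```python
from typing import Optional

def count_draft_tokens(spec_token_ids: Optional[list],
                       num_spec_tokens: int) -> int:
    """Sketch of the accounting described in commit 586f123.

    None means no speculation was attempted for this request, so
    nothing is counted. An empty list means the drafter proposed
    zero tokens, which is treated as num_spec_tokens proposed and
    all rejected, so acceptance rates compare fairly across the
    ngram drafter and draft models.
    """
    if spec_token_ids is None:
        return 0  # no speculation was done at all
    if not spec_token_ids:
        return num_spec_tokens  # zero-token proposal, all rejected
    return len(spec_token_ids)
```

Note that a later comment on this PR records that the simpler counting (only tokens actually proposed) was ultimately retained, so the empty-list branch above reflects this commit's intent rather than the merged behavior.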

@markmc markmc force-pushed the metrics-v1-spec-decoding branch 2 times, most recently from 7f4ecd7 to 4cac140 Compare March 21, 2025 20:25

mergify bot commented Mar 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 21, 2025
@markmc markmc force-pushed the metrics-v1-spec-decoding branch from 4cac140 to 5bb7953 Compare March 24, 2025 18:32
@mergify mergify bot added tpu Related to Google TPUs and removed needs-rebase labels Mar 24, 2025

mergify bot commented Mar 27, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added needs-rebase and removed tpu Related to Google TPUs labels Mar 27, 2025
@markmc markmc force-pushed the metrics-v1-spec-decoding branch from 5bb7953 to 478238d Compare March 28, 2025 18:12
@mergify mergify bot removed the needs-rebase label Mar 28, 2025
@markmc
Member Author

markmc commented Mar 28, 2025

I've rebased and pulled out a bunch of stuff from this PR to try and simplify the discussion:

  • I've removed the system efficiency calculation, along with the num_proposals tracking
  • I think we should remove num_emitted_tokens also since it's mainly there for the system efficiency calculation

On the topic of properly tracking num_accepted_tokens, using Lily's example:

When the draft model proposes tokens, it also assigns a probability to each token. For example, suppose the draft model proposes three tokens [t1, t2, t3] with associated probabilities [0.6, 0.7, 0.6]. The target model then evaluates these tokens and provides its own probabilities, say [0.7, 0.2, 0.8]. To give some intuition, the draft model is quite confident about t2 (0.7), but the target model assigns a much lower probability (0.2), suggesting it likely considers t2 incorrect.
After rejection sampling, V0 produces an accepted mask—for this example, [1, 0, 1]. This mask is computed purely based on the raw probabilities, without accounting for whether previous tokens were accepted. From a decoding perspective, only t1 should be considered accepted—we stop at the first rejected token (t2), and don’t proceed to t3. So the actual number of accepted tokens is 1.

This PR calculates num_accepted_tokens = 1 correctly, and it's clear why: because we're doing this in the scheduler, we have no insight into the probability assigned to t3 by the target model.
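
The distinction in the example above can be sketched with a small hypothetical helper (not vLLM code): the number of accepted tokens is the length of the accepted *prefix* of the mask, since decoding stops at the first rejection regardless of what the mask says afterwards:

```python
def accepted_prefix_len(accepted_mask: list) -> int:
    """Length of the leading run of accepted draft tokens.

    Tokens after the first rejection are never emitted, so they
    do not count as accepted even if the mask marks them 1.
    """
    n = 0
    for accepted in accepted_mask:
        if not accepted:
            break
        n += 1
    return n

# Lily's example: mask [1, 0, 1] -> only t1 is actually accepted.
```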

@markmc markmc changed the title [WIP][V1][Metrics] Speculative decoding metrics [V1][Metrics] Initial speculative decoding metrics Mar 31, 2025
@markmc markmc marked this pull request as ready for review March 31, 2025 13:37
@markmc
Member Author

markmc commented Mar 31, 2025

I've removed num_emitted_tokens too now, and added a scheduler test case

@markmc
Member Author

markmc commented Mar 31, 2025

Note also that, at @WoosukKwon's request, we retain the simplest behavior for ngram prompt lookup: for num_draft_tokens, we only count tokens that the drafter actually proposed, rather than, for example, counting num_speculative_tokens for each step. This means an acceptance rate of 0.5 would be common with ngram, but that acceptance rate probably yields nowhere near the speedup achieved by a draft model with an acceptance rate of 0.5.
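
As a rough sketch of how the two counters mentioned in this PR combine into the acceptance rate (the class and method names here are illustrative, not the exact vLLM implementation):

```python
class SpecDecodingStatsSketch:
    """Minimal accumulator for the two speculative decoding counters."""

    def __init__(self) -> None:
        self.num_draft_tokens = 0
        self.num_accepted_tokens = 0

    def observe(self, num_draft: int, num_accepted: int) -> None:
        # Called once per engine step with that step's counts.
        self.num_draft_tokens += num_draft
        self.num_accepted_tokens += num_accepted

    @property
    def acceptance_rate(self) -> float:
        # Guard against division by zero before any speculation runs.
        if self.num_draft_tokens == 0:
            return 0.0
        return self.num_accepted_tokens / self.num_draft_tokens
```

With the ngram counting policy above, a step where the drafter proposes nothing contributes nothing to either counter, which is why the reported rate can look high relative to the speedup actually achieved.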

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 31, 2025
@markmc
Member Author

markmc commented Mar 31, 2025

For reference, here is a mini design doc we've been using to discuss the above

markmc added 3 commits March 31, 2025 12:23
Fixes vllm-project#13990, part of vllm-project#10582

Omitting system efficiency for now.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Now just num_accepted_tokens, num_draft_tokens, and acceptance rate.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the metrics-v1-spec-decoding branch from 19b10b8 to 840f4ce Compare March 31, 2025 16:23
@markmc
Member Author

markmc commented Mar 31, 2025

Rebased to pull in CI fix from #15757

@WoosukKwon
Collaborator

Looks generally good to me. However, as we discussed offline, I think the draft & accepted number of tokens per position (0, 1, ..., num_speculative_tokens-1) would be very much useful to understand how well the draft model is doing. WDYT?

@markmc
Member Author

markmc commented Apr 1, 2025

Looks generally good to me. However, as we discussed offline, I think the draft & accepted number of tokens per position (0, 1, ..., num_speculative_tokens-1) would be very much useful to understand how well the draft model is doing. WDYT?

Absolutely. I think there are many more ideas to explore. I'd like to take this incrementally, though, and start by adding back the acceptance rate metric that we had in V0. This also lays the groundwork for adding future metrics.

Unless we think the acceptance rate metric (i.e. num_draft and num_accepted) itself is useless and we want to deprecate those too! I don't believe so, though. I think we will continue to want a single metric to compare performance over time and between models.

Collaborator

@WoosukKwon WoosukKwon left a comment

@markmc Got it. Looks good to me as the first step.

@WoosukKwon WoosukKwon merged commit a79cc68 into vllm-project:main Apr 1, 2025
32 checks passed
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Apr 2, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Labels
ready ONLY add when PR is ready to merge/full CI is needed v1
Development

Successfully merging this pull request may close these issues.

[V1] [Spec Decode] Add metrics (acceptance rate, number of accepted tokens per step) for V1 rejection sampler
2 participants