[V1][Metrics] Initial speculative decoding metrics #15151
Conversation
Force-pushed from cd3ecad to 4cac140
vllm/v1/core/scheduler.py (outdated)
@@ -557,6 +559,7 @@ def update_from_output(
     # Otherwise, we ignore the sampler output for the request.
     request.num_computed_tokens += num_tokens_scheduled
     assert request.num_computed_tokens <= request.num_tokens
+    spec_decoding_stats.num_emitted_tokens += num_tokens_scheduled
I'm a bit uncertain whether to include this in the count ... it matches what I was getting at the rejection sampler level, but ... I guess it depends what we want the system efficiency metric to mean.
Will dig into it, but thoughts welcome!
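For context, here is a rough sketch of what a system efficiency metric could look like, modeled on the V0-era definition (the exact V0 formula is an assumption on my part, not something this PR specifies): emitted tokens divided by the maximum number of tokens the system could have emitted.

```python
# A rough sketch of one way to define "system efficiency", modeled on the
# V0-era metric. The exact definition is an assumption here, not taken
# from this PR.
def system_efficiency(num_emitted_tokens: int,
                      num_spec_steps: int,
                      num_spec_tokens: int) -> float:
    # Each speculative step can emit at most num_spec_tokens draft tokens
    # plus one bonus token from the target model.
    max_emitted = num_spec_steps * (num_spec_tokens + 1)
    return num_emitted_tokens / max_emitted if max_emitted else 0.0
```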
See commit 586f123
[V1][Speculative Decoding] Properly account for ngram drafter
When the ngram drafter proposes zero tokens, we should treat
this as a proposal of num_spec_tokens which were all rejected.
This will allow us to compare fairly across ngram vs draft
models.
Encode a zero-token proposal in ModelRunnerOutput as an
empty list, whereas None means no speculating was done. This
also requires an empty Request.spec_token_ids list to be
interpreted this way by the scheduler.
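To illustrate the None-vs-empty-list convention the commit message describes, here is a hypothetical sketch; the helper name and signature are made up for illustration and are not the actual vLLM scheduler code.

```python
# Hypothetical illustration of the None-vs-empty-list convention; names
# are illustrative, not the actual vLLM scheduler code.
from typing import Optional

def count_draft_tokens(spec_token_ids: Optional[list[int]],
                       num_spec_tokens: int) -> int:
    """Return how many draft tokens to charge against this step."""
    if spec_token_ids is None:
        # No speculation was attempted for this request.
        return 0
    if len(spec_token_ids) == 0:
        # The ngram drafter proposed zero tokens: count it as a full
        # proposal of num_spec_tokens, all rejected, so acceptance rates
        # compare fairly between ngram and draft-model speculation.
        return num_spec_tokens
    return len(spec_token_ids)
```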
Force-pushed from 7f4ecd7 to 4cac140
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 4cac140 to 5bb7953
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 5bb7953 to 478238d
I've rebased and pulled out a bunch of stuff from this PR to try to simplify the discussion:
On the topic of properly tracking …
This PR will calculate …
I've removed …
Note also, at @WoosukKwon's request, we retain the most simplistic behavior for ngram prompt lookup - for …
For reference, here is a mini design doc we've been using to discuss the above …
Fixes vllm-project#13990, part of vllm-project#10582. Omitting system efficiency for now. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Now just num_accepted_tokens, num_draft_tokens, and acceptance rate. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from 19b10b8 to 840f4ce
Rebased to pull in the CI fix from #15757
Looks generally good to me. However, as we discussed offline, I think the draft & accepted number of tokens per position (0, 1, ..., …
Absolutely. I think there are a lot more ideas to explore. I'd like to take this incrementally, though, and start by adding back the acceptance rate metric that we had in V0. This lays the groundwork for adding future metrics also.
Unless we think the acceptance rate metric (i.e. num_draft and num_accepted) itself is useless and we want to deprecate those too! I don't believe so, though. I think we will continue to want a single metric to compare performance over time and between models.
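As a sketch of what this boils down to: the acceptance rate is derived from two raw counters, num_draft_tokens and num_accepted_tokens. The metric names and the use of prometheus_client below are my assumptions for illustration, not necessarily what this PR exports.

```python
# Illustrative sketch of the two raw counters behind the acceptance rate;
# metric names are assumptions, not necessarily what this PR exports.
from prometheus_client import Counter

num_draft_tokens = Counter(
    "vllm_spec_decode_num_draft_tokens_total",
    "Total number of draft tokens proposed by the speculator.")
num_accepted_tokens = Counter(
    "vllm_spec_decode_num_accepted_tokens_total",
    "Total number of draft tokens accepted by the target model.")

def record_spec_decode_step(num_draft: int, num_accepted: int) -> None:
    # The acceptance rate itself is derived at query time, e.g. in PromQL:
    #   rate(num_accepted_tokens) / rate(num_draft_tokens)
    # so only the two raw counters need to be exported.
    num_draft_tokens.inc(num_draft)
    num_accepted_tokens.inc(num_accepted)
```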
@markmc Got it. Looks good to me as the first step.
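For the per-position follow-up floated above, a hypothetical sketch of what such counters could look like; this is a possible future increment, not part of this PR, and the class and method names are made up.

```python
# Hypothetical per-position counters for the follow-up idea discussed
# above; not part of this PR, names are illustrative only.
class PerPositionSpecDecodeStats:
    def __init__(self, num_spec_tokens: int) -> None:
        self.drafted = [0] * num_spec_tokens
        self.accepted = [0] * num_spec_tokens

    def record(self, num_drafted: int, num_accepted: int) -> None:
        # With rejection sampling, acceptance is prefix-based: if
        # num_accepted == j, positions 0..j-1 were accepted and position
        # j (if it was drafted) was rejected.
        for i in range(num_drafted):
            self.drafted[i] += 1
        for i in range(num_accepted):
            self.accepted[i] += 1

    def acceptance_rate_per_position(self) -> list[float]:
        return [a / d if d else 0.0
                for a, d in zip(self.accepted, self.drafted)]
```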
Fixes #13990, part of #10582
cc @WoosukKwon @LiuXiaoxuanPKU @sroy745