
[V1][Metrics] Add API for accessing in-memory Prometheus metrics #17010


Open · wants to merge 7 commits into base: main

Conversation

@markmc (Member) commented Apr 22, 2025

The V0 LLM offline inference API exposes per-request metrics via RequestOutput.RequestMetrics. In V1, so far we have chosen to not track per-request metrics or implement this API.

All recent work implementing EAGLE has been using examples/offline_inference/eagle.py, which depends on these metrics to report an aggregated mean acceptance length.

See, e.g., the EAGLE3 PR #16937, which used a WIP implementation of the per-request metrics (#16367).

The proposal in this PR is to achieve the same aggregated view by using the Prometheus metrics already implemented for the online serving case. This means we automatically gain the new spec decoding metrics from #16665 for both offline and online inference.

This does not preclude us from implementing per-request metrics in V1 in the future if that proves to be important.

See also the spec decoding metrics design doc.
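
For illustration, here is a minimal sketch of how the proposed API could be used from offline inference. The `get_metrics()` accessor, the metric object shape, and the model/sampling settings are assumptions for the sake of the example; the metric names are taken from the discussion below.

```
from vllm import LLM, SamplingParams

# Illustrative sketch only: read aggregated spec decoding counters from the
# engine's in-memory Prometheus metrics after offline generation.
llm = LLM(model="facebook/opt-125m", disable_log_stats=False)
llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))

num_drafts = 0
num_accepted = 0
for metric in llm.get_metrics():  # hypothetical accessor added by this PR
    if metric.name == "vllm:spec_decode_num_drafts":
        num_drafts += metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens":
        num_accepted += metric.value

if num_drafts:
    # +1 for the bonus token, per the convention discussed below
    print(f"mean acceptance length: {1 + num_accepted / num_drafts:.2f}")
```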


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot commented Apr 29, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 29, 2025
@luyuzhe111 (Contributor) commented Apr 29, 2025

@markmc Can you please provide an example of how to compute acceptance length from the retrieved metrics in examples/offline_inference/eagle.py? is it just

acceptance_length = 1 + (
    metrics.get_value('vllm:spec_decode_num_accepted_tokens') /
    metrics.get_value('vllm:spec_decode_num_drafts')
)

additionally, I'm wondering why vllm:spec_decode_num_accepted_tokens_per_pos is a counter instead of a vector? how is it defined?

thanks again for the PR!

@markmc markmc force-pushed the metrics-v1-offline-api branch from 86757ac to ce88d7a Compare April 29, 2025 14:28
@mergify mergify bot removed the needs-rebase label Apr 29, 2025
@markmc (Member, Author) commented Apr 29, 2025

> @markmc Can you please provide an example of how to compute acceptance length from the retrieved metrics in examples/offline_inference/eagle.py? is it just
>
> acceptance_length = 1 + (
>     metrics.get_value('vllm:spec_decode_num_accepted_tokens') /
>     metrics.get_value('vllm:spec_decode_num_drafts')
> )
>
> additionally, I'm wondering why vllm:spec_decode_num_accepted_tokens_per_pos is a counter instead of a vector? how is it defined?
>
> thanks again for the PR!

All good questions, @luyuzhe111. In fact, I had already rebased onto main, tried to use num_accepted_tokens_per_pos via the API, and saw this deficiency!

Try the new version. I've adopted your suggestion of adding a Vector abstraction as well 👍
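
For context, the Vector abstraction exposes a per-position counter such as vllm:spec_decode_num_accepted_tokens_per_pos as a list of values rather than a single scalar. A rough sketch of the shape, purely illustrative and not the PR's actual class definition:

```
from dataclasses import dataclass, field


@dataclass
class Vector:
    """Illustrative shape only: a metric whose value is a list indexed by
    position, e.g. accepted tokens per speculative position."""
    name: str
    values: list[int] = field(default_factory=list)


v = Vector(name="vllm:spec_decode_num_accepted_tokens_per_pos", values=[5, 3, 1])
```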

@markmc added the speculative-decoding and ready labels Apr 29, 2025
@luyuzhe111 (Contributor) commented

Hi @markmc, appreciate the fast turn-around! the new version works like a charm. the only request is to add plus 1 to the mean acceptance length since one token will always be accepted. so mean acceptance length is essentially "average number of tokens generated per forward pass". cc @LiuXiaoxuanPKU

@markmc (Member, Author) commented Apr 30, 2025

> the only request is to add plus 1 to the mean acceptance length since one token will always be accepted. so mean acceptance length is essentially "average number of tokens generated per forward pass".

I don't think of the bonus/recovered token as "accepted", particularly in the context of the acceptance rate calculation - the proportion of drafts (speculated tokens) that are accepted

let's take the example from here

num_spec_tokens = 3

drafts:
- #1: 3 accepted
- #2: 1 accepted
- #3: 2 accepted
- #4: 2 accepted
- #5: 1 accepted

observe:
- num_drafts = 5
- num_draft_tokens = 15
- num_accepted_tokens = 9
- accepted_tokens_per_pos = [5, 3, 1]

compute:
- acceptance_rate = 9/15 = 0.6
- mean_acceptance_length = 1.8
- acceptance_probs_per_pos = [1.0, 0.6, 0.2]
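
Expressed as code (an illustrative sketch, not code from the PR), those figures come out as:

```
# Counters from the example above.
num_drafts = 5
num_draft_tokens = 15               # num_spec_tokens (3) * num_drafts (5)
num_accepted_tokens = 9
accepted_tokens_per_pos = [5, 3, 1]

acceptance_rate = num_accepted_tokens / num_draft_tokens       # 9/15 = 0.6
mean_acceptance_length = num_accepted_tokens / num_drafts      # 9/5  = 1.8
acceptance_probs_per_pos = [n / num_drafts for n in accepted_tokens_per_pos]
# [1.0, 0.6, 0.2]; with the +1 bonus-token convention the mean would be 2.8
```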

You want:

compute:
- acceptance_rate = 9/15 = 0.6
- mean_acceptance_length = 2.8
- acceptance_probs_per_pos = [1.0, 1.0, 0.6, 0.2]

Why? Got any references to show this being common practice? Thanks.

@luyuzhe111 (Contributor) commented Apr 30, 2025

Hi @markmc, as far as I know, all speculative decoding literature reporting acceptance length includes the bonus token, since this quantity aligns with "number of tokens generated per forward pass". The mean acceptance lengths reported in all EAGLE papers (EAGLE-1, EAGLE-2, EAGLE-3) include the +1 bonus token.

alternatively, maybe it's worth keeping the acceptance rate metrics as is but adding another metric for "number of tokens generated per forward pass"?

originally I was actually only suggesting that in examples/offline_inference/eagle.py we do

print(f"mean acceptance length: {1 + num_accepted / num_drafts:.2f}")

instead of

print(f"mean acceptance length: {num_accepted / num_drafts:.2f}")

since we have also been reporting the former quantity in various places, such as here.

@markmc (Member, Author) commented May 9, 2025

> Hi @markmc, as far as I know, all speculative decoding literature reporting acceptance length includes the bonus token, since this quantity aligns with "number of tokens generated per forward pass".

Ok, see #17908. Thanks!

@LiuXiaoxuanPKU LiuXiaoxuanPKU self-assigned this May 9, 2025
@LiuXiaoxuanPKU (Collaborator) left a comment


Thanks for the PR! Overall LGTM!
One concern/question I have is whether enabling log_stats for llm_engine could noticeably degrade performance, in your experience?

mergify bot commented May 9, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 9, 2025
@markmc markmc force-pushed the metrics-v1-offline-api branch from ce88d7a to d333226 Compare May 12, 2025 17:06
@mergify mergify bot added ci/build and removed needs-rebase labels May 12, 2025
@markmc (Member, Author) commented May 12, 2025

I've pushed an update that I'm not super happy with

To handle the case of DP, where we have multiple sets of metrics identified by engine_idx, I've had to do some nasty consolidation of Histogram and Vector data based on label sets. This will also allow us to expand in the future by adding other labels.
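
Roughly, the consolidation works like this: samples that share the same label set after dropping the per-engine label are merged into one aggregate value. A small sketch of the idea (function name and data shapes are illustrative, not the PR's actual code):

```
from collections import defaultdict


def consolidate(samples, drop_labels=("engine",)):
    """Merge metric samples that share the same labels once drop_labels
    (e.g. the per-DP-rank engine label) are ignored; matching values are summed."""
    merged = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        merged[key] += value
    return [(dict(key), value) for key, value in merged.items()]


# Two DP ranks reporting the same counter collapse into a single total.
samples = [({"engine": "0"}, 5.0), ({"engine": "1"}, 7.0)]
print(consolidate(samples))  # [({}, 12.0)]
```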

@WoosukKwon (Collaborator) commented

@markmc Is this PR waiting for review? Or is it in progress?

@markmc (Member, Author) commented May 13, 2025

> @markmc Is this PR waiting for review? Or is it in progress?

It is waiting for review

@LiuXiaoxuanPKU (Collaborator) commented

LGTM, @markmc could you just double check if the CI failure is related so that we can merge this PR?

@markmc (Member, Author) commented May 14, 2025

> LGTM, @markmc could you just double check if the CI failure is related so that we can merge this PR?

Yes, AFAICT all of these failures are happening on other PRs too

@WoosukKwon (Collaborator) commented

@markmc Can you please merge from main again?

@markmc markmc force-pushed the metrics-v1-offline-api branch from d333226 to 9b13125 Compare May 14, 2025 18:49
@markmc (Member, Author) commented May 14, 2025

> @markmc Can you please merge from main again?

Done. I don't think the rebase resolves any of the test failures, but I could be wrong

mergify bot commented May 14, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 14, 2025
@markmc markmc force-pushed the metrics-v1-offline-api branch from 9b13125 to 2e1d202 Compare May 14, 2025 21:22
@mergify mergify bot removed the needs-rebase label May 14, 2025
@markmc (Member, Author) commented May 15, 2025

Ok, the docs failure was a genuine - but hard-to-spot - issue with the PR

vllm/docs/source/serving/engine_args.md:14: ERROR: Failed to import "_engine_args_parser" from "vllm.engine.arg_utils".
No module named 'prometheus_client'

markmc added 7 commits May 16, 2025 06:21
prometheus_client does this automatically:

```
def _build_full_name(metric_type, name, namespace, subsystem, unit):
    ...
    if metric_type == 'counter' and full_name.endswith('_total'):
        full_name = full_name[:-6]  # Munge to OpenMetrics.
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
In the case of DP, we will have a complete set of metrics for
each DP rank.

We could make get_metrics_snapshot() take a DP rank parameter
to avoid this, but it is possible in future we will add further
dimensions that we want to label on.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
docs/source/serving/engine_args.md:14: ERROR: Failed to import "_engine_args_parser" from "vllm.engine.arg_utils".
No module named 'prometheus_client'
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the metrics-v1-offline-api branch from 9a9d1e1 to ac92dde Compare May 16, 2025 10:21
Labels: ci/build, documentation, frontend, ready, speculative-decoding, v1