[V1][Metrics] Add API for accessing in-memory Prometheus metrics #17010
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add 🚀 …
This pull request has merge conflicts that must be resolved before it can be merged.
@markmc Can you please provide an example of how to compute acceptance length from the retrieved metrics in …? Additionally, I'm wondering why … Thanks again for the PR!
Force-pushed from 86757ac to ce88d7a
All good questions, @luyuzhe111. In fact, I had already rebased onto main and tried to use … Try the new version. I've adopted your suggestion of adding a …
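For reference, here is a rough sketch of the kind of usage being discussed. It assumes a `get_metrics_snapshot()` style method on the `LLM` object and that the snapshot exposes the spec-decode counters by their Prometheus names; the method name, return shape, and counter names below are assumptions about this PR's exact API, not a definitive interface.

```
from vllm import LLM, SamplingParams

# A speculative decoding config would be needed for the spec-decode counters
# to be non-zero; omitted here to keep the sketch short.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))

# Hypothetical: a snapshot of the in-memory Prometheus counters, keyed by metric name.
snapshot = llm.get_metrics_snapshot()

num_drafts = snapshot["vllm:spec_decode_num_drafts"]
num_draft_tokens = snapshot["vllm:spec_decode_num_draft_tokens"]
num_accepted = snapshot["vllm:spec_decode_num_accepted_tokens"]

acceptance_rate = num_accepted / num_draft_tokens       # fraction of drafted tokens accepted
mean_acceptance_length = 1 + num_accepted / num_drafts  # tokens generated per forward pass
print(acceptance_rate, mean_acceptance_length)
```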
Hi @markmc, appreciate the fast turn-around! The new version works like a charm. The only request is to add plus 1 to the mean acceptance length, since one token will always be accepted; mean acceptance length is then essentially the "average number of tokens generated per forward pass". cc @LiuXiaoxuanPKU
I don't think of the bonus/recovered token as "accepted", particularly in the context of the acceptance rate calculation, which is the proportion of drafts (speculated tokens) that are accepted. Let's take the example from here: …
You want: …
Why? Got any references to show this being common practice? Thanks.
Hi @markmc, as far as I know, all speculative decoding literature that reports acceptance length includes the bonus token, since this quantity aligns with "number of tokens generated per forward pass". The mean acceptance lengths reported in all the EAGLE papers (EAGLE-1, EAGLE-2, EAGLE-3) include the +1 bonus token. Alternatively, maybe it's worth keeping the acceptance rate metrics as-is but adding another metric for "number of tokens generated per forward pass"? Originally I was actually only suggesting … instead of …, since we have also been reporting the former quantity in various places, such as here.
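To make the two quantities being debated concrete, here is a small worked example with made-up counter values (the numbers are purely illustrative):

```
# Illustrative numbers only: 100 speculation steps with k=3 drafted tokens each.
num_drafts = 100            # number of forward passes with a draft
num_draft_tokens = 300      # total drafted tokens proposed
num_accepted_tokens = 180   # drafted tokens accepted by the target model

# Acceptance rate: proportion of drafted tokens that were accepted.
acceptance_rate = num_accepted_tokens / num_draft_tokens        # 0.6

# Accepted drafted tokens per step, without counting the bonus token.
accepted_per_step = num_accepted_tokens / num_drafts            # 1.8

# Mean acceptance length as reported in the EAGLE papers: the bonus/recovered
# token emitted by the target model is counted too, so this is the average
# number of tokens generated per forward pass.
mean_acceptance_length = 1 + num_accepted_tokens / num_drafts   # 2.8
```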
Thanks for the PR! Overall LGTM!
One concern/question I have is whether enabling log_stats for llm_engine could noticeably degrade performance, in your experience?
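For context, "enabling log_stats" here means not disabling the engine's stat logging when constructing the offline LLM. A minimal sketch, assuming the existing disable_log_stats engine argument is the relevant knob (an assumption about how this PR wires things up):

```
from vllm import LLM

# Stats (and therefore the in-memory Prometheus metrics) are only collected
# when stat logging is enabled; disable_log_stats=True would turn this off.
llm = LLM(model="facebook/opt-125m", disable_log_stats=False)
```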
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from ce88d7a to d333226
I've pushed an update that I'm not super happy with. To handle the case of DP, where we have multiple sets of metrics identified by …
@markmc Is this PR waiting for review? Or is it in progress?
It is waiting for review
LGTM, @markmc could you just double check if the CI failure is related so that we can merge this PR?
Yes, AFAICT all of these failures are happening on other PRs too
@markmc Can you please merge from main again?
Force-pushed from d333226 to 9b13125
Done. I don't think the rebase resolves any of the test failures, but I could be wrong.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 9b13125 to 2e1d202
Ok, the docs failure was a genuine, but hard-to-spot, issue with the PR.
prometheus_client does this automatically:

```
def _build_full_name(metric_type, name, namespace, subsystem, unit):
    ...
    if metric_type == 'counter' and full_name.endswith('_total'):
        full_name = full_name[:-6]  # Munge to OpenMetrics.
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
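As a small illustration of the prometheus_client behaviour quoted above (the metric name is made up for the example): a counter declared with a trailing `_total` keeps that suffix in the exposition format, but the metric family name seen when iterating the registry has it stripped.

```
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
c = Counter("vllm:example_requests_total", "Example counter", registry=registry)
c.inc()

# The collected metric family name has the "_total" suffix stripped ...
print([m.name for m in registry.collect()])   # ['vllm:example_requests']

# ... while the exposed sample still ends in "_total".
print(generate_latest(registry).decode())     # includes: vllm:example_requests_total 1.0
```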
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
In the case of DP, we will have a complete set of metrics for each DP rank. We could make get_metrics_snapshot() take a DP rank parameter to avoid this, but it is possible in future we will add further dimensions that we want to label on. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
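To illustrate the labelling approach described in this commit, here is a hypothetical sketch of how a caller could aggregate a labelled counter across DP ranks using plain prometheus_client; the "engine" label name and the metric name are assumptions, not necessarily what this PR uses.

```
from collections import defaultdict
from prometheus_client import REGISTRY

# Sum a (hypothetical) per-rank counter across all DP ranks, keeping a
# per-rank breakdown, by grouping samples on an assumed "engine" label.
per_rank = defaultdict(float)
total = 0.0
for metric in REGISTRY.collect():
    if metric.name != "vllm:spec_decode_num_accepted_tokens":
        continue
    for sample in metric.samples:
        if not sample.name.endswith("_total"):
            continue  # skip the auto-generated "_created" samples
        rank = sample.labels.get("engine", "0")
        per_rank[rank] += sample.value
        total += sample.value

print(dict(per_rank), total)
```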
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
docs/source/serving/engine_args.md:14: ERROR: Failed to import "_engine_args_parser" from "vllm.engine.arg_utils". No module named 'prometheus_client'
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from 9a9d1e1 to ac92dde
The V0 LLM offline inference API exposes per-request metrics via RequestOutput.RequestMetrics. In V1, so far we have chosen not to track per-request metrics or implement this API.

All recent work implementing EAGLE has been using examples/offline_inference/eagle.py, which depends on these metrics to report an aggregated mean acceptance length number. See e.g. the EAGLE3 PR #16937, which used a WIP implementation of the per-request metrics, #16367.

The proposal in this PR is to achieve the same aggregated view by using the Prometheus metrics already implemented for the online serving case. This means we automatically gain the new spec decoding metrics from #16665 for both offline and online inferencing.

This does not preclude us from implementing per-request metrics in V1 in the future if that proves to be important.
See also the spec decoding metrics design doc.
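A rough sketch of what the proposal looks like from the offline API, assuming the offline engine's stat loggers register their metrics with prometheus_client's default REGISTRY the same way the online server does (how exactly the metrics are surfaced is the subject of this PR, so treat the details as assumptions):

```
from prometheus_client import REGISTRY
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))

# Dump whatever vllm counters have been collected in-process so far.
for metric in REGISTRY.collect():
    if metric.name.startswith("vllm:"):
        for sample in metric.samples:
            print(sample.name, sample.labels, sample.value)
```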