
Commit 1876f9c

hmellor committed
Use myst links where possible
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
1 parent 21502e6 commit 1876f9c

1 file changed: docs/source/design/v1/metrics.md (61 additions, 62 deletions)
@@ -57,11 +57,11 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
 - `vllm:spec_decode_num_draft_tokens_total` (Counter)
 - `vllm:spec_decode_num_emitted_tokens_total` (Counter)
 
-These are documented under [Inferencing and Serving -> Production Metrics](https://docs.vllm.ai/en/stable/serving/metrics.html).
+These are documented under [Inferencing and Serving -> Production Metrics](project:../../serving/metrics.md).
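
For orientation, a quick way to eyeball what the `/metrics` endpoint exposes is to scrape it and filter for the `vllm:` prefix. A minimal sketch, assuming an OpenAI-compatible server is already running on `localhost:8000` (adjust the URL for your deployment):

```python
# Print the vLLM sample lines from the Prometheus-compatible /metrics endpoint.
# The URL is an assumption; point it at your own server.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("vllm:"):
            print(line)
```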
 
 ### Grafana Dashboard
 
-vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/getting_started/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
+vLLM also provides [a reference example](project:../../getting_started/examples/prometheus_grafana.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
 
 The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
 
@@ -80,15 +80,15 @@ The subset of metrics exposed in the Grafana dashboard gives us an indication of
 - `vllm:request_decode_time_seconds` - Requests Decode Time
 - `vllm:request_max_num_generation_tokens` - Max Generation Token in Sequence Group
 
-See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pull/2316) for interesting and useful background on the choices made here.
+See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here.
 
 ### Prometheus Client Library
 
-Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.
+Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs.
 
 ### Multi-process Mode
 
-In v0, metrics are collected in the engine core process and we use multi-process mode to make them available in the API server process. See [#7279](https://github.com/vllm-project/vllm/pull/7279).
+In v0, metrics are collected in the engine core process and we use multi-process mode to make them available in the API server process. See <gh-pr:7279>.
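
The mechanics behind this are prometheus_client's multi-process mode. A minimal sketch of the idea (not vLLM's actual wiring; the metric name and temp directory are made up for illustration):

```python
# Metrics recorded in one process (the engine core) are aggregated and exposed
# by another (the API server). PROMETHEUS_MULTIPROC_DIR must point at a shared,
# writable directory and be set before prometheus_client is imported.
import os
import tempfile

os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp()

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

# "Engine" side: values are written to per-process mmap-backed files.
demo_requests = Counter("demo_requests_total", "Demo counter, not a real vLLM metric")
demo_requests.inc()

# "API server" side: aggregate every per-process file and render the exposition.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```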
 
 ### Built in Python/Process Metrics
 
@@ -114,32 +114,32 @@ vLLM instance.
 
 For background, these are some of the relevant PRs which added the v0 metrics:
 
-- [#1890](https://github.com/vllm-project/vllm/pull/1890)
-- [#2316](https://github.com/vllm-project/vllm/pull/2316)
-- [#2730](https://github.com/vllm-project/vllm/pull/2730)
-- [#4464](https://github.com/vllm-project/vllm/pull/4464)
-- [#7279](https://github.com/vllm-project/vllm/pull/7279)
+- <gh-pr:1890>
+- <gh-pr:2316>
+- <gh-pr:2730>
+- <gh-pr:4464>
+- <gh-pr:7279>
 
-Also note the ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature where e.g. [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
+Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781).
 
 ## v1 Design
 
 ### v1 PRs
 
 For background, here are the relevant v1 PRs relating to the v1
-metrics issue #10582:
-
-- [#11962](https://github.com/vllm-project/vllm/pull/11962)
-- [#11973](https://github.com/vllm-project/vllm/pull/11973)
-- [#10907](https://github.com/vllm-project/vllm/pull/10907)
-- [#12416](https://github.com/vllm-project/vllm/pull/12416)
-- [#12478](https://github.com/vllm-project/vllm/pull/12478)
-- [#12516](https://github.com/vllm-project/vllm/pull/12516)
-- [#12530](https://github.com/vllm-project/vllm/pull/12530)
-- [#12561](https://github.com/vllm-project/vllm/pull/12561)
-- [#12579](https://github.com/vllm-project/vllm/pull/12579)
-- [#12592](https://github.com/vllm-project/vllm/pull/12592)
-- [#12644](https://github.com/vllm-project/vllm/pull/12644)
+metrics issue <gh-issue:10582>:
+
+- <gh-pr:11962>
+- <gh-pr:11973>
+- <gh-pr:10907>
+- <gh-pr:12416>
+- <gh-pr:12478>
+- <gh-pr:12516>
+- <gh-pr:12530>
+- <gh-pr:12561>
+- <gh-pr:12579>
+- <gh-pr:12592>
+- <gh-pr:12644>
 
 ### Metrics Collection
 
@@ -212,26 +212,26 @@ And the calculated intervals are:
 Put another way:
 
 ```text
-<< queued timestamp >>
-[ queue interval ]
-|
-| (possible preemptions)
-| << scheduled timestamp >>
-| << preempted timestamp >>
-| << scheduled timestamp >>
-| << new token timestamp (FIRST) >>
-| << new token timestamp >>
-| << new token timestamp >>
-| << preempted timestamp >>
-v
-<< scheduled timestamp >>
-[ prefill interval ]
-<< new token timestamp (FIRST) >>
-[ inter-token interval ]
-<< new token timestamp >>
-[ decode interval (relative to most recent first token time)
-[ inference interval (relative to most recent scheduled time)
-<< new token timestamp (FINISHED) >>
+<< queued timestamp >>
+[ queue interval ]
+|
+| (possible preemptions)
+| << scheduled timestamp >>
+| << preempted timestamp >>
+| << scheduled timestamp >>
+| << new token timestamp (FIRST) >>
+| << new token timestamp >>
+| << new token timestamp >>
+| << preempted timestamp >>
+v
+<< scheduled timestamp >>
+[ prefill interval ]
+<< new token timestamp (FIRST) >>
+[ inter-token interval ]
+<< new token timestamp >>
+[ decode interval (relative to most recent first token time) ]
+[ inference interval (relative to most recent scheduled time) ]
+<< new token timestamp (FINISHED) >>
 ```
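
To make the interval arithmetic concrete, here is a minimal sketch. The field names are invented for the example (they are not vLLM's actual stats attributes), and preemption is handled by taking `scheduled` and the FIRST-token time as the most recent such timestamps, as in the diagram:

```python
# Interval arithmetic for a single finished request, per the diagram above.
# `token_times` holds the new-token timestamps from the most recent FIRST token
# through the FINISHED token; all values are timestamps in seconds.
def request_intervals(queued: float, scheduled: float,
                      token_times: list[float]) -> dict:
    first_token, finished = token_times[0], token_times[-1]
    return {
        "queue_interval": scheduled - queued,
        "prefill_interval": first_token - scheduled,
        "inter_token_intervals": [
            later - earlier for earlier, later in zip(token_times, token_times[1:])
        ],
        "decode_interval": finished - first_token,    # relative to most recent first token time
        "inference_interval": finished - scheduled,   # relative to most recent scheduled time
    }
```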
 
 We explored the possibility of having the frontend calculate these
@@ -365,7 +365,7 @@ However, `prometheus_client` has [never supported Info metrics in
 multiprocessing
 mode](https://github.com/prometheus/client_python/pull/300) - for
 [unclear
-reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We
+reasons](gh-pr:7279#discussion_r1710417152). We
 simply use a `Gauge` metric set to 1 and
 `multiprocess_mode="mostrecent"` instead.
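
A minimal sketch of that workaround (the metric and label names here are invented for illustration, not vLLM's actual ones):

```python
# An Info-style metric expressed as a Gauge pinned to 1: the interesting data
# lives in the labels, and multiprocess_mode="mostrecent" keeps the value from
# the most recently updated process. Requires a recent prometheus_client.
from prometheus_client import Gauge

demo_config_info = Gauge(
    "demo_engine_config_info",
    "Demo config info; the value is always 1 and the data is in the labels",
    labelnames=["model", "dtype"],
    multiprocess_mode="mostrecent",
)
demo_config_info.labels(model="example-model", dtype="auto").set(1)
```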
 
@@ -391,7 +391,7 @@ Note that `multiprocess_mode="livemostrecent"` is used - the most
 recent metric is used, but only from currently running processes.
 
 This was added in
-[#9477](https://github.com/vllm-project/vllm/pull/9477) and there is
+<gh-pr:9477> and there is
 [at least one known
 user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If
 we revisit this design and deprecate the old metric, we should reduce
@@ -400,7 +400,7 @@ v0 also and asking this project to move to the new metric.
 
 ### Prefix Cache metrics
 
-The discussion in #10582 about adding prefix cache metrics yielded
+The discussion in <gh-issue:10582> about adding prefix cache metrics yielded
 some interesting points which may be relevant to how we approach
 future metrics.
 
@@ -437,11 +437,11 @@ suddenly (from their perspective) when it is removed, even if there is
 an equivalent metric for them to use.
 
 As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
-[deprecated](https://github.com/vllm-project/vllm/pull/2764) (with a
+[deprecated](gh-pr:2764) (with a
 comment in the code),
-[removed](https://github.com/vllm-project/vllm/pull/12383), and then
+[removed](gh-pr:12383), and then
 [noticed by a
-user](https://github.com/vllm-project/vllm/issues/13218).
+user](gh-issue:13218).
 
 In general:
 
@@ -458,20 +458,20 @@ In general:
 
 ### Unimplemented - `vllm:tokens_total`
 
-Added by #4464, but apparently never implemented. This can just be
+Added by <gh-pr:4464>, but apparently never implemented. This can just be
 removed.
 
 ### Duplicated - Queue Time
 
 The `vllm:time_in_queue_requests` Histogram metric was added by
-#9659 and its calculation is:
+<gh-pr:9659> and its calculation is:
 
 ```
 self.metrics.first_scheduled_time = now
 self.metrics.time_in_queue = now - self.metrics.arrival_time
 ```
 
-Two weeks later, #4464 added `vllm:request_queue_time_seconds` leaving
+Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving
 us with:
 
 ```
@@ -510,7 +510,7 @@ memory. This is also known as "KV cache offloading" and is configured
 with `--swap-space` and `--preemption-mode`.
 
 In v0, [VLLM has long supported beam
-search](https://github.com/vllm-project/vllm/issues/6226). The
+search](gh-issue:6226). The
 SequenceGroup encapsulated the idea of N Sequences which
 all shared the same prompt kv blocks. This enabled KV cache block
 sharing between requests, and copy-on-write to do branching. CPU
@@ -524,7 +524,7 @@ and the part of the prompt that was evicted can be recomputed.
 SequenceGroup was removed in V1, although a replacement will be
 required for "parallel sampling" (`n>1`). [Beam search was moved out of
 the core (in
-V0)](https://github.com/vllm-project/vllm/issues/8306). There was a
+V0)](gh-issue:8306). There was a
 lot of complex code for a very uncommon feature.
 
 In V1, with prefix caching being better (zero overhead) and therefore
@@ -539,8 +539,7 @@ Some v0 metrics are only relevant in the context of "parallel
 sampling". This is where the `n` parameter in a request is used to
 request multiple completions from the same prompt.
 
-As part of [adding parallel sampling support in
-#10980](https://github.com/vllm-project/vllm/pull/10980) we should
+As part of adding parallel sampling support in <gh-pr:10980> we should
 also add these metrics.
 
 - `vllm:request_params_n` (Histogram)
@@ -565,7 +564,7 @@ model and then validate those tokens with the larger model.
 - `vllm:spec_decode_num_draft_tokens_total` (Counter)
 - `vllm:spec_decode_num_emitted_tokens_total` (Counter)
 
-There is a PR under review (#12193) to add "prompt lookup (ngram)"
+There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
 speculative decoding to v1. Other techniques will follow. We should
 revisit the v0 metrics in this context.
 
@@ -589,7 +588,7 @@ see:
 Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
 - [Inference
 Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
-- #5041 and #12726.
+- <gh-issue:5041> and <gh-pr:12726>.
 
 This is a non-trivial topic. Consider this comment from Rob:
 
@@ -660,13 +659,13 @@ fall under the more general heading of "Observability".
 
 v0 has support for OpenTelemetry tracing:
 
-- Added by #4687
+- Added by <gh-pr:4687>
 - Configured with `--otlp-traces-endpoint` and
 `--collect-detailed-traces`
 - [OpenTelemetry blog
 post](https://opentelemetry.io/blog/2024/llm-observability/)
 - [User-facing
-docs](https://docs.vllm.ai/en/latest/getting_started/examples/opentelemetry.html)
+docs](project:../../getting_started/examples/opentelemetry.md)
 - [Blog
 post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
 - [IBM product
@@ -696,7 +695,7 @@ documentation for this option states:
 > use of possibly costly and or blocking operations and hence might
 > have a performance impact.
 
-The metrics were added by #7089 and show up in an OpenTelemetry trace
+The metrics were added by <gh-pr:7089> and show up in an OpenTelemetry trace
 as:
 
 ```
