
Commit 1876f9c

hmellor committed
Use myst links where possible
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
1 parent 21502e6 commit 1876f9c

1 file changed: docs/source/design/v1/metrics.md (61 additions, 62 deletions)
@@ -57,11 +57,11 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
 - `vllm:spec_decode_num_draft_tokens_total` (Counter)
 - `vllm:spec_decode_num_emitted_tokens_total` (Counter)
 
-These are documented under [Inferencing and Serving -> Production Metrics](https://docs.vllm.ai/en/stable/serving/metrics.html).
+These are documented under [Inferencing and Serving -> Production Metrics](project:../../serving/metrics.md).
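
For orientation, a quick way to eyeball what the `/metrics` endpoint exposes is to scrape it and filter for the `vllm:` prefix. A minimal sketch, assuming an OpenAI-compatible server is already running on `localhost:8000` (adjust the URL for your deployment):

```python
# Print the vLLM sample lines from the Prometheus-compatible /metrics endpoint.
# The URL is an assumption; point it at your own server.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("vllm:"):
            print(line)
```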
 
 ### Grafana Dashboard
 
-vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/getting_started/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
+vLLM also provides [a reference example](project:../../getting_started/examples/prometheus_grafana.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
 
 The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
 
@@ -80,15 +80,15 @@ The subset of metrics exposed in the Grafana dashboard gives us an indication of
 - `vllm:request_decode_time_seconds` - Requests Decode Time
 - `vllm:request_max_num_generation_tokens` - Max Generation Token in Sequence Group
 
-See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pull/2316) for interesting and useful background on the choices made here.
+See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here.
 
 ### Prometheus Client Library
 
-Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.
+Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs.
 
 ### Multi-process Mode
 
-In v0, metrics are collected in the engine core process and we use multi-process mode to make them available in the API server process. See [#7279](https://github.com/vllm-project/vllm/pull/7279).
+In v0, metrics are collected in the engine core process and we use multi-process mode to make them available in the API server process. See <gh-pr:7279>.
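
The mechanics behind this are prometheus_client's multi-process mode. A minimal sketch of the idea (not vLLM's actual wiring; the metric name and temp directory are made up for illustration):

```python
# Metrics recorded in one process (the engine core) are aggregated and exposed
# by another (the API server). PROMETHEUS_MULTIPROC_DIR must point at a shared,
# writable directory and be set before prometheus_client is imported.
import os
import tempfile

os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp()

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

# "Engine" side: values are written to per-process mmap-backed files.
demo_requests = Counter("demo_requests_total", "Demo counter, not a real vLLM metric")
demo_requests.inc()

# "API server" side: aggregate every per-process file and render the exposition.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```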
 
 ### Built in Python/Process Metrics
 
@@ -114,32 +114,32 @@ vLLM instance.
 
 For background, these are some of the relevant PRs which added the v0 metrics:
 
-- [#1890](https://github.com/vllm-project/vllm/pull/1890)
-- [#2316](https://github.com/vllm-project/vllm/pull/2316)
-- [#2730](https://github.com/vllm-project/vllm/pull/2730)
-- [#4464](https://github.com/vllm-project/vllm/pull/4464)
-- [#7279](https://github.com/vllm-project/vllm/pull/7279)
+- <gh-pr:1890>
+- <gh-pr:2316>
+- <gh-pr:2730>
+- <gh-pr:4464>
+- <gh-pr:7279>
 
-Also note the ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature where e.g. [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
+Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781).
 
 ## v1 Design
 
 ### v1 PRs
 
 For background, here are the relevant v1 PRs relating to the v1
-metrics issue #10582:
-
-- [#11962](https://github.com/vllm-project/vllm/pull/11962)
-- [#11973](https://github.com/vllm-project/vllm/pull/11973)
-- [#10907](https://github.com/vllm-project/vllm/pull/10907)
-- [#12416](https://github.com/vllm-project/vllm/pull/12416)
-- [#12478](https://github.com/vllm-project/vllm/pull/12478)
-- [#12516](https://github.com/vllm-project/vllm/pull/12516)
-- [#12530](https://github.com/vllm-project/vllm/pull/12530)
-- [#12561](https://github.com/vllm-project/vllm/pull/12561)
-- [#12579](https://github.com/vllm-project/vllm/pull/12579)
-- [#12592](https://github.com/vllm-project/vllm/pull/12592)
-- [#12644](https://github.com/vllm-project/vllm/pull/12644)
+metrics issue <gh-issue:10582>:
+
+- <gh-pr:11962>
+- <gh-pr:11973>
+- <gh-pr:10907>
+- <gh-pr:12416>
+- <gh-pr:12478>
+- <gh-pr:12516>
+- <gh-pr:12530>
+- <gh-pr:12561>
+- <gh-pr:12579>
+- <gh-pr:12592>
+- <gh-pr:12644>
 
 ### Metrics Collection
 
@@ -212,26 +212,26 @@ And the calculated intervals are:
 Put another way:
 
 ```text
-<< queued timestamp >>
-[ queue interval ]
-|
-| (possible preemptions)
-| << scheduled timestamp >>
-| << preempted timestamp >>
-| << scheduled timestamp >>
-| << new token timestamp (FIRST) >>
-| << new token timestamp >>
-| << new token timestamp >>
-| << preempted timestamp >>
-v
-<< scheduled timestamp >>
-[ prefill interval ]
-<< new token timestamp (FIRST) >>
-[ inter-token interval ]
-<< new token timestamp >>
-[ decode interval (relative to most recent first token time)
-[ inference interval (relative to most recent scheduled time)
-<< new token timestamp (FINISHED) >>
+<< queued timestamp >>
+[ queue interval ]
+|
+| (possible preemptions)
+| << scheduled timestamp >>
+| << preempted timestamp >>
+| << scheduled timestamp >>
+| << new token timestamp (FIRST) >>
+| << new token timestamp >>
+| << new token timestamp >>
+| << preempted timestamp >>
+v
+<< scheduled timestamp >>
+[ prefill interval ]
+<< new token timestamp (FIRST) >>
+[ inter-token interval ]
+<< new token timestamp >>
+[ decode interval (relative to most recent first token time) ]
+[ inference interval (relative to most recent scheduled time) ]
+<< new token timestamp (FINISHED) >>
 ```
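
To make the interval arithmetic concrete, here is a minimal sketch. The field names are invented for the example (they are not vLLM's actual stats attributes), and preemption is handled by taking `scheduled` and the FIRST-token time as the most recent such timestamps, as in the diagram:

```python
# Interval arithmetic for a single finished request, per the diagram above.
# `token_times` holds the new-token timestamps from the most recent FIRST token
# through the FINISHED token; all values are timestamps in seconds.
def request_intervals(queued: float, scheduled: float,
                      token_times: list[float]) -> dict:
    first_token, finished = token_times[0], token_times[-1]
    return {
        "queue_interval": scheduled - queued,
        "prefill_interval": first_token - scheduled,
        "inter_token_intervals": [
            later - earlier for earlier, later in zip(token_times, token_times[1:])
        ],
        "decode_interval": finished - first_token,    # relative to most recent first token time
        "inference_interval": finished - scheduled,   # relative to most recent scheduled time
    }
```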
 
 We explored the possibility of having the frontend calculate these
@@ -365,7 +365,7 @@ However, `prometheus_client` has [never supported Info metrics in
 multiprocessing
 mode](https://github.com/prometheus/client_python/pull/300) - for
 [unclear
-reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We
+reasons](gh-pr:7279#discussion_r1710417152). We
 simply use a `Gauge` metric set to 1 and
 `multiprocess_mode="mostrecent"` instead.
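
A minimal sketch of that workaround (the metric and label names here are invented for illustration, not vLLM's actual ones):

```python
# An Info-style metric expressed as a Gauge pinned to 1: the interesting data
# lives in the labels, and multiprocess_mode="mostrecent" keeps the value from
# the most recently updated process. Requires a recent prometheus_client.
from prometheus_client import Gauge

demo_config_info = Gauge(
    "demo_engine_config_info",
    "Demo config info; the value is always 1 and the data is in the labels",
    labelnames=["model", "dtype"],
    multiprocess_mode="mostrecent",
)
demo_config_info.labels(model="example-model", dtype="auto").set(1)
```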
 
@@ -391,7 +391,7 @@ Note that `multiprocess_mode="livemostrecent"` is used - the most
 recent metric is used, but only from currently running processes.
 
 This was added in
-[#9477](https://github.com/vllm-project/vllm/pull/9477) and there is
+<gh-pr:9477> and there is
 [at least one known
 user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If
 we revisit this design and deprecate the old metric, we should reduce
@@ -400,7 +400,7 @@ v0 also and asking this project to move to the new metric.
 
 ### Prefix Cache metrics
 
-The discussion in #10582 about adding prefix cache metrics yielded
+The discussion in <gh-issue:10582> about adding prefix cache metrics yielded
 some interesting points which may be relevant to how we approach
 future metrics.
 
@@ -437,11 +437,11 @@ suddenly (from their perspective) when it is removed, even if there is
 an equivalent metric for them to use.
 
 As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
-[deprecated](https://github.com/vllm-project/vllm/pull/2764) (with a
+[deprecated](gh-pr:2764) (with a
 comment in the code),
-[removed](https://github.com/vllm-project/vllm/pull/12383), and then
+[removed](gh-pr:12383), and then
 [noticed by a
-user](https://github.com/vllm-project/vllm/issues/13218).
+user](gh-issue:13218).
 
 In general:
 
@@ -458,20 +458,20 @@ In general:
 
 ### Unimplemented - `vllm:tokens_total`
 
-Added by #4464, but apparently never implemented. This can just be
+Added by <gh-pr:4464>, but apparently never implemented. This can just be
 removed.
 
 ### Duplicated - Queue Time
 
 The `vllm:time_in_queue_requests` Histogram metric was added by
-#9659 and its calculation is:
+<gh-pr:9659> and its calculation is:
 
 ```
 self.metrics.first_scheduled_time = now
 self.metrics.time_in_queue = now - self.metrics.arrival_time
 ```
 
-Two weeks later, #4464 added `vllm:request_queue_time_seconds` leaving
+Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving
 us with:
 
 ```
@@ -510,7 +510,7 @@ memory. This is also known as "KV cache offloading" and is configured
 with `--swap-space` and `--preemption-mode`.
 
 In v0, [VLLM has long supported beam
-search](https://github.com/vllm-project/vllm/issues/6226). The
+search](gh-issue:6226). The
 SequenceGroup encapsulated the idea of N Sequences which
 all shared the same prompt kv blocks. This enabled KV cache block
 sharing between requests, and copy-on-write to do branching. CPU
@@ -524,7 +524,7 @@ and the part of the prompt that was evicted can be recomputed.
 SequenceGroup was removed in V1, although a replacement will be
 required for "parallel sampling" (`n>1`). [Beam search was moved out of
 the core (in
-V0)](https://github.com/vllm-project/vllm/issues/8306). There was a
+V0)](gh-issue:8306). There was a
 lot of complex code for a very uncommon feature.
 
 In V1, with prefix caching being better (zero overhead) and therefore
@@ -539,8 +539,7 @@ Some v0 metrics are only relevant in the context of "parallel
 sampling". This is where the `n` parameter in a request is used to
 request multiple completions from the same prompt.
 
-As part of [adding parallel sampling support in
-#10980](https://github.com/vllm-project/vllm/pull/10980) we should
+As part of adding parallel sampling support in <gh-pr:10980> we should
 also add these metrics.
 
 - `vllm:request_params_n` (Histogram)
@@ -565,7 +564,7 @@ model and then validate those tokens with the larger model.
 - `vllm:spec_decode_num_draft_tokens_total` (Counter)
 - `vllm:spec_decode_num_emitted_tokens_total` (Counter)
 
-There is a PR under review (#12193) to add "prompt lookup (ngram)"
+There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
 speculative decoding to v1. Other techniques will follow. We should
 revisit the v0 metrics in this context.
 
@@ -589,7 +588,7 @@ see:
 Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
 - [Inference
 Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
-- #5041 and #12726.
+- <gh-issue:5041> and <gh-pr:12726>.
 
 This is a non-trivial topic. Consider this comment from Rob:
 
@@ -660,13 +659,13 @@ fall under the more general heading of "Observability".
 
 v0 has support for OpenTelemetry tracing:
 
-- Added by #4687
+- Added by <gh-pr:4687>
 - Configured with `--otlp-traces-endpoint` and
 `--collect-detailed-traces`
 - [OpenTelemetry blog
 post](https://opentelemetry.io/blog/2024/llm-observability/)
 - [User-facing
-docs](https://docs.vllm.ai/en/latest/getting_started/examples/opentelemetry.html)
+docs](project:../../getting_started/examples/opentelemetry.md)
 - [Blog
 post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
 - [IBM product
@@ -696,7 +695,7 @@ documentation for this option states:
 > use of possibly costly and or blocking operations and hence might
 > have a performance impact.
 
-The metrics were added by #7089 and show up in an OpenTelemetry trace
+The metrics were added by <gh-pr:7089> and show up in an OpenTelemetry trace
 as:
 
 ```
