| Metric | Type | Description | vLLM | Triton TensorRT-LLM | SGLang |
|---|---|---|---|---|---|
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` | `nv_trt_llm_request_metrics{request_type=waiting}` | `sglang:num_queue_reqs` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` | `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}` | `sglang:token_usage` |
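
These are the metric names the Endpoint Picker (EPP) scrapes from each model server. As a minimal sketch only — reusing the `--totalQueuedRequestsMetric` and `--kvCacheUsagePercentageMetric` flags shown in the SGLang section below, paired with the vLLM names from the table above — the same mapping could be written out explicitly for a vLLM server like this:

```
# Sketch: EPP args pairing each flag with the vLLM metric name from the table above.
- --totalQueuedRequestsMetric
- "vllm:num_requests_waiting"
- --kvCacheUsagePercentageMetric
- "vllm:gpu_cache_usage_perc"
```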
`site-src/implementations/model-servers.md`
# Supported Model Servers
Any model server that conforms to the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol) is supported by the inference extension.

| Model Server | Version | Commit | Notes |
|---|---|---|---|
| vLLM V0 | v0.6.4 and above |[commit 0ad216f](https://github.com/vllm-project/vllm/commit/0ad216f5750742115c686723bf38698372d483fd)||
| vLLM V1 | v0.8.0 and above |[commit bc32bc7](https://github.com/vllm-project/vllm/commit/bc32bc73aad076849ac88565cff745b01b17d89c)||
| Triton(TensorRT-LLM) |[25.03](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-03.html#rel-25-03) and above |[commit 15cb989](https://github.com/triton-inference-server/tensorrtllm_backend/commit/15cb989b00523d8e92dce5165b9b9846c047a70d). | LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in Triton yet. [Feature request](https://github.com/triton-inference-server/server/issues/8181)|
| SGLang | v0.4.0 and above | [commit 1929c06](https://github.com/sgl-project/sglang/commit/1929c067625089c9c3c04321578f450275f24041) | Set `--enable-metrics` on the model server (see the sketch after this table). LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in SGLang yet. |
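
The `--enable-metrics` requirement applies to the SGLang server itself, not to the EPP. As an illustration only — the container name, image, and remaining args below are placeholders, not values taken from this page — an SGLang serving container with metrics enabled might look like:

```
# Sketch only: everything except --enable-metrics is a placeholder.
containers:
- name: sglang-server        # placeholder container name
  image: <sglang image>      # placeholder image
  args:
  - --enable-metrics         # exposes the sglang:* metrics listed in the table above
  # ... model path, port, and other serving flags go here
```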
## vLLM
Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the…

```
- --lora-info-metric
- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
```
## SGLang
### Edit EPP deployment yaml
Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32):
```
- --totalQueuedRequestsMetric
- "sglang:num_queue_reqs"
- --kvCacheUsagePercentageMetric
- "sglang:token_usage"
- --lora-info-metric
- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by SGLang yet.
0 commit comments
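
For orientation, this is roughly where those flags end up in the linked manifest; the surrounding fields are abbreviated and the container name and image are placeholders, not values taken from `inferencepool-resources.yaml`:

```
# Sketch only: abbreviated Deployment spec; name and image are placeholders.
spec:
  template:
    spec:
      containers:
      - name: epp            # placeholder
        image: <epp image>   # placeholder
        args:
        - --totalQueuedRequestsMetric
        - "sglang:num_queue_reqs"
        - --kvCacheUsagePercentageMetric
        - "sglang:token_usage"
        - --lora-info-metric
        - ""                 # disables LoRA metric scraping (not yet supported by SGLang)
```

Re-applying the edited Deployment rolls out the EPP with the new metric mappings.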