| Metric | Type | Description | vLLM | Triton TensorRT-LLM | SGLang |
|---|---|---|---|---|---|
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` | `nv_trt_llm_request_metrics{request_type=waiting}` | `sglang:num_queue_reqs` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` | `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}` | `sglang:token_usage` |
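
These are the metric names the Endpoint Picker (EPP) scrapes from each model server. As a minimal sketch only — reusing the `--totalQueuedRequestsMetric` and `--kvCacheUsagePercentageMetric` flags shown in the SGLang section below, paired with the vLLM names from the table above — the same mapping could be written out explicitly for a vLLM server like this:

```
# Sketch: EPP args pairing each flag with the vLLM metric name from the table above.
- --totalQueuedRequestsMetric
- "vllm:num_requests_waiting"
- --kvCacheUsagePercentageMetric
- "vllm:gpu_cache_usage_perc"
```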
`site-src/implementations/model-servers.md`
# Supported Model Servers
Any model server that conforms to the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol) is supported by the inference extension.

| Model Server | Version | Commit | Notes |
|---|---|---|---|
| vLLM V0 | v0.6.4 and above |[commit 0ad216f](https://github.com/vllm-project/vllm/commit/0ad216f5750742115c686723bf38698372d483fd)||
| vLLM V1 | v0.8.0 and above |[commit bc32bc7](https://github.com/vllm-project/vllm/commit/bc32bc73aad076849ac88565cff745b01b17d89c)||
| Triton(TensorRT-LLM) |[25.03](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-03.html#rel-25-03) and above |[commit 15cb989](https://github.com/triton-inference-server/tensorrtllm_backend/commit/15cb989b00523d8e92dce5165b9b9846c047a70d). | LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in Triton yet. [Feature request](https://github.com/triton-inference-server/server/issues/8181)|
| SGLang | v0.4.0 and above | [commit 1929c06](https://github.com/sgl-project/sglang/commit/1929c067625089c9c3c04321578f450275f24041) | Set `--enable-metrics` on the model server (see the sketch after this table). LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in SGLang yet. |
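
The `--enable-metrics` requirement applies to the SGLang server itself, not to the EPP. As an illustration only — the container name, image, and remaining args below are placeholders, not values taken from this page — an SGLang serving container with metrics enabled might look like:

```
# Sketch only: everything except --enable-metrics is a placeholder.
containers:
- name: sglang-server        # placeholder container name
  image: <sglang image>      # placeholder image
  args:
  - --enable-metrics         # exposes the sglang:* metrics listed in the table above
  # ... model path, port, and other serving flags go here
```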
## vLLM
Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the…

```
- --lora-info-metric
- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
```
## SGLang
### Edit EPP deployment yaml
Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32):
```
- --totalQueuedRequestsMetric
- "sglang:num_queue_reqs"
- --kvCacheUsagePercentageMetric
- "sglang:token_usage"
- --lora-info-metric
- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by SGLang yet.
0 commit comments
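
For orientation, this is roughly where those flags end up in the linked manifest; the surrounding fields are abbreviated and the container name and image are placeholders, not values taken from `inferencepool-resources.yaml`:

```
# Sketch only: abbreviated Deployment spec; name and image are placeholders.
spec:
  template:
    spec:
      containers:
      - name: epp            # placeholder
        image: <epp image>   # placeholder
        args:
        - --totalQueuedRequestsMetric
        - "sglang:num_queue_reqs"
        - --kvCacheUsagePercentageMetric
        - "sglang:token_usage"
        - --lora-info-metric
        - ""                 # disables LoRA metric scraping (not yet supported by SGLang)
```

Re-applying the edited Deployment rolls out the EPP with the new metric mappings.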