
Commit 92807d1

Update doc on sglang models support. (#1369)

* Added the documentation regarding how to configure the EPP for SGLang model servers.
* Fixed a typo.

1 parent 9976bd0 commit 92807d1

2 files changed: +20 -6 lines

docs/proposals/003-model-server-protocol/README.md

Lines changed: 4 additions & 4 deletions
@@ -21,10 +21,10 @@ effort.
 The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
 into the reference endpoint picker implementation.
 
-| Metric | Type | Description | vLLM metric | Triton TensorRT-LLM|
-| ----- | ---- | ---- | ---- | ---- |
-| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`|
-| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`|
+| Metric | Type | Description | vLLM metric | Triton TensorRT-LLM| SGLang |
+| ----- | ---- | ---- | ---- | ---- | ---- |
+| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs` |
+| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage` |
 
 
 ### LoRA Adapter Serving
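
Once SGLang is launched with metrics enabled (for example `python -m sglang.launch_server --model-path <model> --enable-metrics`; only `--enable-metrics` is named by this doc, the rest of the launch line depends on your setup), the two gauges added above are scraped from the model server's Prometheus `/metrics` endpoint. The sample below is an illustrative sketch of that scrape output; the `model_name` label and the values shown are assumptions, not copied from the SGLang source:

```
# Illustrative sample only; label set and values are assumptions.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
# TYPE sglang:token_usage gauge
sglang:token_usage{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.42
```

Note that `sglang:token_usage`, like `vllm:gpu_cache_usage_perc`, is understood to report a 0-1 fraction, so the "percentage" in the description is best read as a utilization fraction.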

site-src/implementations/model-servers.md

Lines changed: 16 additions & 2 deletions
@@ -1,5 +1,3 @@
-
-
 # Supported Model Servers
 
 Any model server that conforms to the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol) is supported by the inference extension.
@@ -11,6 +9,7 @@ Any model server that conforms to the [model server protocol](https://github.com/
 | vLLM V0 | v0.6.4 and above | [commit 0ad216f](https://github.com/vllm-project/vllm/commit/0ad216f5750742115c686723bf38698372d483fd) | |
 | vLLM V1 | v0.8.0 and above | [commit bc32bc7](https://github.com/vllm-project/vllm/commit/bc32bc73aad076849ac88565cff745b01b17d89c) | |
 | Triton(TensorRT-LLM) | [25.03](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-03.html#rel-25-03) and above | [commit 15cb989](https://github.com/triton-inference-server/tensorrtllm_backend/commit/15cb989b00523d8e92dce5165b9b9846c047a70d). | LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in Triton yet. [Feature request](https://github.com/triton-inference-server/server/issues/8181) |
+| SGLang | v0.4.0 and above | [commit 1929c06](https://github.com/sgl-project/sglang/commit/1929c067625089c9c3c04321578f450275f24041) | Set `--enable-metrics` on the model server. LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in SGLang yet. |
 
 ## vLLM
 
@@ -36,3 +35,18 @@ Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the [`i
 - --lora-info-metric
 - "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
 ```
+
+## SGLang
+
+### Edit the EPP deployment YAML
+
+Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32):
+
+```
+- --totalQueuedRequestsMetric
+- "sglang:num_queue_reqs"
+- --kvCacheUsagePercentageMetric
+- "sglang:token_usage"
+- --lora-info-metric
+- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by SGLang yet.
+```
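
To show where these flags land, here is a minimal sketch of the relevant slice of the EPP Deployment after the edit. The container name, image tag, and surrounding fields are assumptions for illustration; the manifest linked above is authoritative:

```
# Minimal sketch, assuming a container named "epp"; the image and field
# layout are illustrative, consult the linked manifest for the real values.
spec:
  template:
    spec:
      containers:
      - name: epp
        image: registry.k8s.io/gateway-api-inference-extension/epp:main  # assumed tag
        args:
        - --totalQueuedRequestsMetric
        - "sglang:num_queue_reqs"
        - --kvCacheUsagePercentageMetric
        - "sglang:token_usage"
        - --lora-info-metric
        - ""  # disable LoRA metric scraping; not supported by SGLang yet
```

Applying the edited manifest with `kubectl apply -f` triggers a normal rolling update of the EPP, after which it scrapes the SGLang metric names instead of the vLLM defaults.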
