Add SGLang deployment instructions #48
base: main
Conversation
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
cc @tugot17
> ## Installation
>
> Install SGLang following the [official installation guide](https://docs.sglang.io/get_started/install_sglang.html). The recommended method is:
Please specify the version required (>= 0.5.8).
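Following this suggestion, the install step could pin the minimum version explicitly; a minimal sketch, assuming SGLang is installed from PyPI with the `[all]` extra as in the official guide:

```shell
# Pin the minimum SGLang version flagged in review (>= 0.5.8)
pip install "sglang[all]>=0.5.8"
```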
>     --shm-size 32g \
>     -p 30000:30000 \
>     -v ~/.cache/huggingface:/root/.cache/huggingface \
>     --env "HF_TOKEN=<secret>" \
Why do we need the HF token?
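For context, a complete Docker invocation around the quoted flags might look like the following sketch; the image tag is an assumption, and per the question above `HF_TOKEN` is dropped on the assumption the model is publicly downloadable:

```shell
# Hypothetical full command around the quoted flags; image tag assumed
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path LiquidAI/LFM2.5-1.2B-Instruct \
        --host 0.0.0.0 \
        --port 30000
```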
>     --model-path LiquidAI/LFM2.5-1.2B-Instruct \
>     --host 0.0.0.0 \
>     --port 30000 \
>     --chunked-prefill-size -1
This should not be on by default.
>     --model LiquidAI/LFM2.5-1.2B-Instruct \
>     --host 0.0.0.0 \
>     --port 30000 \
>     --chunked-prefill-size -1
Same here: this should not be on by default.
> Start the SGLang server with the following command:
>
>     python3 -m sglang.launch_server \
`sglang serve` is simpler.
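The suggested shorter entry point, sketched under the assumption that the `sglang serve` CLI accepts the same host/port flags as `launch_server`:

```shell
# Equivalent launch via the shorter CLI (flag parity assumed)
sglang serve LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```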
> * `--chunked-prefill-size -1`: Disables chunked prefill for lower latency
>
> ### Ultra Low Latency on Blackwell (B300)
I think this could be a separate section overall: low latency, together with all of the flags.
> With this configuration, end-to-end latency can be as low as **180ms** per request. Benchmark results on a B300 GPU with CUDA 13:
>
>     ============ Serving Benchmark Result ============
I don't think we need all of these in the docs; we can compress this to: for a batch of size k, with l tokens in and m tokens out, we get this and this throughput.
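If the compressed summary should stay reproducible, the numbers could come from SGLang's serving benchmark; a sketch, where batch size and input/output lengths are illustrative values, not the ones used for the quoted results:

```shell
# Reproduce a compact throughput summary against a running server
# (prompt count and token lengths here are illustrative)
python3 -m sglang.bench_serving \
    --backend sglang \
    --host 127.0.0.1 --port 30000 \
    --dataset-name random \
    --num-prompts 64 \
    --random-input-len 128 \
    --random-output-len 128
```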
> </Accordion>
>
> ## Tool Calling
I think we can compress it; we can just show in the main example how to run it with the tool parser.
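A compressed main example could fold tool parsing into the launch command via SGLang's `--tool-call-parser` flag; the parser name below is a placeholder, since the correct value for LFM2.5 would need to be confirmed against the parser docs linked below:

```shell
# Launch with tool-call parsing enabled; parser name is a placeholder
python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser <parser-name>
```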
> For more details on tool parsing configuration, see the [SGLang Tool Parser documentation](https://docs.sglang.io/advanced_features/tool_parser.html).
>
> ## Vision Models
This is not yet officially supported (not merged into SGLang); we should drop it.
> @@ -0,0 +1,403 @@
> ---
We should also have a section on offline inference, I think.
> </Accordion>
>
> ## Chat Completions
This is standard SGLang, I think; a quick example of how to call the model with the OpenAI client and curl should be sufficient.
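For the curl half of that, a minimal call against the server's OpenAI-compatible endpoint could look like this (port taken from the launch command quoted earlier; the `max_tokens` value is arbitrary):

```shell
# Query the OpenAI-compatible chat endpoint of a server on port 30000
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2.5-1.2B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```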
> @@ -0,0 +1,403 @@
> ---
> title: "SGLang"
Another thing: to run this in a different precision, e.g. float16, you need to explicitly set the environment variable `SGLANG_MAMBA_CONV_DTYPE=float16`. We should mention this.
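That note could be shown as an explicit snippet; a sketch combining the environment variable from this comment with the launch flags quoted earlier (the `--dtype` flag is assumed to be the matching server-side setting):

```shell
# Run in float16: the Mamba conv dtype must be set explicitly (per review)
export SGLANG_MAMBA_CONV_DTYPE=float16
python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --dtype float16 \
    --host 0.0.0.0 \
    --port 30000
```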
No description provided.