Conversation

@vincentzed

No description provided.

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
@vincentzed vincentzed requested review from a team and Paulescu as code owners January 31, 2026 04:03
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
@vincentzed
Author

cc @tugot17


## Installation

Install SGLang following the [official installation guide](https://docs.sglang.io/get_started/install_sglang.html). The recommended method is:
specify the version required (>= 0.5.8)
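A sketch of what such a pin could look like (the `sglang[all]` extra follows the official install guide; the version bound is taken from this comment):

```shell
pip install "sglang[all]>=0.5.8"
```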

```bash
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
```
why do we need HF token?

```bash
--model-path LiquidAI/LFM2.5-1.2B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--chunked-prefill-size -1
```
this should not be on by default

```bash
--model LiquidAI/LFM2.5-1.2B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--chunked-prefill-size -1
```
same here, this should not be on by default

Start the SGLang server with the following command:

```bash
python3 -m sglang.launch_server \
```
sglang serve is simpler


* `--chunked-prefill-size -1`: Disables chunked prefill for lower latency
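Piecing together the fragments quoted above, the full launch command under review presumably looks like this (all flags are as quoted in this PR; nothing here is added beyond them):

```shell
python3 -m sglang.launch_server \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --chunked-prefill-size -1
```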

### Ultra Low Latency on Blackwell (B300)
I think this could be a separate section overall: low latency and all of the flags.

With this configuration, end-to-end latency can be as low as **180ms** per request. Benchmark results on a B300 GPU with CUDA 13:

```
============ Serving Benchmark Result ============
```

I don't think we need all of these in the docs; we can compress this to "for a batch of size k, with l tokens in and m tokens out, we get this and this throughput".
</Accordion>

## Tool Calling
I think we can compress it; we can just show, in the main section, how to run it with the tool parser.


For more details on tool parsing configuration, see the [SGLang Tool Parser documentation](https://docs.sglang.io/advanced_features/tool_parser.html).
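A launch sketch with tool parsing enabled, per the comment above (the `--tool-call-parser` flag is from the linked SGLang docs; the parser name below is a placeholder, since the correct parser for this model is not stated in this PR):

```shell
python3 -m sglang.launch_server \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser qwen25   # placeholder parser name; check the model docs
```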

## Vision Models
This is not yet officially supported (not merged into SGLang); we should drop it.

@@ -0,0 +1,403 @@
---
We should also have a section on offline inference, I think.
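A sketch of what such a section could show, using SGLang's offline `Engine` API (the API shape is assumed from the SGLang offline-inference docs; the model path is the one from this PR; running it requires `sglang` installed and a GPU):

```python
def generate_offline(prompts, model_path="LiquidAI/LFM2.5-1.2B-Instruct"):
    """Run batch generation in-process, without launching an HTTP server."""
    import sglang as sgl  # imported lazily: needs sglang and a GPU at call time

    llm = sgl.Engine(model_path=model_path)
    sampling_params = {"temperature": 0.7, "max_new_tokens": 64}
    outputs = llm.generate(prompts, sampling_params)
    llm.shutdown()
    return [out["text"] for out in outputs]
```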

```
</Accordion>

## Chat Completions
This is standard SGLang, I think. A quick example of how to call the model with the OpenAI client and with curl should be sufficient.
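A sketch of such an example, building the request body for the OpenAI-compatible endpoint that SGLang exposes (server address and model name are the ones quoted earlier in this PR):

```python
import json

# Send this body to the local SGLang server, e.g.:
#   curl http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @body.json
payload = {
    "model": "LiquidAI/LFM2.5-1.2B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
print(json.dumps(payload, indent=2))
```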

@@ -0,0 +1,403 @@
---
title: "SGLang"
Another thing: to run this in a different precision, e.g. float16, you need to explicitly set `export SGLANG_MAMBA_CONV_DTYPE=float16`. We should mention this.
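A sketch of the note being suggested (the environment variable is taken from this comment; pairing it with a `--dtype float16` server flag is an assumption):

```shell
# Run the Mamba conv cache in float16; set this explicitly when changing precision.
export SGLANG_MAMBA_CONV_DTYPE=float16
python3 -m sglang.launch_server \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --dtype float16 \
  --host 0.0.0.0 \
  --port 30000
```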
