Add SGLang deployment instructions #48
base: main
Conversation
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
cc @tugot17
> ## Installation
>
> Install SGLang following the [official installation guide](https://docs.sglang.io/get_started/install_sglang.html). The recommended method is:
Please specify the version required (>= 0.5.8).
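Following this suggestion, the install step could pin the minimum version explicitly; a minimal sketch, assuming SGLang is installed from PyPI with the `[all]` extra as in the official guide:

```shell
# Pin the minimum SGLang version flagged in review (>= 0.5.8)
pip install "sglang[all]>=0.5.8"
```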
>     --shm-size 32g \
>     -p 30000:30000 \
>     -v ~/.cache/huggingface:/root/.cache/huggingface \
>     --env "HF_TOKEN=<secret>" \
Why do we need the HF token?
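For context, a complete Docker invocation around the quoted flags might look like the following sketch; the image tag is an assumption, and per the question above `HF_TOKEN` is dropped on the assumption the model is publicly downloadable:

```shell
# Hypothetical full command around the quoted flags; image tag assumed
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path LiquidAI/LFM2.5-1.2B-Instruct \
        --host 0.0.0.0 \
        --port 30000
```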
>     --model-path LiquidAI/LFM2.5-1.2B-Instruct \
>     --host 0.0.0.0 \
>     --port 30000 \
>     --chunked-prefill-size -1
This should not be on by default.
>     --model LiquidAI/LFM2.5-1.2B-Instruct \
>     --host 0.0.0.0 \
>     --port 30000 \
>     --chunked-prefill-size -1
Same here: this should not be on by default.
> Start the SGLang server with the following command:
>
>     python3 -m sglang.launch_server \
`sglang serve` is simpler.
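The suggested shorter entry point, sketched under the assumption that the `sglang serve` CLI accepts the same host/port flags as `launch_server`:

```shell
# Equivalent launch via the shorter CLI (flag parity assumed)
sglang serve LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```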
> * `--chunked-prefill-size -1`: Disables chunked prefill for lower latency
>
> ### Ultra Low Latency on Blackwell (B300)
I think this could be a separate section overall: low latency, together with all of the flags.
> With this configuration, end-to-end latency can be as low as **180ms** per request. Benchmark results on a B300 GPU with CUDA 13:
>
>     ============ Serving Benchmark Result ============
I don't think we need all of these in the docs; we can compress this to: for a batch of size k, with l tokens in and m tokens out, we get this and this throughput.
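If the compressed summary should stay reproducible, the numbers could come from SGLang's serving benchmark; a sketch, where batch size and input/output lengths are illustrative values, not the ones used for the quoted results:

```shell
# Reproduce a compact throughput summary against a running server
# (prompt count and token lengths here are illustrative)
python3 -m sglang.bench_serving \
    --backend sglang \
    --host 127.0.0.1 --port 30000 \
    --dataset-name random \
    --num-prompts 64 \
    --random-input-len 128 \
    --random-output-len 128
```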
> </Accordion>
>
> ## Tool Calling
I think we can compress it; we can just show in the main example how to run it with the tool parser.
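A compressed main example could fold tool parsing into the launch command via SGLang's `--tool-call-parser` flag; the parser name below is a placeholder, since the correct value for LFM2.5 would need to be confirmed against the parser docs linked below:

```shell
# Launch with tool-call parsing enabled; parser name is a placeholder
python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser <parser-name>
```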
> For more details on tool parsing configuration, see the [SGLang Tool Parser documentation](https://docs.sglang.io/advanced_features/tool_parser.html).
>
> ## Vision Models
This is not yet officially supported (not merged into SGLang); we should drop it.
> @@ -0,0 +1,403 @@
> ---
We should also have a section on offline inference, I think.
> </Accordion>
>
> ## Chat Completions
This is standard SGLang, I think; a quick example of how to call the model with the OpenAI client and curl should be sufficient.
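For the curl half of that, a minimal call against the server's OpenAI-compatible endpoint could look like this (port taken from the launch command quoted earlier; the `max_tokens` value is arbitrary):

```shell
# Query the OpenAI-compatible chat endpoint of a server on port 30000
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2.5-1.2B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```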
> @@ -0,0 +1,403 @@
> ---
> title: "SGLang"
Another thing: to run this in a different precision, e.g. float16, you need to explicitly set the environment variable `SGLANG_MAMBA_CONV_DTYPE=float16`. We should mention this.
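That note could be shown as an explicit snippet; a sketch combining the environment variable from this comment with the launch flags quoted earlier (the `--dtype` flag is assumed to be the matching server-side setting):

```shell
# Run in float16: the Mamba conv dtype must be set explicitly (per review)
export SGLANG_MAMBA_CONV_DTYPE=float16
python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --dtype float16 \
    --host 0.0.0.0 \
    --port 30000
```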
No description provided.