Apply sgl w8a8 fp8 kernel #3148
Conversation
Let me bump a new sgl-kernel version to unblock this PR.
@HandH1998 What is the progress of this PR? Please let me know when it is ready.
@zhyncs two days later
@merrymercy @zhyncs
I also added a quantization config.
@HandH1998 #3493 has been merged.
@zhyncs update this to Line 47 in 96263f2.
Need to upload to PyPI?
The two failed CIs seem to be related to DSv3. I tried to reproduce them locally, but I can't find …
@HandH1998 You can give me the HF user name, or use DeepSeek V3/R1 for testing. I have also updated this, so if you wish to upgrade, please update this as well: sglang/scripts/ci_install_dependency.sh (Line 29 in 70866b6)
@HandH1998 Do you think we should support a similar API like …
The cutlass w8a8 fp8 kernel only supports per-token activation scales (with per-channel weight scales), so I only apply per_token_quant. The …
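For reference, here is a minimal PyTorch sketch of what per-token quantization does conceptually: one dynamic FP8 scale per token (row) of the activation matrix, which is the scale layout the cutlass GEMM consumes. The function name is illustrative only; the actual sgl-kernel op is a fused CUDA kernel.

```python
import torch

def per_token_quant_fp8(x: torch.Tensor):
    """Illustrative per-token dynamic FP8 (e4m3) quantization.

    Produces one scale per token (row). This is a reference sketch,
    not the fused sgl-kernel implementation.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    # Per-row max magnitude; clamp to avoid division by zero on all-zero rows.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10).float()
    scale = amax / fp8_max                                # shape: (num_tokens, 1)
    x_q = (x.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_q, scale
```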
My HF user name is HandH1998. |
Following #3047, we replace the w8a8 fp8 vllm kernel with the sgl-kernel one. Generally, the w8a8 fp8 sgl-kernel yields higher accuracy on gsm8k. On sm89 (L40), the w8a8 fp8 sgl-kernel delivers 14% higher throughput than the vllm kernel. On sm90 (H100), both kernels exhibit similar performance.
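For context, the replaced path computes roughly the math below. This is a hedged plain-PyTorch reference, assuming per-token activation scales and per-channel weight scales; the names (`w_q`, `w_scale`) are illustrative, and the real cutlass kernel performs the FP8 GEMM and scale application fused, without the upcasts shown here.

```python
from typing import Optional

import torch

def w8a8_fp8_linear_reference(x: torch.Tensor,
                              w_q: torch.Tensor,      # FP8 weight, (out, in)
                              w_scale: torch.Tensor,  # per-channel, (out,) or (out, 1)
                              bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Reference math for a w8a8 fp8 linear layer (not the fused kernel).

    1. Quantize activations per token to FP8 e4m3.
    2. Matrix-multiply, then undo both scales and add the bias.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10).float()
    x_scale = amax / fp8_max                              # per-token, (M, 1)
    x_q = (x.float() / x_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)

    # Emulate the FP8 GEMM by upcasting; the fused kernel multiplies in FP8
    # with a higher-precision accumulator and applies scales in the epilogue.
    out = (x_q.float() @ w_q.float().t()) * x_scale * w_scale.float().view(1, -1)
    if bias is not None:
        out = out + bias.float()
    return out.to(x.dtype)
```

Applying the scales in the GEMM epilogue, rather than dequantizing the inputs first, is what lets the kernel stay in FP8 for the multiply-accumulate; per-token scales also preserve each token's dynamic range better than a single per-tensor scale.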
Benchmark
model: neuralmagic/Meta-Llama-3-8B-Instruct-FP8
sm89-L40: [figures: gsm8k accuracy; throughput (tok/s) under various request rates]
sm90-H100: [figures: gsm8k accuracy; throughput (tok/s) under various request rates]
model: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic (activation dynamic quantization)
sm89-L40: [figures: gsm8k accuracy; throughput (tok/s) under various request rates]
sm90-H100: [figures: gsm8k accuracy; throughput (tok/s) under various request rates]