
opt flashinfer mla cat #5822


Open · wants to merge 5 commits into main from flashinfer_cat_opt
Conversation

@xu-yfei commented Apr 28, 2025

Motivation

Based on #5748 and #5638: for the flashinfer MLA backend, remove the q and k concatenations (torch.cat).
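A minimal sketch of the pattern being removed. The shapes and variable names here are illustrative, not taken from this PR: MLA splits the query into a "nope" part and a rotary "rope" part, and concatenating them with torch.cat allocates and copies a new tensor on every forward pass. When the attention kernel can consume the two parts separately (as flashinfer's MLA interface allows), that copy can be dropped.

```python
import torch

# Illustrative shapes only (not the actual model config)
bs, heads, d_nope, d_rope = 16, 128, 512, 64
q_nope = torch.randn(bs, heads, d_nope)
q_rope = torch.randn(bs, heads, d_rope)

# Before: an extra allocation + copy per forward pass just to build q
q = torch.cat([q_nope, q_rope], dim=-1)

# After: pass q_nope and q_rope to the kernel separately; no cat needed.
# The concatenated view only exists for kernels that require it.
assert q.shape == (bs, heads, d_nope + d_rope)
```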

Accuracy

Accuracy: 0.951
Invalid: 0.000
Latency: 228.672 s
Output throughput: 554.173 token/s

Performance

main branch:

{"run_name": "default", "batch_size": 1, "input_len": 1024, "output_len": 1024, "latency": 14.4687, "output_throughput": 70.77, "overall_throughput": 141.55}

{"run_name": "default", "batch_size": 16, "input_len": 1024, "output_len": 1024, "latency": 28.8723, "output_throughput": 567.47, "overall_throughput": 1134.93}

{"run_name": "default", "batch_size": 32, "input_len": 1024, "output_len": 1024, "latency": 38.0349, "output_throughput": 861.52, "overall_throughput": 1723.05}

this PR:

{"run_name": "default", "batch_size": 1, "input_len": 1024, "output_len": 1024, "latency": 14.5066, "output_throughput": 70.59, "overall_throughput": 141.18}

{"run_name": "default", "batch_size": 16, "input_len": 1024, "output_len": 1024, "latency": 28.4372, "output_throughput": 576.15, "overall_throughput": 1152.29}

{"run_name": "default", "batch_size": 32, "input_len": 1024, "output_len": 1024, "latency": 37.2972, "output_throughput": 878.57, "overall_throughput": 1757.13}

Profile

Prefill

main branch: [profile screenshot]

this PR: [profile screenshot]

The cat kernel time drops from 47 us to 3 us.

Decode

main branch, bs=1, CUDA graph + torch compile: [profile screenshot]

this PR, bs=1, CUDA graph + torch compile: [profile screenshot]

With bs=1, CUDA graph, and torch compile, the timings are almost the same (torch compile already fuses the cat with other ops).

main branch, bs=1, CUDA graph without torch compile: [profile screenshot]

this PR, bs=1, CUDA graph without torch compile: [profile screenshot]

With bs=1 and CUDA graph but without torch compile, the cat time drops from 6~7 us to 1 us.

Modifications

  • Update the deepseek_v2 code to remove the q and k cat.
  • In flashinfer_mla_backend:
    ◦ Cat only in the ragged (prefill) path; no cat in the other cases.
    ◦ Use set_mla_kv_buffer when k_rope is not empty.
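The set_mla_kv_buffer path above can be sketched as follows. This is a toy stand-in, not the sglang implementation: the pool class, shapes, and method signature are assumptions; only the idea comes from the PR, namely writing k_nope and k_rope directly into their slices of a single cache buffer instead of doing torch.cat followed by a plain buffer write.

```python
import torch

class MlaKvPool:
    """Toy MLA KV cache: each token slot holds the compressed "nope"
    part and the small "rope" part in one contiguous row."""

    def __init__(self, num_tokens, d_nope, d_rope):
        self.d_nope = d_nope
        self.buf = torch.zeros(num_tokens, d_nope + d_rope)

    def set_mla_kv_buffer(self, loc, k_nope, k_rope):
        # Write both halves straight into their slices of the buffer,
        # so no intermediate torch.cat tensor is ever materialized.
        self.buf[loc, : self.d_nope] = k_nope
        self.buf[loc, self.d_nope :] = k_rope

# Usage: write two tokens at cache slots 0 and 3
pool = MlaKvPool(num_tokens=8, d_nope=512, d_rope=64)
loc = torch.tensor([0, 3])
pool.set_mla_kv_buffer(loc, torch.ones(2, 512), torch.full((2, 64), 2.0))
```

When k_rope is empty (the path where the rope part was already folded in), a plain buffer write suffices, which is why the PR only takes the fused path when k_rope is non-empty.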


@xu-yfei force-pushed the flashinfer_cat_opt branch from 4f25ae3 to 69840db on April 28, 2025 07:30
@lambert0312 (Contributor) commented:

I pulled the latest commit and did some experiments, and it seems to be consistent with the optimizations mentioned above.

4 participants