[Optimize] optimize mask_quant & swiglu #6222
Merged
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #6222 +/- ##
==========================================
Coverage ? 67.00%
==========================================
Files ? 385
Lines ? 51283
Branches ? 7998
==========================================
Hits ? 34362
Misses ? 14430
Partials ? 2491
Contributor Author
/re-run all-failed
K11OntheBoat previously approved these changes Jan 29, 2026
qingqing01 previously approved these changes Jan 30, 2026
1dc458d
qingqing01 approved these changes Jan 30, 2026
qingqing01 previously approved these changes Feb 2, 2026
a9846ea
yongqiangma approved these changes Feb 2, 2026
EmmonsCurse approved these changes Feb 2, 2026
fxyfxy777 added a commit to fxyfxy777/FastDeploy that referenced this pull request Feb 2, 2026
This reverts commit 2ada119.

K11OntheBoat pushed a commit that referenced this pull request Feb 3, 2026
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (#6222)" This reverts commit 2ada119.
* add block_size
* pre-commit

kesmeey pushed a commit to kesmeey/FastDeploy that referenced this pull request Feb 22, 2026
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant

kesmeey pushed a commit to kesmeey/FastDeploy that referenced this pull request Feb 22, 2026
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (PaddlePaddle#6222)" This reverts commit 2ada119.
* add block_size
* pre-commit

chang-wenbin pushed a commit to chang-wenbin/FastDeploy that referenced this pull request Mar 2, 2026
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant

chang-wenbin pushed a commit to chang-wenbin/FastDeploy that referenced this pull request Mar 2, 2026
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (PaddlePaddle#6222)" This reverts commit 2ada119.
* add block_size
* pre-commit

xiaoguoguo626807 pushed a commit to xiaoguoguo626807/FastDeploy that referenced this pull request May 7, 2026
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant

xiaoguoguo626807 pushed a commit to xiaoguoguo626807/FastDeploy that referenced this pull request May 7, 2026
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (PaddlePaddle#6222)" This reverts commit a33739d.
* add block_size
* pre-commit
Motivation
Modifications
from fastdeploy.model_executor.ops.gpu import group_swiglu_with_masked
from fastdeploy.model_executor.ops.gpu import masked_per_token_quant
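Fuse the two operators above into fused_mask_swiglu_fp8_quant.
Dropped fp16 support; for now there appears to be no call site that needs it.
Dropped support for int64 inputs; likewise, there is no demand for it.
Added support for the ue8m0 scenario.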
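As a rough illustration of what a fused masked SwiGLU plus per-token FP8 quantization computes, here is a minimal pure-Python reference sketch. It is not the kernel from this PR. The function names, the gate/up layout (gate in the first half of the last dimension), the per-block scale `amax / 448`, the round-up-to-power-of-two handling for ue8m0, and returning only the valid rows are all assumptions made for illustration. The sketch also only scales and clamps values to the FP8 E4M3 range; it does not perform the actual FP8 rounding.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_row(row):
    """Assumed layout: first half of the row is the gate, second half is up."""
    half = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:half], row[half:])]


def quant_block(block, use_ue8m0=False):
    """Scale one block into the FP8 E4M3 range and return (values, scale)."""
    amax = max(abs(v) for v in block)
    scale = max(amax, 1e-10) / FP8_E4M3_MAX
    if use_ue8m0:
        # Assumption: a UE8M0 scale holds only an exponent, so the scale is
        # rounded up to the next power of two.
        scale = 2.0 ** math.ceil(math.log2(scale))
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in block]
    return q, scale


def fused_reference(x, token_nums, block_size, use_ue8m0=False):
    """x is [group][token][2 * hidden]; token_nums[g] is the number of valid
    tokens in group g. Rows past token_nums[g] are treated as padding."""
    out, scales = [], []
    for group, n_valid in zip(x, token_nums):
        g_out, g_scales = [], []
        for row in group[:n_valid]:  # masked: padded rows are skipped
            act = swiglu_row(row)
            q_row, s_row = [], []
            for i in range(0, len(act), block_size):
                q, s = quant_block(act[i:i + block_size], use_ue8m0)
                q_row.extend(q)
                s_row.append(s)
            g_out.append(q_row)
            g_scales.append(s_row)
        out.append(g_out)
        scales.append(g_scales)
    return out, scales


# One group with two rows, of which only the first is valid.
x = [[[1.0, -2.0, 3.0, 4.0], [9.0, 9.0, 9.0, 9.0]]]
q, s = fused_reference(x, token_nums=[1], block_size=2)
```

Doing both steps in one pass means the intermediate SwiGLU output never has to be written out and read back, which is the usual motivation for this kind of fusion.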
Accuracy:
bd7b915
In this commit, the fused operator was tested and found to be bit-for-bit aligned with the pre-fusion operators.
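A bit-for-bit comparison is stricter than a tolerance-based one. The following sketch shows one way such a check could be written in plain Python; the helper names are hypothetical and this is not the unit test from the PR. It compares the packed float32 byte patterns, so 0.0 and -0.0 count as different even though they compare equal numerically.

```python
import struct


def float32_bits(values):
    """Pack a flat list of floats into little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def bitwise_equal(a, b):
    """True only when both lists have identical float32 bit patterns."""
    return len(a) == len(b) and float32_bits(a) == float32_bits(b)


same = bitwise_equal([1.0, 2.5], [1.0, 2.5])
signed_zero = bitwise_equal([0.0], [-0.0])
```

In a masked setting it would be natural to compare only the valid rows of each group, since the contents of padded rows may be undefined; that restriction is an assumption here, not something the PR states.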
Removed the mask_per_token_quant operator. The mask_swiglu operator is still called from other files (custom_ops/gpu_ops/moe/moe_ffn.cu, custom_ops/gpu_ops/moe/moe_expert_ffn_wint2.cu), so it is not removed for now.
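Performance conclusion. Test data:
self.group_num = 10
self.group_size = 2048
self.hidden_dim = 7168
self.block_size = 128
Each rank holds 10 experts, with the number of valid tokens in the range 0-512.
Speedup after replacement on H GPUs: about 1.6x
Speedup after replacement on B GPUs: about 2x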
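To make the test configuration above concrete, the sketch below derives a few quantities from it. How the valid token counts were generated is not stated in the PR; drawing one count per expert uniformly from 0-512 with a fixed seed is an assumption, as are the variable names.

```python
import random

group_num = 10      # experts per rank
group_size = 2048   # token slots per expert
hidden_dim = 7168
block_size = 128

# Assumption: one valid-token count per expert, drawn uniformly from 0..512.
rng = random.Random(0)
token_nums = [rng.randint(0, 512) for _ in range(group_num)]

# Number of quantization blocks (and therefore scales) per token.
blocks_per_token = hidden_dim // block_size

# Upper bound on the fraction of each expert's slots that hold valid tokens.
max_valid_fraction = 512 / group_size
```

With at most 512 valid tokens out of 2048 slots per expert, no more than a quarter of the rows carry real data, so skipping the padded rows accounts for a large share of the work that can be avoided.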
Usage or Command
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.