[Hackathon 9th No.91] Support INT8*INT8 for MoE GroupGEMM in FastDeploy #1164
Open
WanRui37 wants to merge 7 commits into PaddlePaddle:master from WanRui37:rfc_002
Changes from all commits (7 commits):
- 972d105 v1: Simply fill in the RFC (WanRui37)
- 19a1c75 v1: Simply fill in the RFC (WanRui37)
- c5b871c Merge branch 'PaddlePaddle:master' into rfc_002 (WanRui37)
- 5a34a4e v1: Simply fill in the RFC (WanRui37)
- 09b6eb3 v2: Added some design ideas (WanRui37)
- e235203 Merge branch 'PaddlePaddle:master' into rfc_002 (WanRui37)
- 754593c v2: Added some design ideas (WanRui37)
rfcs/FastDeploy/20251016_FastDeploy_add_moe_groupgemm_int8_int8.md (69 additions, 0 deletions)
# Support INT8*INT8 for MoE GroupGEMM in FastDeploy

| Proposal name | Support INT8*INT8 for MoE GroupGEMM in FastDeploy |
|---------------|---------------------------------------------------|
| Author | WanRui37 |
| Submission date | 2025-10-16 |
| Version | V1.1 |
| Paddle dependency | paddlepaddle-gpu==3.2.0 |
| File name | 20251016_FastDeploy_add_moe_groupgemm_int8_int8.md |

# 1. Overview

## 1.1 Background

Large-scale models have achieved remarkable results in natural language processing, computer vision, and other fields. Among them, the Mixture of Experts (MoE) architecture routes each input token to a subset of expert subnetworks, which improves model quality and computational efficiency. During MoE training and inference, GroupGEMM (Group General Matrix Multiply), which performs all the per-expert matrix multiplies as one batched operation, is the core computational step.
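To make that concrete, here is a minimal NumPy sketch of grouped-GEMM semantics (illustrative only, not part of the RFC's deliverables): each expert owns its own weight matrix, tokens are routed to experts, and the group sizes therefore vary per expert.

```python
import numpy as np

def grouped_gemm_reference(x, expert_ids, weights):
    """Reference semantics of an MoE grouped GEMM.

    x:          [num_tokens, hidden] activations
    expert_ids: [num_tokens] expert index assigned to each token
    weights:    [num_experts, hidden, out] one weight matrix per expert
    """
    num_tokens = x.shape[0]
    num_experts, _, out = weights.shape
    y = np.zeros((num_tokens, out), dtype=np.float32)
    for e in range(num_experts):
        rows = np.where(expert_ids == e)[0]  # tokens routed to expert e
        if rows.size:                        # group sizes differ per expert
            y[rows] = x[rows] @ weights[e]
    return y

# 8 tokens, hidden size 4, 3 experts, output size 5
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)).astype(np.float32)
ids = rng.integers(0, 3, size=8)
w = rng.standard_normal((3, 4, 5)).astype(np.float32)
print(grouped_gemm_reference(x, ids, w).shape)  # -> (8, 5)
```

A real grouped-GEMM kernel fuses this per-expert loop into a single launch over variable-sized problem groups; the loop above only defines the expected output.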
## 1.2 Goals

Develop a high-performance INT8*INT8 MoE GroupGEMM operator for FastDeploy and integrate it into open-source models such as EB and Qwen.

## 1.3 Significance

An INT8*INT8 MoE GroupGEMM implementation can make full use of the hardware's integer compute units, substantially reducing compute latency compared with higher-precision GEMMs and improving inference speed.
# 2. Current State of FastDeploy

- `FastDeploy`'s `MoE GroupGEMM` currently has no `INT8*INT8` implementation.

# 3. Survey of Existing Solutions

- No existing `MoE GroupGEMM` implementation in the community supports `INT8*INT8`; the grouped GEMM kernels referenced in Section 8, for example, target other precisions such as BF16.
# 4. Design and Implementation

1. Reference code paths
   1. For the `sm89` architecture:
      - `24_gemm_grouped`: `FastDeploy/custom_ops/third_party/cutlass/examples/24_gemm_grouped`
      - `64_ada_fp8_gemm_grouped`: `FastDeploy/custom_ops/third_party/cutlass/examples/64_ada_fp8_gemm_grouped`
      - The key building block is `cutlass::gemm::device::GemmGrouped`.
   2. For the `sm90` architecture:
      - `w4afp8_gemm`: `FastDeploy/custom_ops/gpu_ops/w4afp8_gemm`, where the key building block is `cute::gemm`
      - `57_hopper_grouped_gemm`: `FastDeploy/custom_ops/third_party/cutlass/examples/57_hopper_grouped_gemm`, where the key building block is `cutlass::gemm::device::GemmUniversalAdapter`
2. Code paths to add (the numerical contract the new kernel must satisfy is sketched below):
```text
custom_ops/
└── gpu_ops/                     # GPU custom operators
    ├── int8_gemm_with_cutlass/  # INT8*INT8 GEMM operator built on CUTLASS
    │   ├── w8a8_group_gemm.cu   # CUTLASS kernel implementation
    │   └── w8a8_group_gemm.h    # CUTLASS kernel header
    └── ...
test/
└── operators/
    ├── test_w8a8_group_gemm.py  # Tests for the INT8*INT8 grouped GEMM operator
    └── ...
```
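To pin down what "INT8*INT8" means for this kernel, the following is a hedged NumPy sketch of the expected numerics under assumed quantization conventions (per-token activation scales and per-output-channel weight scales; the actual FastDeploy conventions may differ): int8 activations and int8 weights are multiplied with int32 accumulation, as integer tensor cores do, and the result is dequantized in float.

```python
import numpy as np

def w8a8_group_gemm_reference(x_q, x_scale, w_q, w_scale, expert_ids):
    """Assumed reference numerics for an INT8*INT8 grouped GEMM.

    x_q:     [num_tokens, hidden] int8 quantized activations
    x_scale: [num_tokens] float32 per-token dequantization scales
    w_q:     [num_experts, hidden, out] int8 quantized expert weights
    w_scale: [num_experts, out] float32 per-output-channel scales
    """
    num_tokens = x_q.shape[0]
    num_experts, _, out = w_q.shape
    y = np.zeros((num_tokens, out), dtype=np.float32)
    for e in range(num_experts):
        rows = np.where(expert_ids == e)[0]
        if rows.size == 0:
            continue
        # int8 * int8 products accumulated in int32 (what integer tensor cores compute)
        acc = x_q[rows].astype(np.int32) @ w_q[e].astype(np.int32)
        # dequantize: per-token row scale times per-channel column scale
        y[rows] = acc.astype(np.float32) * x_scale[rows, None] * w_scale[e][None, :]
    return y
```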
# 5. Testing and Acceptance

- Add operator-level tests (a sketch follows).
- Validate numerical accuracy and performance on the EB and Qwen open-source models.
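A minimal shape for `test_w8a8_group_gemm.py`, reusing `w8a8_group_gemm_reference` from the sketch above, might look as follows; the custom-op call is left as a placeholder because the real binding name and signature will only be fixed by the implementation.

```python
import numpy as np

def test_w8a8_group_gemm():
    rng = np.random.default_rng(42)
    num_tokens, hidden, out, num_experts = 64, 128, 256, 4

    x_q = rng.integers(-128, 128, size=(num_tokens, hidden), dtype=np.int8)
    w_q = rng.integers(-128, 128, size=(num_experts, hidden, out), dtype=np.int8)
    x_scale = rng.uniform(0.01, 0.1, num_tokens).astype(np.float32)
    w_scale = rng.uniform(0.01, 0.1, (num_experts, out)).astype(np.float32)
    expert_ids = rng.integers(0, num_experts, size=num_tokens)

    expected = w8a8_group_gemm_reference(x_q, x_scale, w_q, w_scale, expert_ids)

    # Self-consistency: dequantize-then-GEMM must match the int32-accumulate path
    for e in range(num_experts):
        rows = np.where(expert_ids == e)[0]
        ref = (x_q[rows].astype(np.float32) * x_scale[rows, None]) @ \
              (w_q[e].astype(np.float32) * w_scale[e][None, :])
        np.testing.assert_allclose(expected[rows], ref, rtol=1e-4)

    # Placeholder for the real kernel; the actual FastDeploy binding may differ:
    # actual = w8a8_group_gemm(x_q, x_scale, w_q, w_scale, expert_ids)
    # np.testing.assert_allclose(actual, expected, rtol=1e-3, atol=1e-3)
```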
# 6. Impact

This integrates an INT8*INT8 `MoE GroupGEMM` operator into `FastDeploy`; no other components are affected.
# 7. Schedule

* 2025-10-16 ~ 2025-11-16: complete integration code development
* 2025-11-16 ~ 2025-11-25: complete code testing
* 2025-11-25 ~ 2025-12-01: complete deployment examples and documentation
# 8. References

[Accelerating MoEs with a Triton Persistent Cache-Aware Grouped GEMM Kernel](https://pytorch.org/blog/accelerating-moes-with-a-triton-persistent-cache-aware-grouped-gemm-kernel/)

[vLLM PR adding a BF16 Grouped GEMM kernel](https://github.com/vllm-project/vllm/pull/19443)
Review comment: For a quick implementation, you can refer to FastDeploy's existing wfp8afp8 Triton operator, and also consult the vLLM and TensorRT-LLM implementations. Both CUDA and Triton implementations are acceptable. On top of a completed operator, further operator fusion could be added (for example, fusing the shared expert layer in GLM4.5-AIR MoE).
Reply: Thank you!