[Iluvatar GPU] Optimize attention and MoE performance #3234
| class IluvatarWorker(WorkerBase): | ||
| class IluvatarWorker(GpuWorker): |
Why is the Iluvatar execution flow coupled to the GPU one?
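While upgrading from the 0630 version to the latest commit, we found that gpu_model_runner had changed substantially. The previously adapted iluvatar_model_runner was itself copied from gpu_model_runner; apart from importing a few different operators, the two were identical. With this change, future upgrades can reuse the gpu_model_runner flow directly and avoid repeating the copy work. If an incompatible flow appears, we will override the affected member function in iluvatar_model_runner so that it keeps working.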
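OK. The execution flow has been iterating quickly since 6.30; once the executor stabilizes, the executors for different hardware will need to be separated again.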
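To illustrate the approach discussed in this thread, here is a minimal sketch of reusing a shared runner flow through inheritance and overriding only the step that differs. The class and method names (`GpuRunnerSketch`, `IluvatarRunnerSketch`, `attention_step`, and so on) are hypothetical stand-ins, not the actual FastDeploy classes or their interfaces.

```python
from typing import List


class GpuRunnerSketch:
    """Hypothetical stand-in for the shared GPU runner."""

    def load_step(self) -> str:
        return "gpu:load"

    def attention_step(self) -> str:
        return "gpu:attention"

    def execute(self) -> List[str]:
        # Shared flow: subclasses inherit this method unchanged.
        return [self.load_step(), self.attention_step()]


class IluvatarRunnerSketch(GpuRunnerSketch):
    """Reuses the shared flow and overrides only the incompatible step."""

    def attention_step(self) -> str:
        # Hardware-specific replacement for the one step that differs.
        return "iluvatar:attention"


print(IluvatarRunnerSketch().execute())
```

The trade-off is the one raised in the review: the subclass picks up upstream changes to the shared flow automatically, but it is also coupled to them, so an upstream change can break the subclass until the affected method is overridden.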
| def initialize_cache(self, num_gpu_blocks: int) -> None: | ||
| """ """ | ||
| self.model_runner.update_share_input_block_num(num_gpu_blocks=num_gpu_blocks) | ||
| class IluvatarPaddleDisWorkerProc(PaddleDisWorkerProc): |
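The worker proc should not live at the worker level; please leave a TODO here. The earlier design did not reserve an interface for supporting multiple worker procs. The architecture should become more reasonable once the FastDeploy executor is refactored.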
| up_gate_proj_weight: paddle.Tensor, | ||
| down_proj_weight: paddle.Tensor, | ||
| up_gate_proj_bias: Optional[paddle.Tensor], | ||
| up_gate_proj_scale: Optional[paddle.Tensor], | ||
| down_proj_scale: Optional[paddle.Tensor], | ||
| down_proj_in_scale: Optional[paddle.Tensor], | ||
| ffn1_weight: paddle.Tensor, | ||
| ffn2_weight: paddle.Tensor, | ||
| ffn1_bias: Optional[paddle.Tensor], | ||
| ffn1_scale: Optional[paddle.Tensor], | ||
| ffn2_scale: Optional[paddle.Tensor], | ||
| ffn2_in_scale: Optional[paddle.Tensor], | ||
| expert_idx_per_token: Optional[paddle.Tensor], | ||
| quant_method: str, | ||
| used_in_ep_low_latency: bool, | ||
| ): | ||
| assert up_gate_proj_bias is None | ||
| assert up_gate_proj_scale is not None | ||
| assert down_proj_scale is not None | ||
| assert down_proj_in_scale is None | ||
| assert ffn1_bias is None | ||
| assert ffn1_scale is not None | ||
| assert ffn2_scale is not None | ||
| assert ffn2_in_scale is None | ||
| assert expert_idx_per_token is None | ||
| assert quant_method in ("weight_only_int8") | ||
| assert not used_in_ep_low_latency | ||
| tokens_expert_prefix_sum_cpu = tokens_expert_prefix_sum.to("cpu") | ||
| up_gate_proj_output = paddle.empty( | ||
| [permute_input.shape[0], up_gate_proj_weight.shape[1]], | ||
| dtype=permute_input.dtype, | ||
| ) | ||
| group_gemm( | ||
| permute_input, | ||
| tokens_expert_prefix_sum_cpu, | ||
| up_gate_proj_weight, | ||
| up_gate_proj_scale, | ||
| up_gate_proj_output, | ||
| ) | ||
| act_out = swiglu(up_gate_proj_output) | ||
| output = paddle.empty([act_out.shape[0], down_proj_weight.shape[1]], dtype=act_out.dtype) | ||
| group_gemm( | ||
| act_out, | ||
| tokens_expert_prefix_sum_cpu, | ||
| down_proj_weight, | ||
| down_proj_scale, | ||
| output, | ||
| ) |
Please keep the original names for these; the ffn1/ffn2 wording is discouraged in FD.
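For readers following the diff above, the sequence it performs (grouped GEMM, SwiGLU, grouped GEMM) can be sketched in plain Python. This is a reference sketch under stated assumptions, not the Iluvatar kernel: it assumes `tokens_expert_prefix_sum` is an inclusive running total of tokens per expert, that each expert's weight is laid out as `[out_features][in_features]` (consistent with the output width being `weight.shape[1]` in the diff), that the weight-only scale is applied per output channel, and that SwiGLU splits the last dimension in half and computes `silu(first) * second`. The `*_ref` function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def group_gemm_ref(x: Matrix, prefix_sum: List[int],
                   weight: List[Matrix], scale: List[List[float]]) -> Matrix:
    """Multiply each expert's contiguous slice of rows by that expert's weight."""
    out: Matrix = []
    start = 0
    for expert, end in enumerate(prefix_sum):
        for row in x[start:end]:
            # Assumed layout: weight[expert][n] holds the input weights of output n.
            out.append([
                scale[expert][n] * sum(a * b for a, b in zip(row, weight[expert][n]))
                for n in range(len(weight[expert]))
            ])
        start = end
    return out


def swiglu_ref(x: Matrix) -> Matrix:
    """Assumed SwiGLU: silu(first half) * second half, per row."""
    if not x:
        return []
    half = len(x[0]) // 2
    return [
        [(g / (1.0 + math.exp(-g))) * u for g, u in zip(row[:half], row[half:])]
        for row in x
    ]


def moe_ffn_ref(permute_input: Matrix, prefix_sum: List[int],
                up_gate_w: List[Matrix], up_gate_s: List[List[float]],
                down_w: List[Matrix], down_s: List[List[float]]) -> Matrix:
    """Up/gate projection, activation, then down projection, all per expert."""
    hidden = group_gemm_ref(permute_input, prefix_sum, up_gate_w, up_gate_s)
    return group_gemm_ref(swiglu_ref(hidden), prefix_sum, down_w, down_s)


# Three tokens already permuted by expert: rows 0-1 go to expert 0, row 2 to expert 1.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = moe_ffn_ref(
    tokens, [2, 3],
    up_gate_w=[[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]],
    up_gate_s=[[1.0, 1.0], [1.0, 1.0]],
    down_w=[[[1.0], [2.0]], [[1.0], [1.0]]],
    down_s=[[1.0, 0.5], [1.0, 1.0]],
)
print(len(result), len(result[0]))
```

A reference like this is mainly useful for checking a fused kernel's output on small inputs; it makes no attempt at performance.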
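This is the first round of performance optimization for FD on Iluvatar hardware. The specific optimization strategies are:
With this version, running the ERNIE 4.5 300B model on the GSM8K dataset takes about 6.3 h in total, with an accuracy of 0.964.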