[CANN]: add the basic supports of Flash Attention kernel #13627
BTW, we have only tested it on the 910B. At our school, the CANN environment on the 310P server is 7.x, so we cannot compile llama.cpp with the CANN backend.
We can test it on the 310P.
Evaluation Report on Ascend 910B + Kunpeng 920

Authors from Peking University: Bizhao Shi, Yuxin Yang, Ruiyang Ma, Guojie Luo

- Llama-7B-f16: Scripts, With FA, Without FA
- Qwen3-14B-Q8_0: Scripts, With FA, Without FA
- Qwen3-32B-Q8_0: Scripts, Without FA
- Qwen2-72B-Q8_0: Scripts, With FA, Without FA
You have implemented FlashAttention (FA) on CANN and provided a comprehensive test report. It looks excellent and is highly meaningful! Thank you so much to you and your colleagues for your valuable contributions to the llama.cpp project and your support for Huawei Ascend!
docs/backend/CANN.md (Outdated):

```markdown
## TODO
- Support more models and data types.
- Support more models and d
```
There seem to be some documentation errors here.
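The second bullet looks like a truncated duplicate of the first; presumably the list was meant to read simply:

```markdown
## TODO
- Support more models and data types.
```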
```cpp
ggml_cann_pool_alloc bcast_pse_allocator(ctx.pool());
void* bcast_pse_buffer = nullptr;
if(src3)
    bcast_pse_buffer = bcast_pse_allocator.alloc(
```
Could the memory allocation here be moved into the `src3 != nullptr` block below?
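A minimal sketch of the suggested restructuring, assuming the buffer is only consumed on the masked path; `bcast_pse_nbytes` is a hypothetical stand-in for the size expression truncated in the excerpt above:

```cpp
if (src3 != nullptr) {
    // allocator and buffer live only on the masked path, so no pool
    // memory is requested when there is no mask tensor
    ggml_cann_pool_alloc bcast_pse_allocator(ctx.pool());
    void* bcast_pse_buffer = bcast_pse_allocator.alloc(bcast_pse_nbytes);
    // ... build and use bcast_pse_tensor from bcast_pse_buffer ...
}
```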
ggml/src/ggml-cann/aclnn_ops.cpp (Outdated):

```cpp
    if(src3)
        ggml_cann_release_resources(ctx, bcast_pse_tensor);
}else{
    throw std::runtime_error("Function not implemented");
```
I think using `GGML_ABORT("Function not implemented");` would be a better choice.
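For reference, the quoted else branch would then become the following. `GGML_ABORT` is ggml's standard fatal-error macro: it reports the source location and aborts, rather than throwing a C++ exception, which ggml does not use for error handling:

```cpp
} else {
    GGML_ABORT("Function not implemented");
}
```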
ggml/src/ggml-cann/aclnn_ops.cpp (Outdated):

```cpp
#include <string>
#include <cstring>

#include "aclnnop/aclnn_flash_attention_score.h"
```
Remove the unnecessary includes:

```cpp
#include "aclnnop/aclnn_flash_attention_score.h"
#include "aclnnop/aclnn_logical_not.h"
```
ggml/src/ggml-cann/aclnn_ops.cpp (Outdated):

```cpp
@@ -72,12 +72,23 @@
#include <exception>
#include <vector>

#include <iostream>
```
Remove the unnecessary includes:

```cpp
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
```
ggml/src/ggml-cann/aclnn_ops.h (Outdated):

```cpp
@@ -45,6 +45,8 @@
#include <aclnnop/aclnn_cos.h>
#include <aclnnop/aclnn_log.h>
#include <aclnnop/aclnn_sign.h>
#include <aclnnop/aclnn_fused_infer_attention_score_v2.h>
```
I suggest moving this `#include <aclnnop/aclnn_fused_infer_attention_score_v2.h>` to the `aclnn_ops.cpp` file, and if we don't need `aclnn_isneginf`, it can be removed.
We have updated the files according to the review comments. Thanks for your time. @noemotiovon @hipudding
Sorry, I just noticed a few minor issues.
I pulled your latest code and tested the FA operator using a script, but encountered the following problems. Could you please help me check the cause? Thank you so much!

Environment:
- 910B3
- CANN 8.1 RC1

Script:

```sh
./bin/test-backend-ops test -b CANN0 -o FLASH_ATTN_EXT
```
Error:

```
Backend 1/2: CANN0
ggml_backend_cann_context: device 0 async operator submission is OFF
Device description: Ascend910B3
Device memory: 62432 MB (62147 MB free)
FLASH_ATTN_EXT(hsk=64,hsv=64,nh=4,nr=1,kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f16,permute=[0,1,2,3]): new_pool_for_device: device 0 use vmm pool
CANN error: EZ9999: Inner Error!
EZ9999: [PID: 3008073] 2025-05-21-09:49:04.442.578 precision mode[2] should be 0 or 1[FUNC:InputAttrsPreProcess][FILE:incre_flash_attention_tiling.cc][LINE:303]
TraceBack (most recent call last):
FusedInferAttentionScore do tiling failed, ret is -1.
Check NnopbaseExecutorDoTiling(executor) failed
Check NnopbaseExecutorTilingAndUpdateBinInfo(executor) failed
Check NnopbaseExecutorMatchCache(executor) failed
Check NnopbaseRunForWorkspace(*executor, workspaceSize) failed
current device: 0, in function ggml_cann_flash_attn_ext at /home/cmq/lcg/github/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2858
aclnnFusedInferAttentionScoreV2GetWorkspaceSize(acl_q_tensor, acl_k_tensor_list, acl_v_tensor_list, bcast_pse_tensor, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, numHeads, scaleValue, preTokens, nextTokens, layout, numKeyValueHeads, sparseMode, innerPrecise, blockSize, antiquantMode, softmaxLseFlag, keyAntiquantMode, valueAntiquantMode, acl_dst_f16_tensor, nullptr, &workspaceSize, &executor)
/home/cmq/lcg/github/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:65: CANN error
```
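One guess at the cause, reading only the EZ9999 message: the tiling check rejects precision mode 2, and `innerPrecise` is the only precision-related attribute in the failing call, so it may need to be constrained before the workspace query. This is an assumption, not a confirmed fix:

```cpp
// Hypothetical: if innerPrecise is the "precision mode" validated by the
// tiling function, only the values 0 and 1 are accepted on this path.
int64_t innerPrecise = 1; // the failing run apparently passed 2
```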
ggml/src/ggml-cann/aclnn_ops.cpp (Outdated):

```cpp
aclTensor* acl_src0_f16_tensor = nullptr;
aclTensor* acl_src1_f16_tensor = nullptr;
aclTensor* acl_src2_f16_tensor = nullptr;
aclTensor* acl_src3_f16_tensor = nullptr;
```
The variable `acl_src3_f16_tensor` is not used and can likely be removed.
ggml/src/ggml-cann/aclnn_ops.cpp (Outdated):

```cpp
        GGML_ABORT("Function not implemented");
    }
}
```
Dear authors,
This PR enhances the CANN backend with the FA kernel. Currently, it only supports F16 KV tensors and does not support logit softcap. We have tested the kernel on Ascend 910B using test-backend-ops.
Thanks.
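For readers, the stated support matrix (F16 KV only, no logit softcap) corresponds to a guard of roughly the following shape. This is a hypothetical sketch based on ggml's tensor and parameter layout for FLASH_ATTN_EXT, not the PR's actual code; `cann_fa_supported` is an invented name:

```cpp
#include <cstring>
#include "ggml.h"

// Hypothetical guard mirroring the limitations stated above.
static bool cann_fa_supported(const ggml_tensor * op) {
    const ggml_tensor * k = op->src[1]; // src[0]=Q, src[1]=K, src[2]=V, src[3]=mask
    const ggml_tensor * v = op->src[2];
    if (k->type != GGML_TYPE_F16 || v->type != GGML_TYPE_F16) {
        return false; // only F16 KV tensors are supported for now
    }
    // op_params layout for FLASH_ATTN_EXT: scale, max_bias, logit_softcap
    float logit_softcap = 0.0f;
    memcpy(&logit_softcap, (const float *) op->op_params + 2, sizeof(float));
    return logit_softcap == 0.0f; // logit softcap is not implemented
}
```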