ggml: aarch64: Implement SVE F32 kernels for Mamba Model #13602
Conversation
Hi @ggerganov, please review this PR.
It's better to split the PR in 2 parts. First part with:
- Add SVE support for ggml_vec_dot_f32() function.
- Add SVE support for ggml_vec_mad_f32() function.
- Add SVE support for ggml_vec_scale_f32() function.
The second part with Mamba-specific changes.
For the first part I need to see what is the improvement over the existing GGML_SIMD implementation (using ARM_NEON, for example), which AFAIK should always be available when SVE is available.
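For reference, here is a minimal standalone sketch (not the actual GGML_SIMD macro expansion; the function name is illustrative) of what the existing ARM_NEON path computes for ggml_vec_mad_f32, i.e. y[i] += x[i]*v with 128-bit vectors and a scalar tail. The SVE version under review, shown in the diff below, widens this to the machine's SVE vector length.

    #include <arm_neon.h>

    // Illustrative NEON baseline for y[i] += x[i]*v (4 floats per register, scalar leftovers).
    static void vec_mad_f32_neon(const int n, float * y, const float * x, const float v) {
        const float32x4_t vv = vdupq_n_f32(v);
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            float32x4_t ax = vld1q_f32(x + i);
            float32x4_t ay = vld1q_f32(y + i);
            ay = vfmaq_f32(ay, ax, vv);   // ay = ay + ax*vv
            vst1q_f32(y + i, ay);
        }
        for (; i < n; ++i) {              // scalar leftovers
            y[i] += x[i]*v;
        }
    }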
-    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);
-
-    GGML_F32_VEC ax[GGML_F32_ARR];
-    GGML_F32_VEC ay[GGML_F32_ARR];
-
-    for (int i = 0; i < np; i += GGML_F32_STEP) {
-        for (int j = 0; j < GGML_F32_ARR; j++) {
-            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
-            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
-            ay[j] = GGML_F32_VEC_FMA(ay[j], ax[j], vx);
-
-            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
-        }
-    }
+#if defined(__ARM_FEATURE_SVE)
+    const int sve_register_length = ggml_cpu_get_sve_cnt() * 8;
+    const int ggml_f32_epr  = sve_register_length / 32; // SVE128:4, SVE256:8, SVE512:16
+    const int ggml_f32_step = 2 * ggml_f32_epr;
+
+    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);
+
+    const int np = (n & ~(ggml_f32_step - 1));
+    svfloat32_t ax1, ax2;
+    svfloat32_t ay1, ay2;
+    for (int i = 0; i < np; i += ggml_f32_step) {
+        ax1 = GGML_F32_VEC_LOAD(x + i);
+        ay1 = GGML_F32_VEC_LOAD(y + i);
+        ay1 = GGML_F32_VEC_FMA(ax1, vx, ay1);
+
+        GGML_F32_VEC_STORE(y + i, ay1);
+
+        ax2 = GGML_F32_VEC_LOAD(x + i + 1*ggml_f32_epr);
+        ay2 = GGML_F32_VEC_LOAD(y + i + 1*ggml_f32_epr);
+        ay2 = GGML_F32_VEC_FMA(ax2, vx, ay2);
+
+        GGML_F32_VEC_STORE(y + i + 1*ggml_f32_epr, ay2);
+    }
+    // leftovers
+    // the maximum number of leftover elements is less than ggml_f32_epr; apply a predicated svmad on the available elements only
+    if (np < n) {
+        svbool_t pg = svwhilelt_b32(np, n);
+        ax1 = svld1_f32(pg, x + np);
+        ay1 = svld1_f32(pg, y + np);
+        ay1 = svmad_f32_m(pg, ax1, vx, ay1);
+
+        svst1_f32(pg, y + np, ay1);
+    }
+#else
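For readers less familiar with SVE predication, here is a minimal standalone sketch (not part of the PR; the function name is illustrative) of the whilelt/predicated-load/predicated-store idiom that the leftover handling above relies on, applied to the same y[i] += x[i]*v operation:

    #include <arm_sve.h>

    // Illustrative only: process an arbitrary-length F32 array in whole SVE vectors,
    // letting the predicate mask off out-of-range lanes on the final iteration.
    static void vec_mad_f32_sve_predicated(const int n, float * y, const float * x, const float v) {
        const svfloat32_t vv = svdup_n_f32(v);
        for (int i = 0; i < n; i += (int) svcntw()) {
            svbool_t pg = svwhilelt_b32(i, n);       // active lanes: i .. min(n, i + vector_length) - 1
            svfloat32_t ax = svld1_f32(pg, x + i);   // inactive lanes are loaded as zero
            svfloat32_t ay = svld1_f32(pg, y + i);
            ay = svmad_f32_m(pg, ax, vv, ay);        // active lanes: ax*vv + ay
            svst1_f32(pg, y + i, ay);                // inactive lanes are not written back
        }
    }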
How big is the benefit from these special-cased implementations compared to using the GGML_SIMD abstraction? If the benefit is not significant, it's better to use the existing implementation and avoid this extra code.
Hi @ggerganov,
The function ggml_vec_mad_f32() is called from the flash attention operation. This operation is currently supported in models such as phi-3-7B, falcon-7B, and Qwen-4k, and I have converted all of these to the F32 GGUF format. However, none of these models use ggml_compute_forward_flash_attn_back_f32(); instead they use ggml_compute_forward_flash_attn_ext_f16(), which does not call ggml_vec_mad_f32(). For this reason I couldn't show performance results for it. That said, I expect a good benefit over the NEON version whenever a model does use this function, because we saw a speed-up for ggml_vec_dot_f32(), which is similar to it.
Will it be fine to proceed with pushing this function?
/* Below function was borrowed from the GitHub repository:
   https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/common.hpp */
#if defined(__ARM_FEATURE_SVE) && defined(__aarch64__)
inline static svfloat32_t exp_ps_sve(svbool_t pg, svfloat32_t src) {
I am not sure that it is a good idea to borrow code like this. Better to implement it from scratch, or use scalar code if that is too difficult.
Hi @ggerganov,
We are not calling this function from an external library. It was implemented from scratch by our team, not borrowed from an external source. We will remove the comment above the function.
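For context, here is a minimal usage sketch (hypothetical; it assumes the exp_ps_sve() helper shown above is in scope, and the wrapper name is illustrative) of how such a predicated SVE exp helper is typically applied to an F32 buffer:

    #include <arm_sve.h>

    // Illustrative only: dst[i] = exp(src[i]) for an arbitrary-length F32 buffer,
    // using the predicated exp_ps_sve() helper declared above.
    static void exp_f32_sve(const int n, float * dst, const float * src) {
        for (int i = 0; i < n; i += (int) svcntw()) {
            svbool_t pg = svwhilelt_b32(i, n);            // mask off out-of-range lanes
            svfloat32_t v = svld1_f32(pg, src + i);
            svst1_f32(pg, dst + i, exp_ps_sve(pg, v));    // element-wise exponential on active lanes
        }
    }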
This PR adds SVE kernel support for F32 datatype specific to Mamba Model on ARM architecture.
Major code changes:
- Add SVE support for the ggml_vec_dot_f32() function.
- Add SVE support for the ggml_vec_mad_f32() function.
- Add SVE support for the ggml_vec_scale_f32() function.
- Add an SVE exp_ps_sve() helper.
- Mamba-specific SVE F32 kernel changes.
Performance
This PR improves performance by ~1.3x compared to the previous NEON-based implementation.
Model: falcon-mamba-7B-F32.gguf
Command: ./build/bin/llama-bench -m falcon-mamba-7B-F32.gguf -t 8,16,32,64 -p 128,1024 -n 0
Perplexity
There is no change in model accuracy as a result of this PR.
Command: ./build/bin/llama-perplexity -s 0 -np 128 -t 64 -m falcon-mamba-7B-F32.gguf -c 128 -b 128 --chunks 16 -f scripts/wikitext-2-raw/wiki.test.raw
Contributor: Vineel Abhinav Gottala
cc: @Vithulep