ggml: aarch64: Implement SVE F32 kernels for Mamba Model #13602


Open · wants to merge 4 commits into master

Conversation


@vineelabhinav commented on May 17, 2025

This PR adds SVE kernel support for the F32 data type, specific to the Mamba model, on the ARM architecture.
Major code changes (a sketch of the common SVE loop pattern follows the list):

  1. Add SVE support for ggml_vec_dot_f32() function.
  2. Add SVE support for ggml_compute_forward_ssm_scan_f32() function.
  3. Add SVE support for ggml_vec_mad_f32() function.
  4. Add SVE support for ggml_vec_scale_f32() function.
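
For illustration, these kernels share a common structure: process full SVE vectors in the main loop, then handle the tail with a predicated operation. Below is a minimal, self-contained sketch of that pattern for an F32 dot product; the function name and signature are illustrative only and do not match the actual ggml_vec_dot_f32() interface.

```c
#include <arm_sve.h>

// Illustrative predicated SVE dot product; not the real ggml_vec_dot_f32() signature.
static float sve_dot_f32_sketch(int n, const float * x, const float * y) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    int i = 0;
    svbool_t pg = svwhilelt_b32(i, n);          // lanes active while i + lane < n
    while (svptest_any(svptrue_b32(), pg)) {
        svfloat32_t vx = svld1_f32(pg, x + i);  // predicated loads never read past n
        svfloat32_t vy = svld1_f32(pg, y + i);
        acc = svmla_f32_m(pg, acc, vx, vy);     // acc += x*y on active lanes only
        i  += (int) svcntw();                   // advance by the hardware vector length
        pg  = svwhilelt_b32(i, n);
    }
    return svaddv_f32(svptrue_b32(), acc);      // horizontal sum of the accumulator
}
```

The same load/FMA/store skeleton, with the tail handled by svwhilelt_b32 predication, underlies the four kernels listed above; the PR additionally processes two SVE vectors per main-loop iteration.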

Performance

This PR improves performance by ~1.3x compared to the previous NEON-based implementation.
Model: falcon-mamba-7B-F32.gguf
Command: ./build/bin/llama-bench -m falcon-mamba-7B-F32.gguf -t 8,16,32,64 -p 128,1024 -n 0

  • Task 1: prompt length: 128 tokens, generated tokens: 1

| Threads | NEON (tokens/s) | SVE (tokens/s) | Ratio |
|---------|-----------------|----------------|-------|
| 8       | 9.21            | 12.52          | 1.36  |
| 16      | 17.89           | 23.85          | 1.33  |
| 32      | 32.3            | 41.59          | 1.29  |
| 64      | 53.08           | 62.94          | 1.19  |
  • Task 2: prompt length: 1024 tokens, generated tokens: 1

| Threads | NEON (tokens/s) | SVE (tokens/s) | Ratio |
|---------|-----------------|----------------|-------|
| 8       | 8.95            | 11.66          | 1.3   |
| 16      | 17.3            | 21.97          | 1.27  |
| 32      | 31.07           | 38.48          | 1.24  |
| 64      | 50.73           | 58.99          | 1.16  |

Perplexity

There is no change in model accuracy as a result of this PR.
Command: ./build/bin/llama-perplexity -s 0 -np 128 -t 64 -m falcon-mamba-7B-F32.gguf -c 128 -b 128 --chunks 16 -f scripts/wikitext-2-raw/wiki.test.raw

| NEON               | SVE                |
|--------------------|--------------------|
| 7.6153 +/- 0.66890 | 7.6153 +/- 0.66890 |

Contributor: Vineel Abhinav Gottala

cc: @Vithulep

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on May 17, 2025
@abhijain1204fujitsu commented:

Hi @ggerganov, could you please review this PR?

@ggerganov (Member) left a comment:

It's better to split the PR into two parts. The first part with:

  • Add SVE support for ggml_vec_dot_f32() function.
  • Add SVE support for ggml_vec_mad_f32() function.
  • Add SVE support for ggml_vec_scale_f32() function.

The second part with Mamba-specific changes.

For the first part, I need to see what the improvement is over the existing GGML_SIMD implementation, using ARM_NEON for example, which AFAIK should always be available when SVE is available.
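
For context, the generic path being compared against follows the GGML_SIMD macro pattern that appears (interleaved) in the diff below. A simplified sketch of that path for ggml_vec_mad_f32() (y[i] += x[i]*v) is shown here; the function name suffix and exact signature are illustrative, and the GGML_F32_* macros are taken as-is from ggml's SIMD mappings rather than expanded.

```c
// Sketch of the generic GGML_SIMD body for a mad kernel (y[i] += x[i]*v).
// The GGML_F32_* macros expand to NEON, SVE, AVX, ... depending on the build target.
inline static void ggml_vec_mad_f32_generic(const int n, float * y, const float * x, const float v) {
    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);

    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];

    const int np = (n & ~(GGML_F32_STEP - 1));

    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_FMA(ay[j], ax[j], vx);

            GGML_F32_VEC_STORE(y + i + j*GGML_F32_EPR, ay[j]);
        }
    }

    // scalar leftovers
    for (int i = np; i < n; ++i) {
        y[i] += x[i]*v;
    }
}
```

The question above is how much the hand-written SVE version in this PR gains over this abstraction when the macros are mapped to NEON (or to SVE directly).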

Comment on lines +152 to +186
#if defined(__ARM_FEATURE_SVE)
    const int sve_register_length = ggml_cpu_get_sve_cnt() * 8;
    const int ggml_f32_epr  = sve_register_length / 32; // SVE128:4, SVE256:8, SVE512:16
    const int ggml_f32_step = 2 * ggml_f32_epr;

    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);

    const int np = (n & ~(ggml_f32_step - 1));
    svfloat32_t ax1, ax2;
    svfloat32_t ay1, ay2;
    for (int i = 0; i < np; i += ggml_f32_step) {
        ax1 = GGML_F32_VEC_LOAD(x + i);
        ay1 = GGML_F32_VEC_LOAD(y + i);
        ay1 = GGML_F32_VEC_FMA(ax1, vx, ay1);

        GGML_F32_VEC_STORE(y + i, ay1);

        ax2 = GGML_F32_VEC_LOAD(x + i + 1*ggml_f32_epr);
        ay2 = GGML_F32_VEC_LOAD(y + i + 1*ggml_f32_epr);
        ay2 = GGML_F32_VEC_FMA(ax2, vx, ay2);

        GGML_F32_VEC_STORE(y + i + 1*ggml_f32_epr, ay2);
    }
    // leftovers
    // maximum number of leftover elements will be less than ggml_f32_epr. Apply predicated svmad on available elements only
    if (np < n) {
        svbool_t pg = svwhilelt_b32(np, n);
        ax1 = svld1_f32(pg, x + np);
        ay1 = svld1_f32(pg, y + np);
        ay1 = svmad_f32_m(pg, ax1, vx, ay1);

        svst1_f32(pg, y + np, ay1);
    }
#else

@ggerganov (Member):
How big is the benefit from these special-cased implementations compared to using the GGML_SIMD abstraction? If the benefit is not significant, it's better to use the existing implementation and avoid this extra code.

@vineelabhinav (Author) commented on May 22, 2025:
Hi @ggerganov,
The function ggml_vec_mad_f32() is called from the flash attention operation. Currently this operation is supported in some models such as phi-3-7B, falcon-7B and Qwen-4k. I have converted all of these to F32 GGUF format, but none of the models use ggml_compute_forward_flash_attn_back_f32(); instead they use ggml_compute_forward_flash_attn_ext_f16(), which does not call ggml_vec_mad_f32(). For this reason I couldn't show performance results. However, I can assure you there will be a good benefit compared to the NEON version if a model uses this function, because we saw a speedup for ggml_vec_dot_f32(), which is similar to this function.
Is it fine to proceed with pushing this function?

Comment on lines +588 to +591
/* Below function was borrowed from the GitHub repository:
https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/common.hpp */
#if defined(__ARM_FEATURE_SVE) && defined(__aarch64__)
inline static svfloat32_t exp_ps_sve(svbool_t pg, svfloat32_t src) {

@ggerganov (Member):

I am not sure that it is a good idea to borrow code like this. Better to implement this from scratch, or use scalar code if it is too difficult.
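
For reference, a from-scratch predicated SVE exponential along these lines could look like the sketch below. This is only an illustrative range-reduction approach (exp(x) = 2^n * exp(r)) using standard ACLE intrinsics; the function name, polynomial degree, and coefficients are placeholders and do not reflect the PR's actual exp_ps_sve() implementation.

```c
#include <arm_sve.h>

// Illustrative only: exp(x) ~= 2^n * p(r), n = round(x*log2(e)), r = x - n*ln(2).
// Accuracy tuning and edge-case handling (overflow, NaN) are deliberately omitted.
static inline svfloat32_t exp_f32_sve_sketch(svbool_t pg, svfloat32_t x) {
    const svfloat32_t log2e = svdup_n_f32(1.44269504f);
    const svfloat32_t ln2   = svdup_n_f32(0.69314718f);

    svfloat32_t n = svrintn_f32_x(pg, svmul_f32_x(pg, x, log2e)); // nearest integer
    svfloat32_t r = svmls_f32_x(pg, x, n, ln2);                   // r = x - n*ln2

    // 4th-order Taylor polynomial for exp(r) on [-ln2/2, ln2/2] (placeholder accuracy).
    svfloat32_t p = svdup_n_f32(1.0f/24.0f);
    p = svmla_f32_x(pg, svdup_n_f32(1.0f/6.0f), p, r); // p = p*r + 1/6
    p = svmla_f32_x(pg, svdup_n_f32(0.5f),      p, r); // p = p*r + 1/2
    p = svmla_f32_x(pg, svdup_n_f32(1.0f),      p, r); // p = p*r + 1
    p = svmla_f32_x(pg, svdup_n_f32(1.0f),      p, r); // p = p*r + 1

    // Scale by 2^n using FSCALE.
    return svscale_f32_x(pg, p, svcvt_s32_f32_x(pg, n));
}
```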

@vineelabhinav (Author) replied:

Hi @ggerganov,
We are not calling this function from an external library. It was implemented from scratch by our team, not borrowed from an external source. We will remove the comment above the function.
