-
Notifications
You must be signed in to change notification settings - Fork 13.7k
arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… #15277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… #15277
Conversation
ggerganov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, it's probably a better idea to implement GEMM improvements through the repack mechanism in ggml. It would give you more flexibility for rearranging the data to better fit the instructions.
ggml/src/ggml-cpu/arch/arm/quants.c
Outdated
| r1 = svreinterpret_s8_s64(svzip2_s64(svreinterpret_s64_s8(q8bytes_0_h), svreinterpret_s64_s8(q8bytes_1_h))); | ||
| r2 = svreinterpret_s8_s64(svzip1_s64(svreinterpret_s64_s8(q8bytes_0_l), svreinterpret_s64_s8(q8bytes_1_l))); | ||
| r3 = svreinterpret_s8_s64(svzip2_s64(svreinterpret_s64_s8(q8bytes_0_l), svreinterpret_s64_s8(q8bytes_1_l))); | ||
| sumi2 = svmmla_s32(svmmla_s32(svmmla_s32(svmmla_s32(svdup_n_s32(0), r0, l0), r1, l1), r2, l2), r3, l3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does svmmla_s32 require to check for __ARM_FEATURE_SVE_MATMUL_INT8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. svmmla_s32() is intrinsic function for sve instruction and it need i8mm feature to CPU.
Could you please explain what the repack mechanism refers to, or point me to the relevant implementation details? |
|
@ggerganov |
|
Hi @ggerganov , just following up on this one |
ggml/src/ggml-cpu/arch/arm/quants.c
Outdated
| #endif | ||
|
|
||
| #if defined(__ARM_FEATURE_MATMUL_INT8) | ||
| #ifdef __ARM_FEATURE_SVE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be:
| #ifdef __ARM_FEATURE_SVE | |
| #if defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_SVE_MATMUL_INT8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggerganov
Addressed the review comments in the latest commit.
The CI failure seems unrelated (network/download issue). Please re-run when convenient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ggerganov
Just wanted to check in if you had a chance to look at this.
The review comments have been resolved, and the CI failure seems unrelated.
Thanks as always for your time.
58f7970 to
ca1884b
Compare
ca1884b to
aca4ea8
Compare
ggerganov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some coding style fixes.
| return vutmp; | ||
| } | ||
| #endif | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use 4 space indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your comments. I've done.
ggml/src/ggml-cpu/arch/arm/quants.c
Outdated
|
|
||
| #if defined(__ARM_FEATURE_MATMUL_INT8) | ||
| #if defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_MATMUL_INT8) | ||
| if (nrc==2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (nrc==2) { | |
| if (nrc == 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ggml/src/ggml-cpu/arch/arm/quants.c
Outdated
| svbool_t pg128_all = svptrue_pat_b8(SV_VL16); | ||
| for (int i = 0; i < nb; ++i) { | ||
| svfloat32_t vy_d = svuzp1_f32(svdup_n_f32(vy0[i].d), svdup_n_f32(vy1[i].d)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix indentation of this for loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ggml/src/ggml-cpu/arch/arm/quants.c
Outdated
| } | ||
| svst1_f32(pg32_2, s, sumf1); | ||
| svst1_f32(pg32_2, s + bs, svreinterpret_f32_u8(svext_u8(svreinterpret_u8_f32(sumf1), svdup_n_u8(0), 8))); | ||
| return ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return ; | |
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ggml/src/ggml-cpu/arch/arm/quants.c
Outdated
| } | ||
|
|
||
| #ifdef __ARM_FEATURE_SVE | ||
| static inline svuint32_t ggml_decode_q4scales_and_mins_for_mmla(const uint32_t *vx_scales) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| static inline svuint32_t ggml_decode_q4scales_and_mins_for_mmla(const uint32_t *vx_scales) { | |
| static inline svuint32_t ggml_decode_q4scales_and_mins_for_mmla(const uint32_t * vx_scales) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggerganov
Addressed the review comments in the latest commit.
The CI failure seems unrelated. Please re-run when convenient.
ggerganov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution. Because of limited SVE hardware it's a bit difficult to approve these changes. Will ping you for support if we face problems with this code in the future.
|
thank you @ggerganov |
This PR improves q4_k_q8_k and q6_K_q8_K gemm kernel with arm64 i8mm instruction with SVE.
similar proposal for NEON support is made in PR #13886
Since it uses SVE instructions, it is characterized by improved performance even on machines with a SIMD width of 128 bits or more.
Verifying Features
This PR contains the SVE implementation of the vector dot used to compute the Q4_K quantization.
By running a Q4_K_M quantized model of Llama-3.1-8B, I confirmed that the values match.
I also verified that the perplexity matches between the NEON and SVE implementations.
performance check
Performance was measured with AWS Graviton3.
Performance is improved as follows (measured with
llama-bench).original
This PR