
Conversation

@fj-y-saito
Contributor

@fj-y-saito fj-y-saito commented Aug 13, 2025

This PR improves the q4_K_q8_K and q6_K_q8_K GEMM kernels using arm64 i8mm instructions via SVE.
A similar proposal for NEON support was made in PR #13886.
Because it uses SVE instructions, it delivers improved performance even on machines with a SIMD width of 128 bits or more.
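For context, the core of the i8mm route is the SVE SMMLA instruction, which accumulates a 2x2 tile of int8 dot products per 128-bit segment. The sketch below is illustrative only and is not the PR's actual kernel: the helper name dot_2x2_i8mm, the row pointers a0/a1/b0/b1, and the length k are hypothetical, and it assumes __ARM_FEATURE_SVE, __ARM_FEATURE_SVE_MATMUL_INT8, and k divisible by 8.

```c
#include <arm_sve.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_SVE_MATMUL_INT8)
// Hypothetical helper: accumulates the 2x2 tile of dot products between rows
// a0,a1 and rows b0,b1 (all int8, length k, k % 8 == 0) using SMMLA.
static inline svint32_t dot_2x2_i8mm(const int8_t * a0, const int8_t * a1,
                                     const int8_t * b0, const int8_t * b1, int k) {
    const svbool_t pg8 = svptrue_pat_b8(SV_VL8);  // 8 int8 lanes per row per step
    svint32_t acc = svdup_n_s32(0);
    for (int i = 0; i < k; i += 8) {
        // SMMLA expects each operand as a 2x8 int8 tile: row 0 in the low
        // 64 bits of a 128-bit segment, row 1 in the high 64 bits.
        svint8_t ra = svreinterpret_s8_s64(svzip1_s64(
                svreinterpret_s64_s8(svld1_s8(pg8, a0 + i)),
                svreinterpret_s64_s8(svld1_s8(pg8, a1 + i))));
        svint8_t rb = svreinterpret_s8_s64(svzip1_s64(
                svreinterpret_s64_s8(svld1_s8(pg8, b0 + i)),
                svreinterpret_s64_s8(svld1_s8(pg8, b1 + i))));
        acc = svmmla_s32(acc, ra, rb);            // acc[0..3] += {a0,a1} x {b0,b1}
    }
    return acc;  // lanes 0..3 of the first 128-bit segment hold the 2x2 result
}
#endif
```

In the actual kernels the q4_K nibbles are unpacked and scaled and the q8_K rows are interleaved first, roughly the zip/reinterpret pattern quoted in the review comments below, before being fed to svmmla_s32.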

Verifying Features

This PR contains the SVE implementation of the vector dot product used for Q4_K quantization.
By running a Q4_K_M-quantized Llama-3.1-8B model, I confirmed that the output values match.
I also verified that the perplexity matches between the NEON and SVE implementations.

| NEON              | SVE (this PR)     |
| ----------------: | ----------------: |
| 6.5772 +/- 0.04061 | 6.5774 +/- 0.04062 |

Performance check

Performance was measured on AWS Graviton3 with llama-bench.
The results improve as follows.

Original

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |             pp1 |         17.60 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |             pp2 |         22.74 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |             pp4 |         24.83 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |             pp8 |         26.57 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp512 |         27.50 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           tg128 |         17.30 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |             pp1 |         31.50 ± 0.07 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |             pp2 |         42.44 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |             pp4 |         47.74 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |             pp8 |         51.98 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp512 |         54.69 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           tg128 |         31.29 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |             pp1 |         40.51 ± 0.05 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |             pp2 |         66.38 ± 0.08 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |             pp4 |         78.73 ± 0.04 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |             pp8 |         87.98 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp512 |         96.20 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           tg128 |         40.36 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |             pp1 |         45.10 ± 0.05 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |             pp2 |         74.95 ± 0.10 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |             pp4 |         99.42 ± 0.06 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |             pp8 |        114.52 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp512 |        136.11 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           tg128 |         44.74 ± 0.01 |

This PR

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |             pp1 |         17.36 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |             pp2 |         27.59 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |             pp4 |         31.10 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |             pp8 |         33.53 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp512 |         35.36 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           tg128 |         17.20 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |             pp1 |         31.42 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |             pp2 |         50.81 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |             pp4 |         58.81 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |             pp8 |         65.04 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp512 |         70.26 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           tg128 |         31.08 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |             pp1 |         40.88 ± 0.10 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |             pp2 |         73.11 ± 0.08 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |             pp4 |         92.12 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |             pp8 |        105.67 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp512 |        119.13 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           tg128 |         40.56 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |             pp1 |         45.56 ± 0.11 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |             pp2 |         76.08 ± 0.12 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |             pp4 |        113.12 ± 0.23 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |             pp8 |        134.91 ± 0.21 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp512 |        165.69 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           tg128 |         44.94 ± 0.01 |

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 13, 2025
@fj-y-saito fj-y-saito changed the title from "add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_…" to "arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_…" on Aug 13, 2025
Member

@ggerganov ggerganov left a comment

Btw, it's probably a better idea to implement GEMM improvements through the repack mechanism in ggml. It would give you more flexibility for rearranging the data to better fit the instructions.

r1 = svreinterpret_s8_s64(svzip2_s64(svreinterpret_s64_s8(q8bytes_0_h), svreinterpret_s64_s8(q8bytes_1_h)));
r2 = svreinterpret_s8_s64(svzip1_s64(svreinterpret_s64_s8(q8bytes_0_l), svreinterpret_s64_s8(q8bytes_1_l)));
r3 = svreinterpret_s8_s64(svzip2_s64(svreinterpret_s64_s8(q8bytes_0_l), svreinterpret_s64_s8(q8bytes_1_l)));
sumi2 = svmmla_s32(svmmla_s32(svmmla_s32(svmmla_s32(svdup_n_s32(0), r0, l0), r1, l1), r2, l2), r3, l3);
Member

Does svmmla_s32 require checking for __ARM_FEATURE_SVE_MATMUL_INT8?

Contributor Author

Yes. svmmla_s32() is an intrinsic for an SVE instruction, and it requires the i8mm feature in the CPU.
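
For reference, a minimal sketch of the guard pattern being discussed (illustrative, not the exact diff in this PR):

```c
// Illustrative guard pattern: the SVE i8mm path is compiled only when the
// compiler advertises both SVE and the SVE int8 matrix-multiply extension;
// otherwise fall back to the NEON i8mm path or generic code.
#if defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_SVE_MATMUL_INT8)
    // SVE + i8mm path: svmmla_s32() is available here.
#elif defined(__ARM_FEATURE_MATMUL_INT8)
    // NEON i8mm path: vmmlaq_s32() is available here.
#else
    // generic fallback
#endif
```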

@fj-y-saito
Contributor Author

Btw, it's probably a better idea to implement GEMM improvements through the repack mechanism in ggml. It would give you more flexibility for rearranging the data to better fit the instructions.

Could you please explain what the repack mechanism refers to, or point me to the relevant implementation details?
I tried looking into it and also found PR #10446, but I’m still not quite sure how the repack implementation actually works or whether that PR is directly related.

@fj-y-saito
Contributor Author

@ggerganov
Sorry for the late reply, and thanks for the suggestion — you’re right, using the repack mechanism could enable more flexible data rearrangement.
From what I’ve observed so far, it seems that repack might make memory access more efficient and reduce kernel-side rearrangement (e.g., zip instructions), which could be especially beneficial for low-batch inference.
I’ll take a closer look at the repack implementation to confirm this.
For now, I’d like to keep this PR as a baseline improvement with immediate performance benefits, and plan to explore repack integration in a follow-up PR.

@fj-y-saito
Contributor Author

Hi @ggerganov, just following up on this one.
I wanted to check if you had any further thoughts on keeping this PR as a standalone improvement.
From my side, I think this PR already provides immediate performance benefits and can be merged as is.
Regarding the repack-based approach, I won’t be able to implement it in the near term, but it’s an interesting direction for future exploration.
Thanks again for your earlier feedback!

#endif

#if defined(__ARM_FEATURE_MATMUL_INT8)
#ifdef __ARM_FEATURE_SVE
Member

Should this be:

Suggested change
#ifdef __ARM_FEATURE_SVE
#if defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_SVE_MATMUL_INT8)

Contributor Author

@ggerganov
Addressed the review comments in the latest commit.
The CI failure seems unrelated (network/download issue). Please re-run when convenient.

Contributor Author

Hi @ggerganov
Just wanted to check in to see if you had a chance to look at this.
The review comments have been resolved, and the CI failure seems unrelated.
Thanks as always for your time.

@fj-y-saito fj-y-saito requested a review from slaren as a code owner October 27, 2025 22:54
@fj-y-saito fj-y-saito force-pushed the feat-sve-i8mm-q4_K_quantization branch from 58f7970 to ca1884b Compare October 28, 2025 01:09
Member

@ggerganov ggerganov left a comment

Some coding style fixes.

return vutmp;
}
#endif

Member

Use 4 space indent

Contributor Author

Thank you for your comments. Done.


#if defined(__ARM_FEATURE_MATMUL_INT8)
#if defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_MATMUL_INT8)
if (nrc==2) {
Member

Suggested change
if (nrc==2) {
if (nrc == 2) {

Contributor Author

Done

Comment on lines 2112 to 2114
svbool_t pg128_all = svptrue_pat_b8(SV_VL16);
for (int i = 0; i < nb; ++i) {
svfloat32_t vy_d = svuzp1_f32(svdup_n_f32(vy0[i].d), svdup_n_f32(vy1[i].d));
Member

Fix indentation of this for loop

Contributor Author

Done

}
svst1_f32(pg32_2, s, sumf1);
svst1_f32(pg32_2, s + bs, svreinterpret_f32_u8(svext_u8(svreinterpret_u8_f32(sumf1), svdup_n_u8(0), 8)));
return ;
Member

Suggested change
return ;
return;

Contributor Author

Done

}

#ifdef __ARM_FEATURE_SVE
static inline svuint32_t ggml_decode_q4scales_and_mins_for_mmla(const uint32_t *vx_scales) {
Member

Suggested change
static inline svuint32_t ggml_decode_q4scales_and_mins_for_mmla(const uint32_t *vx_scales) {
static inline svuint32_t ggml_decode_q4scales_and_mins_for_mmla(const uint32_t * vx_scales) {

Contributor Author

@fj-y-saito fj-y-saito Nov 7, 2025

@ggerganov
Addressed the review comments in the latest commit.
The CI failure seems unrelated. Please re-run when convenient.

Member

@ggerganov ggerganov left a comment

Thank you for the contribution. Because of limited SVE hardware it's a bit difficult to approve these changes. Will ping you for support if we face problems with this code in the future.

@fj-y-saito
Contributor Author

Thank you, @ggerganov.
Sure, please feel free to reach out anytime if any issues come up with the SVE path.

@ggerganov ggerganov merged commit df70bed into ggml-org:master Nov 10, 2025
65 of 71 checks passed