
Conversation

hongyang-7 commented Sep 1, 2025

This PR improves the q4_K_q8_K kernel with repacking support for the AArch64 platform.

It has 2 enabling conditions:

  1. i8mm support.
  2. tensor.ne[1] % 4 == 0

The following structures and functions are implemented:

  • new quant type: block_q4_Kx4, built from 4 block_q4_K blocks, along with an offline repacking function (a hedged layout sketch follows this list)
  • new quantize path: NEON implementation for block_q8_Kx4 in ggml_quantize_mat_q8_K_4x8()
  • new gemv kernel: ggml_gemv_q4_K_4x8_q8_K(), a dotprod kernel
  • new gemm kernel: ggml_gemm_q4_K_4x8_q8_K(), an i8mm kernel
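
For readers unfamiliar with the format, below is a minimal sketch of what a 4-row interleaved q4_K container could look like. It is illustrative only: the field names, ordering, and interleaving here are assumptions, not the actual block_q4_Kx4 defined in this PR. The one detail taken from this PR (see note 1 below) is that the 6-bit packed sub-block scales/mins of q4_K are expanded to 8-bit during offline repacking.

```c
// Illustrative sketch only -- NOT the actual block_q4_Kx4 from this PR.
// Assumes the standard q4_K super-block: QK_K = 256 values, 8 sub-blocks of 32.
#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half;          // fp16 storage, as in ggml

typedef struct {
    ggml_half d[4];                  // super-block scale of each of the 4 interleaved rows
    ggml_half dmin[4];               // super-block min scale of each row
    uint8_t   scales[4 * 8];         // sub-block scales, already expanded from 6-bit to 8-bit
    uint8_t   mins[4 * 8];           // sub-block mins, likewise expanded to 8-bit
    uint8_t   qs[4 * QK_K / 2];      // 4-bit quants of the 4 rows, interleaved for SIMD loads
} block_q4_Kx4_sketch;
```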

Including this PR, q4_K_q8_K repacking now has 3 implementations; here is a brief summary:

| mode | packing column number | ISA dependency | tensor shape | PR |
| --- | --- | --- | --- | --- |
| q4_K_8x4_q8_K | 8 | dotprod | ne[1] % 8 == 0 | #17494 |
| q4_K_8x8_q8_K | 8 | i8mm | ne[1] % 8 == 0 | #16739 |
| q4_K_4x8_q8_K | 4 | i8mm | ne[1] % 4 == 0 | this PR |

Each implementation suits a different scenario.
For this PR, I have temporarily put the priority of 4x8 after 8x8 (a rough selection sketch follows):
(screenshot of the selection code omitted)
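
To make the priority concrete, here is a rough sketch of how the variant selection described above could look. It only encodes the conditions from the table (8x8 and 8x4 need ne[1] % 8 == 0, 4x8 needs ne[1] % 4 == 0) plus the 8x8-before-4x8 priority mentioned here; the booleans stand in for runtime CPU-feature detection, the position of 8x4 is assumed for illustration, and none of this is the PR's actual code.

```c
// Rough illustration of the selection order discussed above; not the PR's actual code.
// have_i8mm / have_dotprod stand in for the runtime CPU-feature checks.
#include <stdint.h>
#include <stdbool.h>

typedef enum { REPACK_NONE, REPACK_Q4_K_8x8, REPACK_Q4_K_8x4, REPACK_Q4_K_4x8 } q4_k_repack_t;

static q4_k_repack_t select_q4_k_repack(int64_t ne1, bool have_i8mm, bool have_dotprod) {
    if (have_i8mm    && ne1 % 8 == 0) return REPACK_Q4_K_8x8;  // existing i8mm kernel (#16739)
    if (have_dotprod && ne1 % 8 == 0) return REPACK_Q4_K_8x4;  // existing dotprod kernel (#17494)
    if (have_i8mm    && ne1 % 4 == 0) return REPACK_Q4_K_4x8;  // this PR: only needs ne[1] % 4 == 0
    return REPACK_NONE;                                        // fall back to the non-repacked path
}
```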

For the performance tests, however, comprehensive comparisons were conducted among the 3 implementations on the same models. This was done by uncommenting different if branches when choosing tensor_traits for the q4_K tensor type.

Test environment

  • Server: Neoverse-N2
  • System_info: n_threads = 64 (n_threads_batch = 64) / 256 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
  • Competitors: ① base (no q4_k repack) ② 8x4 impl ③ 8x8 impl ④ 4x8 impl
  • Models: 2 models of different scales

| model | storage size | param size | quant type |
| --- | --- | --- | --- |
| meta-llama-3-8b-instruct.Q4_K_M.gguf | 4.6G | 8.03B | Q4_K_M |
| DeepSeek-V3-Q4_k_M.gguf | 377G | 671B | Q4_K_M |

Bench results

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

./bin/llama-batched-bench -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16 -t 64 --no-mmap

S_PP (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 171.58 | 234.90 | 246.05 | 241.49 | 140.7% | 102.8% | 98.1% |
| 4 | 179.66 | 246.19 | 258.52 | 245.70 | 136.8% | 99.8% | 95.0% |
| 8 | 180.63 | 247.54 | 259.75 | 247.18 | 136.8% | 99.9% | 95.2% |
| 16 | 180.32 | 247.65 | 259.84 | 247.59 | 137.3% | 100.0% | 95.3% |

S_TG (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 39.27 | 36.62 | 36.69 | 36.22 | 92.2% | 98.9% | 98.7% |
| 4 | 76.93 | 82.85 | 83.22 | 94.90 | 123.4% | 114.5% | 114.0% |
| 8 | 103.94 | 116.79 | 118.76 | 129.74 | 124.8% | 111.1% | 109.2% |
| 16 | 126.75 | 149.26 | 153.59 | 167.59 | 132.2% | 112.3% | 109.1% |

Compared to 8x4/8x8, 4x8 performs better at TG and worse at PP. All three outperform the base (no repack).

(2) DeepSeek-V3-Q4_k_M.gguf

./bin/llama-batched-bench -m /mnt/models/DeepSeek-V3-Q4_k_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16 -t 64 --no-mmap

S_PP (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 21.92 | 29.67 | 28.78 | 27.94 | 127.5% | 94.2% | 97.1% |
| 4 | 25.20 | 33.62 | 32.3 | 31.03 | 123.1% | 92.3% | 96.1% |
| 8 | 25.56 | 33.78 | 32.42 | 31.12 | 121.8% | 92.1% | 96.0% |
| 16 | 25.55 | 33.79 | 32.39 | 31.11 | 121.8% | 92.1% | 96.0% |

S_TG (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7.28 | 7.15 | 7.17 | 7.12 | 97.8% | 99.6% | 99.3% |
| 4 | 12.61 | 12.46 | 11.52 | 13.19 | 104.6% | 105.9% | 114.5% |
| 8 | 15.10 | 15.45 | 14.74 | 15.91 | 105.4% | 103.0% | 107.9% |
| 16 | 17.05 | 18.16 | 17.54 | 18.06 | 105.9% | 99.4% | 103.0% |

The same pattern holds here: 4x8 performs better than 8x4/8x8 at TG and worse at PP, and all three outperform the base (no repack).

Perplexity

| model | ① base | ② 8x4 | ③ 8x8 | ④ 4x8 |
| --- | --- | --- | --- | --- |
| meta-llama-3-8b-instruct | 3.7482 +/- 0.14273 | 3.7615 +/- 0.14341 | 3.7562 +/- 0.14319 | 3.7576 +/- 0.14305 |
| DeepSeek-V3 | 1.0382 +/- 0.00630 | 1.0382 +/- 0.00626 | 1.0380 +/- 0.00623 | 1.0400 +/- 0.00642 |

Notes

  1. This PR defines the block_q4_Kx4 structure, which is not simply four block_q4_K blocks laid side by side: the scales dequantization step (recovering them from 6-bit to 8-bit) is moved ahead into the offline repacking stage, so the repacked storage is slightly larger than before. To avoid possible memory allocation problems, this PR also introduces a mechanism to expand the memory space reserved for repacking (a rough size estimate follows this list).
  2. This PR uses a different q8_K online repacking layout compared to the generic C version.
  3. The repacking idea in this PR originally comes from q4_0 on Arm: Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization #5780
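
As a rough size estimate for note 1, assuming the standard q4_K layout (QK_K = 256, fp16 d and dmin, 8 sub-block scales and 8 mins packed 6-bit into 12 bytes, 128 bytes of 4-bit quants) and assuming the repacked form stores the scales and mins as plain 8-bit values: one block_q4_K takes 2 + 2 + 12 + 128 = 144 bytes, while its repacked counterpart needs 2 + 2 + 8 + 8 + 128 = 148 bytes per source block, i.e. roughly 3% extra storage, which is why the repacking buffer needs room to grow.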

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 1, 2025.
hongyang-7 and others added 6 commits December 18, 2025 14:14
* new quanti: block_q4_kx4 with offline repack impl

* new quantize path: neon impl for ggml_quantize_mat_q8_K_4x8

* new gemv kernel: ggml_gemv_q4_K_4x8_q8_K based on dotprod

* new gemm kernel: ggml_gemm_q4_K_4x8_q8_K based on i8mm

* performance boost for both S_PP and S_TG

---------

Co-authored-by: yuanjia111 <yuan.jia@sanechips.com.cn>
hongyang-7 (Author) commented Dec 19, 2025

@ggerganov Hi, I just updated this PR to fix some CI issues from the old September version and synced it with the latest master; the performance data is also updated. But the workflow seems blocked now.
The hint says some approval is needed, though I thought the CI would run automatically:
(screenshot of the approval-required notice omitted)

ggerganov (Member) commented

I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes?

cc @Alcpz for opinions

hongyang-7 (Author) commented Dec 19, 2025

> I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes?
>
> cc @Alcpz for opinions

@ggerganov I understand the concern. The 4x8 scheme in this PR mainly targets ne[1] % 4 == 0, a more general case than 8x4/8x8; each of the 3 schemes covers a different scenario.

BTW, the only failing CI check seems to be a JSON parsing issue, which I think is unrelated to this PR.
