ggml : block repack support for Q4_K quantization for AArch64 architecture #15719
Conversation
* new quant type: block_q4_kx4 with offline repack impl
* new quantize path: neon impl for ggml_quantize_mat_q8_K_4x8
* new gemv kernel: ggml_gemv_q4_K_4x8_q8_K based on dotprod
* new gemm kernel: ggml_gemm_q4_K_4x8_q8_K based on i8mm
* performance boost for both S_PP and S_TG

Co-authored-by: yuanjia111 <yuan.jia@sanechips.com.cn>
Force-pushed from 543e8eb to 126ce2c
@ggerganov Hi, I just updated this PR to fix some CI issues from the old version in September and synced it with the latest master; the performance data is also updated. But the workflow seems blocked now.
I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes? cc @Alcpz for opinions
@ggerganov I understand this concern. The 4x8 scheme of this PR mainly targets ne[1] % 4 == 0, a more general scenario than 8x4/8x8. Each of the 3 schemes covers a different scenario. BTW, the only failed CI check seems to be a JSON parsing issue, which I think is not related to this PR.

This PR improves the q4_k_q8_k kernel with repacking support for the AArch64 platform.
It has 2 enabling conditions: the CPU must support the dotprod / i8mm features used by the new kernels, and the weight tensor must satisfy ne[1] % 4 == 0.
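As a rough illustration of how such conditions are typically gated in ggml's AArch64 code, the kernels are usually wrapped in ACLE feature macros; this is a generic sketch, not the actual guards from this PR:

```c
// Generic sketch of AArch64 feature gating; not copied from this PR's diff.
#if defined(__ARM_FEATURE_MATMUL_INT8)
    // i8mm available: the 4x8 GEMM kernel (ggml_gemm_q4_K_4x8_q8_K) can be used.
#endif
#if defined(__ARM_FEATURE_DOTPROD)
    // dotprod available: the 4x8 GEMV kernel (ggml_gemv_q4_K_4x8_q8_K) can be used.
#endif
```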
The following structures and functions are implemented:
- block_q4_kx4: built from 4 q4_k blocks, along with an offline repacking function (a sketch of the interleaved layout is given below)
- block_q8_Kx4: used in ggml_quantize_mat_q8_K_4x8()
- ggml_gemv_q4_K_4x8_q8_K(): dotprod kernel
- ggml_gemm_q4_K_4x8_q8_K(): i8mm kernel

For now, q4_k_q8_k repacking has 3 implementations including this PR; here is a brief summary:
Each implementation has its own suitable scenario.
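For illustration, here is a minimal sketch of what a 4-way interleaved q4_K block could look like. The field sizes are assumptions derived from the public block_q4_K definition (QK_K = 256, 12 scale bytes), not taken from this PR's diff:

```c
#include <stdint.h>

#define QK_K 256
typedef uint16_t ggml_half;       // fp16 storage type (simplified here)

// Standard q4_K super-block: 256 4-bit weights with 8 sub-blocks of scales/mins.
typedef struct {
    ggml_half d;                  // super-block scale for quantized scales
    ggml_half dmin;               // super-block scale for quantized mins
    uint8_t   scales[12];         // 6-bit packed scales and mins
    uint8_t   qs[QK_K / 2];       // 4-bit quants
} block_q4_K;

// Hypothetical 4-way interleaved layout: the same fields of 4 blocks taken
// from 4 consecutive rows are stored next to each other, so one SIMD load
// in the GEMV/GEMM kernels covers all 4 rows at once.
typedef struct {
    ggml_half d[4];               // scales of the 4 source blocks
    ggml_half dmin[4];            // mins of the 4 source blocks
    uint8_t   scales[4 * 12];     // interleaved packed scales
    uint8_t   qs[4 * QK_K / 2];   // interleaved 4-bit quants
} block_q4_kx4;
```

Under this assumption, the offline repacking step simply gathers the blocks of 4 consecutive rows into one such struct before inference.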

For this PR, I temporarily put the priority of 4x8 after 8x8 (sketched below):
However, for the performance tests, comprehensive comparisons were conducted among the 3 implementations on the same models. This was done by uncommenting different if branches when choosing tensor_traits for the q4_k tensor type.
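For readers unfamiliar with the repack selection, here is a minimal sketch of the kind of priority logic described above; the function, enum, and flag names are illustrative and are not the actual ggml repack code:

```c
#include <stdint.h>

// Illustrative only: the real selection happens when tensor_traits are chosen
// for a q4_K tensor in ggml's CPU repack code; names here are made up.
enum q4_k_repack_scheme {
    Q4_K_NO_REPACK,
    Q4_K_REPACK_8x8,   // existing i8mm scheme, needs ne[1] % 8 == 0
    Q4_K_REPACK_4x8,   // this PR, needs only ne[1] % 4 == 0
};

static enum q4_k_repack_scheme
choose_q4_k_repack(int64_t ne1, int has_i8mm, int has_dotprod) {
    // Keep the existing 8x8 scheme at higher priority, as described above.
    if (has_i8mm && ne1 % 8 == 0) {
        return Q4_K_REPACK_8x8;
    }
    // Fall back to the new 4x8 scheme for the more general ne[1] % 4 == 0 case.
    if ((has_i8mm || has_dotprod) && ne1 % 4 == 0) {
        return Q4_K_REPACK_4x8;
    }
    return Q4_K_NO_REPACK;   // non-repacked fallback (8x4 case omitted here)
}
```

Commenting out one branch or another, as described above, is then enough to force a particular scheme for benchmarking.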
Test environment
Bench results
(1) meta-llama-3-8b-instruct.Q4_K_M.gguf
[Table: S_PP (t/s) and S_TG (t/s), no repack vs. this PR]
Compared to 8x4/8x8, 4x8 performs better at TG and worse at PP. All three outperform the base (no repack).
(2) DeepSeek-V3-Q4_k_M.gguf
[Table: S_PP (t/s) and S_TG (t/s), no repack vs. this PR]
Compared to 8x4/8x8, 4x8 performs better at TG and worse at PP. All three outperform the base (no repack).
Perplexity
Notes