Conversation

@Alcpz (Contributor) commented Dec 16, 2025

This is similar work to #17494 and #16739.
The primary motivation for the q8_0 repack is to improve prefill/prompt processing performance, which is useful for, e.g., audio models.
Token generation/decode is mixed: slightly better on M4, slightly worse on RPi5, probably device dependent.

This PR implements:

  • ARM-specific GEMV and GEMM kernels for q8_0 (dotprod and i8mm)
  • A generic fallback implementation
  • Repack functions to prepare weight matrices in interleaved format

M4 Max - i8mm path (GGML_NATIVE=ON, native+dotprod+i8mm+nosve+sme)

| Model | Type | Test | t/s (repack) | t/s (no-repack) | Speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 1.2B | Q8_0 | pp256 | 1085.44 | 303.67 | 3.57x |
| lfm2 1.2B | Q8_0 | tg128 | 177.58 | 166.51 | 1.07x |
| lfm2 700M | Q8_0 | pp256 | 1732.73 | 495.01 | 3.50x |
| lfm2 700M | Q8_0 | tg128 | 262.94 | 247.08 | 1.06x |
| qwen3 8B | Q8_0 | pp256 | 160.03 | 43.03 | 3.72x |
| qwen3 8B | Q8_0 | tg128 | 28.75 | 28.71 | 1.00x |

build: 26ef677 (7365)

RPi5 - dotprod path (GGML_NATIVE=ON, cortex-a76+crypto+dotprod+noi8mm+nosve)

| Model | Type | Test | t/s (repack) | t/s (no-repack) | Speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 350M | Q8_0 | pp256 | 360.45 | 182.33 | 1.98x |
| lfm2 350M | Q8_0 | tg128 | 29.77 | 30.15 | 0.99x |
| lfm2 700M | Q8_0 | pp256 | 105.62 | 85.60 | 1.23x |
| lfm2 700M | Q8_0 | tg128 | 14.51 | 14.96 | 0.97x |

Perplexity

Perplexity differs slightly for the generic implementations, but I couldn't find any algorithmic differences other than the order of operations. Generic i8mm (4x8) gets better perplexity than the original implementation, while generic dotprod (4x4) is slightly worse. Repack and no-repack produce the same values.

| Model | i8mm Repack PPL | i8mm No-repack PPL | i8mm Generic PPL |
| --- | --- | --- | --- |
| Llama 3.1 8B | 8.6577 | 8.6645 | 8.6645 |
| Qwen3 8B | 11.1794 | 11.1909 | 10.8830 |
| LFM2 1.2B | 16.7665 | 16.7623 | 15.1945 |

| Model | dotprod Repack PPL | dotprod No-repack PPL | dotprod Generic PPL |
| --- | --- | --- | --- |
| Llama 3.1 8B | 8.2089 | 8.2050 | 8.6645 |
| Qwen3 8B | 10.8830 | 10.8729 | 11.1909 |
| LFM2 1.2B | 15.1945 | 15.1895 | 16.7623 |

Decode

Following the advice from previous PRs, I checked the outputs of the GEMVs; all are under the threshold defined in test-backend-ops (the same one used in #17494 (comment)), though there are some individual weight differences. Looking directly at llama-completion and llama-server, I didn't find artifacts or bad answers.

CI passes locally when routing through the generic path.


Though these are not exactly rigorous tests:

Some llama-cli queries

NO REPACK

▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the enchanting city of lights, is renowned for its iconic landmarks such as the Eiffel Tower and the historic Louvre Museum, attracting millions of visitors from around the globe each year to experience its timeless charm, rich history, and vibrant culture.

[ Prompt: 356.3 t/s | Generation: 140.9 t/s ]

=================================================

i8mm


Give me a sentence about Paris

Paris, the City of Light, is renowned for its stunning architecture, artistic heritage, and romantic ambiance, making it a quintessential destination for travelers worldwide.

[ Prompt: 551.6 t/s | Generation: 118.8 t/s ]

Paris is the capital of?

Yes, Paris is indeed the capital of France. It serves as the political, cultural, and economic heart of the country.

[ Prompt: 562.1 t/s | Generation: 131.3 t/s ]

===============================================================

DOTPROD


Give me a sentence about Paris

Paris, the City of Light, is renowned globally for its rich history, stunning architecture, and cultural treasures like the Eiffel Tower and the Louvre Museum.

[ Prompt: 540.8 t/s | Generation: 119.9 t/s ]

Paris is the capital of?

Paris is the capital of France.

[ Prompt: 552.4 t/s | Generation: 133.3 t/s ]

===============================================================

Generic


Give me a sentence about Paris

Paris, the enchanting capital city of France, is renowned for its timeless charm, iconic landmarks such as the Eiffel Tower, and romantic atmosphere that attracts millions of visitors each year.

[ Prompt: 139.7 t/s | Generation: 240.1 t/s ]

@Alcpz Alcpz requested a review from ggerganov as a code owner December 16, 2025 13:39
@Alcpz commented Dec 16, 2025

@tdakhran @ykhrustalev FYI

@Alcpz Alcpz changed the title Alcpz/cpu q8 0 repack ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) Dec 16, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 16, 2025
@Alcpz commented Dec 17, 2025

@ggerganov I am not certain whether the CI failures are related to this PR; #18103 and #17990 show similar failures:

ubuntu-cpu-cmake-riscv64-native failed at test-tokenizers-ggml-vocabs, and both ggml-ci-x64-cpu-[low|high]-perf runs are failing a couple of tests due to: failed to find ggml_backend_init in .../libggml-cpu.so.

@ggerganov (Member) commented:

These failures are likely related to ccache - not relevant to the changes here.

@ggerganov ggerganov merged commit 669696e into ggml-org:master Dec 17, 2025
63 of 71 checks passed
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 18, 2025
…experimental

Keep changes from ggml-org#18096 without ggml-org#14904
Reason is to maintain compatibility with 2023 w64devkit

# Conflicts:
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/speculative/speculative.cpp
# ggml/src/ggml-cpu/arch-fallback.h
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-cpu/repack.h
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/hvx-utils.c
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c