Conversation

@Alcpz (Contributor) commented Dec 16, 2025

This is similar work to #17494 and #16739.
The primary motivation for the q8_0 repack is to improve prefill/prompt processing performance, which is useful for, e.g., audio models.
Token generation/decode is mixed: slightly better on M4, slightly worse on RPi5, probably device dependent.

This PR implements:

  • ARM-specific GEMV and GEMM kernels for q8_0 (dotprod and i8mm)
  • A generic fallback implementation
  • Repack functions to prepare weight matrices in interleaved format

M4 Max - i8mm path (GGML_NATIVE=ON, native+dotprod+i8mm+nosve+sme)

| Model | Type | Test | t/s (repack) | t/s (no-repack) | Speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 1.2B | Q8_0 | pp256 | 1085.44 | 303.67 | 3.57x |
| lfm2 1.2B | Q8_0 | tg128 | 177.58 | 166.51 | 1.07x |
| lfm2 700M | Q8_0 | pp256 | 1732.73 | 495.01 | 3.50x |
| lfm2 700M | Q8_0 | tg128 | 262.94 | 247.08 | 1.06x |
| qwen3 8B | Q8_0 | pp256 | 160.03 | 43.03 | 3.72x |
| qwen3 8B | Q8_0 | tg128 | 28.75 | 28.71 | 1.00x |

build: 26ef677 (7365)

RPi5 - dotprod path (GGML_NATIVE=ON, cortex-a76+crypto+dotprod+noi8mm+nosve)

| Model | Type | Test | t/s (repack) | t/s (no-repack) | Speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 350M | Q8_0 | pp256 | 360.45 | 182.33 | 1.98x |
| lfm2 350M | Q8_0 | tg128 | 29.77 | 30.15 | 0.99x |
| lfm2 700M | Q8_0 | pp256 | 105.62 | 85.60 | 1.23x |
| lfm2 700M | Q8_0 | tg128 | 14.51 | 14.96 | 0.97x |

Perplexity

Perplexity differs slightly for the generic implementations, but I couldn't find any algorithmic differences other than the order of operations. Generic i8mm (4x8) gets better perplexity than the original implementation, while generic dotprod (4x4) is slightly worse. Repack and no-repack produce the same values.

| Model | i8mm Repack PPL | i8mm No-repack PPL | i8mm Generic PPL |
| --- | --- | --- | --- |
| Llama 3.1 8B | 8.6577 | 8.6645 | 8.6645 |
| Qwen3 8B | 11.1794 | 11.1909 | 10.8830 |
| LFM2 1.2B | 16.7665 | 16.7623 | 15.1945 |

| Model | dotprod Repack PPL | dotprod No-repack PPL | dotprod Generic PPL |
| --- | --- | --- | --- |
| Llama 3.1 8B | 8.2089 | 8.2050 | 8.6645 |
| Qwen3 8B | 10.8830 | 10.8729 | 11.1909 |
| LFM2 1.2B | 15.1945 | 15.1895 | 16.7623 |

Decode

Following the advice from previous PRs, I checked the outputs of the GEMVs; all are under the threshold defined in test-backend-ops (the same one used in #17494 (comment)), though there are some individual weight differences. Looking directly at llama-completion and llama-server, I didn't find artifacts or bad answers.

CI passes locally when routing through the generic path.


Though these are not exactly rigorous tests:

Some llama-cli queries

NO REPACK

▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the enchanting city of lights, is renowned for its iconic landmarks such as the Eiffel Tower and the historic Louvre Museum, attracting millions of visitors from around the globe each year to experience its timeless charm, rich history, and vibrant culture.

[ Prompt: 356.3 t/s | Generation: 140.9 t/s ]

=================================================

i8mm


Give me a sentence about Paris

Paris, the City of Light, is renowned for its stunning architecture, artistic heritage, and romantic ambiance, making it a quintessential destination for travelers worldwide.

[ Prompt: 551.6 t/s | Generation: 118.8 t/s ]

Paris is the capital of?

Yes, Paris is indeed the capital of France. It serves as the political, cultural, and economic heart of the country.

[ Prompt: 562.1 t/s | Generation: 131.3 t/s ]

===============================================================

DOTPROD


Give me a sentence about Paris

Paris, the City of Light, is renowned globally for its rich history, stunning architecture, and cultural treasures like the Eiffel Tower and the Louvre Museum.

[ Prompt: 540.8 t/s | Generation: 119.9 t/s ]

Paris is the capital of?

Paris is the capital of France.

[ Prompt: 552.4 t/s | Generation: 133.3 t/s ]

===============================================================

Generic


Give me a sentence about Paris

Paris, the enchanting capital city of France, is renowned for its timeless charm, iconic landmarks such as the Eiffel Tower, and romantic atmosphere that attracts millions of visitors each year.

[ Prompt: 139.7 t/s | Generation: 240.1 t/s ]

@Alcpz Alcpz requested a review from ggerganov as a code owner December 16, 2025 13:39
@Alcpz commented Dec 16, 2025

@tdakhran @ykhrustalev FYI

@Alcpz Alcpz changed the title Alcpz/cpu q8 0 repack ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) Dec 16, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 16, 2025
@Alcpz commented Dec 17, 2025

@ggerganov I am not certain whether the CI failures are related to this PR; #18103 and #17990 show similar failures:

ubuntu-cpu-cmake-riscv64-native failed at test-tokenizers-ggml-vocabs, and both ggml-ci-x64-cpu-[low|high]-perf runs are failing a couple of tests due to: failed to find ggml_backend_init in .../libggml-cpu.so.

@ggerganov (Member) commented:

These failures are likely related to ccache - not relevant to the changes here.

@ggerganov ggerganov merged commit 669696e into ggml-org:master Dec 17, 2025
63 of 71 checks passed
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 18, 2025
…experimental

Keep changes from ggml-org#18096 without ggml-org#14904
Reason is to maintain compatibility with 2023 w64devkit

# Conflicts:
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/speculative/speculative.cpp
# ggml/src/ggml-cpu/arch-fallback.h
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-cpu/repack.h
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/hvx-utils.c
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c