-
Notifications
You must be signed in to change notification settings - Fork 14.2k
ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) #18096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
+605
−0
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Contributor
Author
ggerganov
approved these changes
Dec 16, 2025
Contributor
Author
|
@ggerganov I am not certain if the CI failures are related to the PR, #18103 and #17990 present similar failures:
|
Member
|
These failures are likely related to ccache - not relevant to the changes here. |
LostRuins
added a commit
to LostRuins/koboldcpp
that referenced
this pull request
Dec 18, 2025
…experimental Keep changes from ggml-org#18096 without ggml-org#14904 Reason is to maintain compatibility with 2023 w64devkit # Conflicts: # .github/ISSUE_TEMPLATE/019-bug-misc.yml # examples/model-conversion/scripts/causal/run-org-model.py # examples/speculative/speculative.cpp # ggml/src/ggml-cpu/arch-fallback.h # ggml/src/ggml-cpu/repack.cpp # ggml/src/ggml-cpu/repack.h # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/act-ops.c # ggml/src/ggml-hexagon/htp/htp-msg.h # ggml/src/ggml-hexagon/htp/hvx-utils.c # ggml/src/ggml-hexagon/htp/hvx-utils.h # ggml/src/ggml-hexagon/htp/main.c
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is similar work to #17494 and #16739.
The primary motivation for q8_0 repack is to optimize prefill/prompt processing performance, which can be useful for, for example, audio models.
Token generation/decode is mixed (It's slightly better in M4, and slightly worse on RPI5, probably device dependent).
This PR implements:
M4 max - i8mm path (GGML_NATIVE=ON native+dotprod+i8mm+nosve+sme)
build: 26ef677 (7365)
Rpi5 - dotprod path (GGML_NATIVE=ON cortex-a76+crypto+dotprod+noi8mm+nosve)
Perplexity
Perplexity is slightly different for the generic implementations, but I don't find any algorithmic differences except for the order of operations. Generic i8mm (4x8) gets better perplexity than the original implementation, while generic dotprod (4x4) is slightly worse. Repack vs no-repack has the same values.
Decode
Following the advise from previous PRs I checked the outputs of GEMVs and all are under the threshold defined in
test-backend-ops(same as the one used here: #17494 (comment)), there are some individual weight differences. Looking directly at llama-completion and llama-server, I didn't find artifacts or bad answers.Ci passes locally when routing through the generic path.
Thought these are not exactly accurate tests:
Some llama-cli queries
NO REPACK
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
Paris, the enchanting city of lights, is renowned for its iconic landmarks such as the Eiffel Tower and the historic Louvre Museum, attracting millions of visitors from around the globe each year to experience its timeless charm, rich history, and vibrant culture.
[ Prompt: 356.3 t/s | Generation: 140.9 t/s ]
=================================================
i8mm
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
Paris, the City of Light, is renowned for its stunning architecture, artistic heritage, and romantic ambiance, making it a quintessential destination for travelers worldwide.
[ Prompt: 551.6 t/s | Generation: 118.8 t/s ]
Yes, Paris is indeed the capital of France. It serves as the political, cultural, and economic heart of the country.
[ Prompt: 562.1 t/s | Generation: 131.3 t/s ]
===============================================================
DOTPROD
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
Paris, the City of Light, is renowned globally for its rich history, stunning architecture, and cultural treasures like the Eiffel Tower and the Louvre Museum.
[ Prompt: 540.8 t/s | Generation: 119.9 t/s ]
Paris is the capital of France.
[ Prompt: 552.4 t/s | Generation: 133.3 t/s ]
===============================================================
Generic
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
Paris, the enchanting capital city of France, is renowned for its timeless charm, iconic landmarks such as the Eiffel Tower, and romantic atmosphere that attracts millions of visitors each year.
[ Prompt: 139.7 t/s | Generation: 240.1 t/s ]