Conversation

Contributor

@NoahOksuz NoahOksuz commented Nov 2, 2025

Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps.

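The failure mode described above can be illustrated with a small sketch. This is not the actual repack.cpp code; `NB_COLS`, `pick_nchunk`, and `chunk_range` are illustrative names chosen here, and the real implementation differs in detail. The idea is the one the description states: cap the chunk count so no chunk falls below `NB_COLS` rows, and snap every interior chunk boundary to a `NB_COLS` multiple so adjacent chunks can never overlap.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical column-block size each chunk must align to.
constexpr int64_t NB_COLS = 8;

// Cap the chunk count: never more chunks than threads, and never so
// many that a chunk would hold fewer than NB_COLS rows. (The buggy
// path forced nchunk == nth, producing tiny chunks whose boundaries,
// once rounded to NB_COLS, could overlap their neighbors.)
static int64_t pick_nchunk(int64_t nrows, int nth) {
    int64_t nchunk = std::min<int64_t>(nth, nrows / NB_COLS);
    return std::max<int64_t>(nchunk, 1);
}

// Half-open row range [first, last) for chunk i of nchunk. The base
// chunk size is rounded DOWN to a NB_COLS multiple, so every interior
// boundary is aligned and chunks are disjoint; the last chunk absorbs
// the remainder rows.
static void chunk_range(int64_t nrows, int64_t nchunk, int64_t i,
                        int64_t *first, int64_t *last) {
    const int64_t per = (nrows / nchunk / NB_COLS) * NB_COLS;
    *first = i * per;
    *last  = (i == nchunk - 1) ? nrows : (i + 1) * per;
}
```

For example, 100 rows with 32 threads yields 12 chunks of 8 aligned rows each, with the final chunk taking rows 88..99; every boundary is shared exactly once, so no two threads write the same row.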
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 3, 2025
@NoahOksuz
Contributor Author

#16942

@catap

catap commented Nov 3, 2025

I can't reproduce #16960 with this fix anymore.

Member

@ggerganov ggerganov left a comment

Think this is OK, but would be nice if @max-krasnyansky can take a look as well.

NoahOksuz and others added 2 commits November 3, 2025 14:17
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@max-krasnyansky
Collaborator

Hmm. The change looks good but I'm seeing a significant regression in token rates from my earlier testing.
(around 10 t/s lower on the prompt). Reverted to my original commit and the token rates are same as before.
It's possible that some other changes merged since then are causing that. I'm looking into it ...

@max-krasnyansky
Collaborator


False alarm. Merging ...

@max-krasnyansky max-krasnyansky merged commit 1f5accb into ggml-org:master Nov 4, 2025
65 of 69 checks passed
GittyBurstein pushed a commit to yael-works/llama.cpp that referenced this pull request Nov 5, 2025
* Fix garbled output with REPACK at high thread counts

Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps.

* Update ggml/src/ggml-cpu/repack.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/repack.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 5, 2025
* origin/master: (21 commits)
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (ggml-org#16919)
examples(gguf): GGUF example outputs (ggml-org#17025)
mtmd: allow QwenVL to process larger image by default (ggml-org#17020)
server : do not default to multiple slots with speculative decoding (ggml-org#17017)
mtmd: improve struct initialization (ggml-org#16981)
docs: Clarify the endpoint that webui uses (ggml-org#17001)
model : add openPangu-Embedded (ggml-org#16941)
ggml webgpu: minor set rows optimization (ggml-org#16810)
sync : ggml
ggml : fix conv2d_dw SVE path (ggml/1380)
CUDA: update ops.md (ggml-org#17005)
opencl: update doc (ggml-org#17011)
refactor: replace sprintf with snprintf for safer string handling in dump functions (ggml-org#16913)
vulkan: remove the need for the dryrun (ggml-org#16826)
server : do context shift only while generating (ggml-org#17000)
readme : update hot topics (ggml-org#17002)
ggml-cpu : bicubic interpolation (ggml-org#16891)
ci : apply model label to models (ggml-org#16994)
chore : fix models indent after refactor (ggml-org#16992)
Fix garbled output with REPACK at high thread counts (ggml-org#16956)
...
@joseph777111

Thank you! This fixed Unsloth's Qwen3-VL-30B-A3B-Instruct-1M GGUF quants. Also, this has made other quants more coherent, including Wayfarer-2. Great work, guys! I think this will positively affect more GGUF quants than we can rightly quantify. For reference: I tested with the METAL backend on an M1 MacBook Pro 16GB (Unified Memory).


Labels

ggml changes relating to the ggml tensor library for machine learning


5 participants