Name and Version
C:\llm\llama-cpp\VULKAN\b8149>llama-cli --version
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
version: 8149 (a96a112)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-bench
Command line
llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q6_K.gguf -ngl 100 -fa 0,1 --mmap 1
llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q4_K_M.gguf -ngl 100 -fa 0,1 --mmap 1
llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q8_0.gguf -ngl 100 -fa 0,1 --mmap 1
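To compare the three quants on identical hardware, the commands above can be scripted so that every run sees the same set of GPUs. This is a minimal sketch, not part of llama.cpp. It assumes `llama-bench` is on `PATH` and reuses the `GGML_VK_VISIBLE_DEVICES=2,3` setting from the runs in this report. `build_command`, `build_env` and `run_all` are hypothetical helper names. With `dry_run=True` it only prints the command lines.

```python
import os
import subprocess
from pathlib import PureWindowsPath

# Model directory and file names copied from the commands in this report.
MODEL_DIR = PureWindowsPath(r"T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF")
QUANTS = ["Q4_K_M", "Q6_K", "Q8_0"]

# Assumption: pin every run to the same two GPUs so the quant is the only variable.
# "2,3" is the value used in this report; adjust it for your system.
VISIBLE_DEVICES = "2,3"


def build_command(quant: str) -> list[str]:
    """Return the llama-bench argument list used in this report for one quant."""
    model = str(MODEL_DIR / f"Qwen3.5-35B-A3B-{quant}.gguf")
    return ["llama-bench", "-m", model, "-ngl", "100", "-fa", "0,1", "--mmap", "1"]


def build_env(devices: str) -> dict[str, str]:
    """Copy the current environment and restrict Vulkan to the given device indices."""
    env = dict(os.environ)
    env["GGML_VK_VISIBLE_DEVICES"] = devices
    return env


def run_all(dry_run: bool = True) -> list[list[str]]:
    """Print (dry run) or execute one llama-bench run per quant on the same devices."""
    commands = [build_command(q) for q in QUANTS]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, env=build_env(VISIBLE_DEVICES), check=True)
    return commands


if __name__ == "__main__":
    run_all(dry_run=True)
```

Setting the variable per process through `env=` avoids relying on a `set` left over in the interactive shell.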
Problem description & steps to reproduce
Prompt processing (pp512) is much faster with the Q8_0 model than with the smaller Q4_K_M and Q6_K quants:
roughly 612 t/s for Q8_0 vs. 243 t/s for Q4_K_M and 175 t/s for Q6_K.
Windows 11
Intel Arc A770 (16 GB) x4
Intel Xeon E5-2699 v3 x2
Driver: 8509
Models: https://huggingface.co/lmstudio-community/Qwen3.5-35B-A3B-GGUF
The first run below (Q8_0) uses 3x A770.
The second and third runs (Q6_K, Q4_K_M) use 2x A770, selected with GGML_VK_VISIBLE_DEVICES=2,3.
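A quick check of the reported numbers, using the fa = 0 rows from the llama-bench tables in this report. Note that Q8_0 was measured on 3 GPUs while Q6_K and Q4_K_M were measured on 2, so these ratios mix the effect of the quant with the effect of the device count. The sketch also shows that tg128 follows the usual order (smaller quant is faster) while pp512 does not, which is the anomaly reported here. `RESULTS` and `pp_ratio` are illustrative names.

```python
# quant: (GPUs used, pp512 t/s, tg128 t/s), fa = 0, copied from the tables in this report
RESULTS = {
    "Q8_0": (3, 612.14, 38.88),
    "Q6_K": (2, 175.02, 44.18),
    "Q4_K_M": (2, 242.54, 49.18),
}


def pp_ratio(a: str, b: str) -> float:
    """How many times faster quant `a` processes the prompt than quant `b`."""
    return RESULTS[a][1] / RESULTS[b][1]


for other in ("Q6_K", "Q4_K_M"):
    print(f"pp512 Q8_0 / {other}: {pp_ratio('Q8_0', other):.2f}x")

# Rank quants by each metric, fastest first.
by_pp = sorted(RESULTS, key=lambda q: RESULTS[q][1], reverse=True)
by_tg = sorted(RESULTS, key=lambda q: RESULTS[q][2], reverse=True)
print("fastest pp512 first:", by_pp)  # ['Q8_0', 'Q4_K_M', 'Q6_K']
print("fastest tg128 first:", by_tg)  # ['Q4_K_M', 'Q6_K', 'Q8_0']
```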

C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q8_0.gguf -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 100 | 0 | 1 | pp512 | 612.14 ± 1.46 |
| qwen35moe ?B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 100 | 0 | 1 | tg128 | 38.88 ± 0.06 |
| qwen35moe ?B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 100 | 1 | 1 | pp512 | 614.43 ± 2.65 |
| qwen35moe ?B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 100 | 1 | 1 | tg128 | 37.82 ± 0.04 |
build: a96a1120b (8149)
C:\llm\llama-cpp\VULKAN\b8149>set GGML_VK_VISIBLE_DEVICES=2,3
C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q6_K.gguf -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q6_K | 26.55 GiB | 34.66 B | Vulkan | 100 | 0 | 1 | pp512 | 175.02 ± 1.64 |
| qwen35moe ?B Q6_K | 26.55 GiB | 34.66 B | Vulkan | 100 | 0 | 1 | tg128 | 44.18 ± 0.12 |
| qwen35moe ?B Q6_K | 26.55 GiB | 34.66 B | Vulkan | 100 | 1 | 1 | pp512 | 174.17 ± 2.50 |
| qwen35moe ?B Q6_K | 26.55 GiB | 34.66 B | Vulkan | 100 | 1 | 1 | tg128 | 42.94 ± 0.06 |
build: a96a1120b (8149)
C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q4_K_M.gguf -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 100 | 0 | 1 | pp512 | 242.54 ± 1.90 |
| qwen35moe ?B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 100 | 0 | 1 | tg128 | 49.18 ± 0.07 |
| qwen35moe ?B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 100 | 1 | 1 | pp512 | 240.96 ± 4.06 |
| qwen35moe ?B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 100 | 1 | 1 | tg128 | 47.45 ± 0.03 |
build: a96a1120b (8149)
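When comparing more runs, the pasted llama-bench tables can be parsed instead of read by eye. This is a minimal sketch that assumes the nine-column layout shown in this report (model, size, params, backend, ngl, fa, mmap, test, t/s). It accepts both `±` and `+` as the mean/stddev separator, since copies of this output sometimes lose the `±` character. `parse_llama_bench` is a hypothetical helper, not a llama.cpp tool.

```python
import re

ROW = re.compile(r"^\|(.+)\|\s*$")


def parse_llama_bench(text: str) -> list[dict]:
    """Extract model, fa, test, mean and stddev from llama-bench markdown rows."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # not a table row (logs, build line, prompt)
        cells = [c.strip() for c in m.group(1).split("|")]
        # Skip the header row and the "----:" separator row.
        if cells[0] == "model" or set(cells[0]) <= set("-: "):
            continue
        model, _size, _params, _backend, _ngl, fa, _mmap, test, ts = cells
        mean, std = re.split(r"\s*[±+]\s*", ts)
        rows.append({"model": model, "fa": int(fa), "test": test,
                     "mean": float(mean), "std": float(std)})
    return rows


SAMPLE = """\
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 100 | 0 | 1 | pp512 | 242.54 ± 1.90 |
| qwen35moe ?B Q4_K - Medium | 19.71 GiB | 34.66 B | Vulkan | 100 | 0 | 1 | tg128 | 49.18 ± 0.07 |
"""

for row in parse_llama_bench(SAMPLE):
    print(row["model"], row["fa"], row["test"], row["mean"], row["std"])
```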