Misc. bug: [VULKAN] [Intel] Qwen3-Coder-30B-A3B Low PP performance of Q4 and Q6 quants compared to Q8 on Intel Arc A770 #19887

### Name and Version

```
C:\llm\llama-cpp\VULKAN\b8149>llama-cli --version
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
version: 8149 (a96a1120b)
built with Clang 19.1.5 for Windows x86_64
```

### Operating systems

Windows

### Which llama.cpp modules do you know to be affected?

llama-bench

### Command line

```shell
llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q6_K.gguf  -ngl 100 -fa 0,1 --mmap 1

llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q4_K_M.gguf  -ngl 100 -fa 0,1 --mmap 1

llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q8_0.gguf  -ngl 100 -fa 0,1 --mmap 1
```
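The three commands above were not run under identical conditions: per the log below, the Q8_0 run saw three Vulkan devices, while the Q6_K and Q4_K_M runs were restricted to two with `GGML_VK_VISIBLE_DEVICES=2,3`. Below is a minimal Python sketch, assuming `llama-bench` is on `PATH` (or in the current directory) and the model paths from this report, that pins all three quants to the same device set so the PP numbers are directly comparable. `build_runs` and `run_all` are hypothetical helper names; the `llama-bench` flags are exactly the ones used above.

```python
import os
import subprocess

# Model directory and quant names copied from this report; adjust to your layout.
MODEL_DIR = r"T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF"
QUANTS = ["Q8_0", "Q6_K", "Q4_K_M"]


def build_runs(devices="2,3", model_dir=MODEL_DIR, quants=QUANTS):
    """Return one (argv, env) pair per quant, all pinned to the same Vulkan devices."""
    # Copy the current environment and force the same device mask for every run.
    env = dict(os.environ, GGML_VK_VISIBLE_DEVICES=devices)
    runs = []
    for q in quants:
        model = model_dir + "\\" + f"Qwen3.5-35B-A3B-{q}.gguf"
        argv = ["llama-bench", "-m", model, "-ngl", "100", "-fa", "0,1", "--mmap", "1"]
        runs.append((argv, env))
    return runs


def run_all(devices="2,3"):
    """Actually execute the benchmarks (requires llama-bench and the models)."""
    for argv, env in build_runs(devices):
        subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for argv, env in build_runs():
        print("GGML_VK_VISIBLE_DEVICES=" + env["GGML_VK_VISIBLE_DEVICES"], " ".join(argv))
```

Calling `run_all("0,1,2")` instead would repeat all three quants on three GPUs, which isolates the quant type as the only variable.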

### Problem description & steps to reproduce

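The prompt-processing (PP) speed of the Q8_0 quant is roughly three times that of the Q4_K_M and Q6_K quants, even though Q8_0 is the largest file, while token generation (tg128) is faster for the smaller quants as expected.
**pp512: ~610 t/s (Q8_0) vs ~175 t/s (Q6_K) and ~240 t/s (Q4_K_M)**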

- OS: Windows 11
- GPUs: 4x Intel Arc A770 (16 GB)
- CPUs: 2x Intel Xeon E5-2699 v3
- Driver: 8509

Models: https://huggingface.co/lmstudio-community/Qwen3.5-35B-A3B-GGUF

The first test (Q8_0) uses 3x A770; the following tests (Q6_K and Q4_K_M) use 2x A770, selected with `GGML_VK_VISIBLE_DEVICES=2,3`.
<img width="1378" height="757" alt="Image" src="https://github.com/user-attachments/assets/5c976ab3-d049-4c84-8e54-9124c12f627d" />

```
C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q8_0.gguf  -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        612.14 ± 1.46 |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           tg128 |         38.88 ± 0.06 |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        614.43 ± 2.65 |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           tg128 |         37.82 ± 0.04 |

build: a96a1120b (8149)

C:\llm\llama-cpp\VULKAN\b8149>set GGML_VK_VISIBLE_DEVICES=2,3

C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q6_K.gguf  -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        175.02 ± 1.64 |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           tg128 |         44.18 ± 0.12 |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        174.17 ± 2.50 |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           tg128 |         42.94 ± 0.06 |

build: a96a1120b (8149)

C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q4_K_M.gguf  -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        242.54 ± 1.90 |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           tg128 |         49.18 ± 0.07 |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        240.96 ± 4.06 |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           tg128 |         47.45 ± 0.03 |

build: a96a1120b (8149)
```
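To quantify the gap, the pp512 rows of the tables above can be averaged per model (over `fa` = 0 and 1) and divided by the Q8_0 figure. Below is a rough Python sketch, assuming the 9-column markdown layout that `llama-bench` printed here; `parse_rows` and `pp_ratio` are hypothetical helpers, not part of llama.cpp. With the numbers in this report, Q6_K comes out at about 0.28x and Q4_K_M at about 0.39x of Q8_0. Keep in mind that the Q8_0 rows were measured on three GPUs and the others on two, so these ratios mix the quant effect with the device-count effect.

```python
import re

# pp512 rows copied from the llama-bench output in this report (fa = 0 and 1).
SAMPLE = """\
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        612.14 ± 1.46 |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        614.43 ± 2.65 |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        175.02 ± 1.64 |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        174.17 ± 2.50 |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        242.54 ± 1.90 |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        240.96 ± 4.06 |
"""


def parse_rows(text):
    """Extract model, fa, test and mean t/s from llama-bench markdown table rows."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Expected columns: model, size, params, backend, ngl, fa, mmap, test, t/s
        if len(cells) != 9 or not cells[7].startswith(("pp", "tg")):
            continue  # skips header, separator and non-table lines
        # The last cell is "mean ± stddev"; some copies render the ± sign as "+".
        mean = float(re.split(r"\s*[±+]\s*", cells[8])[0])
        rows.append({"model": cells[0], "fa": int(cells[5]), "test": cells[7], "tps": mean})
    return rows


def pp_ratio(rows, test="pp512", baseline="Q8_0"):
    """Average t/s per model for one test and its ratio to the baseline model."""
    by_model = {}
    for r in rows:
        if r["test"] == test:
            by_model.setdefault(r["model"], []).append(r["tps"])
    means = {m: sum(v) / len(v) for m, v in by_model.items()}
    base = next(v for m, v in means.items() if baseline in m)
    return {m: (v, v / base) for m, v in means.items()}


if __name__ == "__main__":
    for model, (tps, rel) in pp_ratio(parse_rows(SAMPLE)).items():
        print(f"{model:28s} {tps:8.2f} t/s  {rel:.2f}x of Q8_0")
```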
