
Misc. bug: [VULKAN] [Intel] Qwen3-Coder-30B-A3B Low PP performance of Q4 and Q6 quants compared to Q8 on Intel Arc A770  #19887

@savvadesogle

Description


Name and Version

C:\llm\llama-cpp\VULKAN\b8149>llama-cli --version
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
version: 8149 (a96a112)
built with Clang 19.1.5 for Windows x86_64
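Each `ggml_vulkan:` device line packs the GPU's reported capabilities into `key: value` fields separated by `|`; all three A770s here report `int dot: 1` and `matrix cores: none`. When comparing logs from several machines or builds, a small parser saves squinting at them. The sketch below is a hypothetical helper (not part of llama.cpp) and assumes the line format stays exactly as printed above.

```python
import re

# Hypothetical helper; assumes the exact line format printed by ggml_vulkan above:
# "ggml_vulkan: <index> = <device name> | key: value | key: value | ..."
DEVICE_RE = re.compile(r"^ggml_vulkan: (\d+) = (.+?) \| (.*)$")


def parse_vulkan_device(line: str) -> dict:
    """Split one ggml_vulkan device line into a dict of its fields."""
    m = DEVICE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a ggml_vulkan device line: {line!r}")
    info = {"index": int(m.group(1)), "name": m.group(2)}
    for field in m.group(3).split("|"):
        key, _, value = field.partition(":")
        info[key.strip()] = value.strip()
    return info


line = ("ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | "
        "fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none")
dev = parse_vulkan_device(line)
print(dev["index"], dev["name"], dev["int dot"], dev["matrix cores"])
```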

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q6_K.gguf  -ngl 100 -fa 0,1 --mmap 1

llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q4_K_M.gguf  -ngl 100 -fa 0,1 --mmap 1

llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q8_0.gguf  -ngl 100 -fa 0,1 --mmap 1
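Note that the Q8_0 run saw three Vulkan devices while the Q6_K and Q4_K_M runs saw two (after `set GGML_VK_VISIBLE_DEVICES=2,3`), so the GPU count differs between quants. A minimal sketch for re-running all three quants with the same device set, assuming `llama-bench` is on `PATH` and using only the flags from the commands above; `bench_command` and `run_sweep` are hypothetical names.

```python
import ntpath
import os
import subprocess
from typing import List, Optional

# Model directory and file naming taken from the commands in this report.
MODEL_DIR = r"T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF"
QUANTS = ["Q8_0", "Q6_K", "Q4_K_M"]


def bench_command(quant: str) -> List[str]:
    """Build the llama-bench argument list used in this report for one quant."""
    model = ntpath.join(MODEL_DIR, f"Qwen3.5-35B-A3B-{quant}.gguf")
    return ["llama-bench", "-m", model, "-ngl", "100", "-fa", "0,1", "--mmap", "1"]


def run_sweep(visible_devices: Optional[str] = None) -> None:
    """Hypothetical helper: run every quant with the same GGML_VK_VISIBLE_DEVICES value."""
    env = dict(os.environ)
    if visible_devices is not None:
        env["GGML_VK_VISIBLE_DEVICES"] = visible_devices
    for quant in QUANTS:
        subprocess.run(bench_command(quant), env=env, check=True)


# Only print the commands here; actually running them needs llama-bench on PATH.
for quant in QUANTS:
    print(" ".join(bench_command(quant)))
```

Calling `run_sweep("2,3")` would then benchmark every quant on the same two GPUs, which removes the device count as a variable.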

Problem description & steps to reproduce

The prompt processing (PP) speed of the Q8_0 model is much higher than that of the Q4_K_M and Q6_K quants:
about 612 t/s for Q8_0 vs. about 175 t/s (Q6_K) and 242 t/s (Q4_K_M) on pp512.
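The gap can be quantified directly from the pp512 rows (fa = 0) of the llama-bench tables in this report. Keep in mind that the Q8_0 number was measured on three GPUs and the other two on two GPUs. A minimal check of the ratios:

```python
# pp512 throughput in tokens/s, copied from the fa = 0 rows of the llama-bench tables.
pp512 = {"Q8_0": 612.14, "Q6_K": 175.02, "Q4_K_M": 242.54}


def speedup(baseline: float, other: float) -> float:
    """How many times faster `baseline` is than `other`."""
    return baseline / other


for quant in ("Q6_K", "Q4_K_M"):
    print(f"Q8_0 vs {quant}: {speedup(pp512['Q8_0'], pp512[quant]):.2f}x")
```

This gives roughly 3.5x over Q6_K and 2.5x over Q4_K_M.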

Windows 11
4x Intel Arc A770 (16 GB)
2x Intel Xeon E5-2699 v3
Driver: 8509

Models: https://huggingface.co/lmstudio-community/Qwen3.5-35B-A3B-GGUF

The first run below (Q8_0) uses 3x A770.
The second and third runs (Q6_K and Q4_K_M) use 2x A770, selected with GGML_VK_VISIBLE_DEVICES=2,3.

C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q8_0.gguf  -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        612.14 ± 1.46 |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           tg128 |         38.88 ± 0.06 |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        614.43 ± 2.65 |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           tg128 |         37.82 ± 0.04 |

build: a96a1120b (8149)
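When collecting more results (other quants, device counts, or builds), parsing the llama-bench markdown table is less error-prone than copying numbers by hand. A minimal sketch, assuming the nine-column layout shown above; `parse_bench_table` and `BenchRow` are hypothetical names, and the regex accepts either `±` or `+` between the mean and the standard deviation, since that sign is easily mangled when pasting.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class BenchRow:
    model: str
    fa: int
    test: str
    mean: float  # tokens/s
    std: float   # standard deviation, tokens/s


# Mean and std in the t/s column, e.g. "612.14 ± 1.46" (also tolerates "+").
TS_RE = re.compile(r"([\d.]+)\s*[±+]\s*([\d.]+)")


def parse_bench_table(text: str) -> List[BenchRow]:
    """Extract data rows from llama-bench markdown output (9 columns assumed)."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip non-table lines, the header row, and the |---| separator row.
        if len(cells) != 9 or cells[0] == "model" or cells[0].startswith("-"):
            continue
        m = TS_RE.fullmatch(cells[8])
        if m is None:
            continue
        rows.append(BenchRow(cells[0], int(cells[5]), cells[7],
                             float(m.group(1)), float(m.group(2))))
    return rows


sample = """\
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        612.14 ± 1.46 |
| qwen35moe ?B Q8_0              |  34.36 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           tg128 |         38.88 ± 0.06 |
build: a96a1120b (8149)
"""

for row in parse_bench_table(sample):
    print(row.model, row.fa, row.test, row.mean, row.std)
```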

C:\llm\llama-cpp\VULKAN\b8149>set GGML_VK_VISIBLE_DEVICES=2,3

C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q6_K.gguf  -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        175.02 ± 1.64 |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           tg128 |         44.18 ± 0.12 |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        174.17 ± 2.50 |
| qwen35moe ?B Q6_K              |  26.55 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           tg128 |         42.94 ± 0.06 |

build: a96a1120b (8149)

C:\llm\llama-cpp\VULKAN\b8149>llama-bench -m T:\models\lmstudio-community\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-Q4_K_M.gguf  -ngl 100 -fa 0,1 --mmap 1
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b8149\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           pp512 |        242.54 ± 1.90 |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  0 |    1 |           tg128 |         49.18 ± 0.07 |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           pp512 |        240.96 ± 4.06 |
| qwen35moe ?B Q4_K - Medium     |  19.71 GiB |    34.66 B | Vulkan     | 100 |  1 |    1 |           tg128 |         47.45 ± 0.03 |

build: a96a1120b (8149)
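The `size` and `params` columns also give the effective storage cost of each quant. Converting size from GiB to bits and dividing by the parameter count shows roughly 8.5 bits per weight for Q8_0, 6.6 for Q6_K and 4.9 for Q4_K_M: the K-quants hold less weight data than Q8_0 yet process prompts more slowly, while tg128 does rise as the file shrinks (38.88, 44.18, 49.18 t/s with fa = 0). A short sketch of that arithmetic, using only numbers from the tables above (the GPU-count difference between runs still applies):

```python
GIB = 1024 ** 3  # llama-bench reports model size in GiB

# Model size (GiB) per quant and total parameter count, from the tables above.
SIZE_GIB = {"Q8_0": 34.36, "Q6_K": 26.55, "Q4_K_M": 19.71}
PARAMS = 34.66e9


def bits_per_weight(size_gib: float, params: float) -> float:
    """Average storage bits per parameter for a model file of the given size."""
    return size_gib * GIB * 8 / params


for quant, size in SIZE_GIB.items():
    print(f"{quant}: {bits_per_weight(size, PARAMS):.2f} bits/weight")
```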
