Misc. bug: Vulkan Q4_K_M inference speed degradation #11559

Closed
@neilmehta24

Description

Name and Version

llama.cpp version: 4490 (adc5dd9)

Windows 11 Pro
dual AMD Radeon PRO W7800
Vulkan SDK version: 1.3.283

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

git checkout f11cfdfd
cmake -B build-f11cfdfd -DGGML_VULKAN=ON
cmake --build .\build-f11cfdfd\ --config Release
.\build-f11cfdfd\bin\Release\llama-cli.exe -no-cnv -m "C:\Users\User\.cache\lm-studio\models\lmstudio-community\Qwen2.5-14B-Instruct-GGUF\Qwen2.5-14B-Instruct-Q4_K_M.gguf" -ngl 99 --seed 0 --temp 0 -p "<|im_start|>user
>> Tell me a 100 word story<|im_end|>
>> <|im_start|>assistant
>> "

git checkout adc5dd92
cmake -B build-adc5dd92 -DGGML_VULKAN=ON
cmake --build .\build-adc5dd92\ --config Release
.\build-adc5dd92\bin\Release\llama-cli.exe -no-cnv -m "C:\Users\User\.cache\lm-studio\models\lmstudio-community\Qwen2.5-14B-Instruct-GGUF\Qwen2.5-14B-Instruct-Q4_K_M.gguf" -ngl 99 --seed 0 --temp 0 -p "<|im_start|>user
>> Tell me a 100 word story<|im_end|>
>> <|im_start|>assistant
>> "

Problem description & steps to reproduce

Build llama.cpp at the two commits above and run the same llama-cli command against a Q4_K_M model. Prompt processing speed is essentially unchanged (~103 tokens/s at both commits), but generation (eval) speed drops from ~42 tokens/s at f11cfdfd to ~36 tokens/s at adc5dd92; see the log output below.

First Bad Commit

adc5dd9
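
For anyone re-confirming this on other hardware: if the two commits are not adjacent in history, a standard git bisect between them should land on the same commit (the build and run steps at each bisect point are the ones from the command line above):

git bisect start
git bisect bad adc5dd92
git bisect good f11cfdfd
# at each bisect point: rebuild with -DGGML_VULKAN=ON, rerun the
# llama-cli command above, and mark the commit by its eval speed:
#   ~42 tokens/s -> git bisect good
#   ~36 tokens/s -> git bisect bad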

Relevant log output

f11cfdfd:
llama_perf_sampler_print:    sampling time =       8.70 ms /   108 runs   (    0.08 ms per token, 12410.94 tokens per second)
llama_perf_context_print:        load time =    5022.91 ms
llama_perf_context_print: prompt eval time =     164.06 ms /    17 tokens (    9.65 ms per token,   103.62 tokens per second)
llama_perf_context_print:        eval time =    2136.73 ms /    90 runs   (   23.74 ms per token,    42.12 tokens per second)
llama_perf_context_print:        total time =    2320.98 ms /   107 tokens

adc5dd92:
llama_perf_sampler_print:    sampling time =       8.74 ms /   108 runs   (    0.08 ms per token, 12356.98 tokens per second)
llama_perf_context_print:        load time =    5052.59 ms
llama_perf_context_print: prompt eval time =     164.99 ms /    17 tokens (    9.71 ms per token,   103.04 tokens per second)
llama_perf_context_print:        eval time =    2473.68 ms /    90 runs   (   27.49 ms per token,    36.38 tokens per second)
llama_perf_context_print:       total time =    2659.04 ms /   107 tokens

Additional information:

I tested a few other models and observed the same Q4_K_M degradation across several architectures, including Qwen2.5 7B, Qwen2.5 14B, and Command R v01. I also have an unconfirmed report of Phi 4 degradation. Smaller models such as Qwen2.5 0.5B did not show the degradation. In the logs above, per-token eval time rises from 23.74 ms to 27.49 ms (~16%), i.e. generation throughput drops by ~14% (42.12 → 36.38 tokens/s). A sweep over the affected models is sketched below.
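
A minimal PowerShell sweep, assuming Q4_K_M GGUFs for each model are available locally (the file names below are hypothetical placeholders, not paths from this report):

# hypothetical sweep over the models mentioned above; replace the
# entries with the actual Q4_K_M GGUF locations
$models = @(
    "Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "Qwen2.5-14B-Instruct-Q4_K_M.gguf",
    "c4ai-command-r-v01-Q4_K_M.gguf",
    "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf"
)
foreach ($m in $models) {
    .\build-f11cfdfd\bin\Release\llama-bench.exe -m $m -ngl 99 -p 0 -n 128
    .\build-adc5dd92\bin\Release\llama-bench.exe -m $m -ngl 99 -p 0 -n 128
}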
