Misc. bug: Vulkan Q4_K_M inference speed degradation #11559

Closed
@neilmehta24

Description

Name and Version

llama.cpp version: 4490 (adc5dd9)

Windows 11 Pro
dual AMD Radeon PRO W7800
Vulkan SDK version: 1.3.283

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

git checkout f11cfdfd
cmake -B build-f11cfdfd -DGGML_VULKAN=ON
cmake --build .\build-f11cfdfd\ --config Release
.\build-f11cfdfd\bin\Release\llama-cli.exe -no-cnv -m "C:\Users\User\.cache\lm-studio\models\lmstudio-community\Qwen2.5-14B-Instruct-GGUF\Qwen2.5-14B-Instruct-Q4_K_M.gguf" -ngl 99 --seed 0 --temp 0 -p "<|im_start|>user
>> Tell me a 100 word story<|im_end|>
>> <|im_start|>assistant
>> "

git checkout adc5dd92
cmake -B build-adc5dd92 -DGGML_VULKAN=ON
cmake --build .\build-adc5dd92\ --config Release
.\build-adc5dd92\bin\Release\llama-cli.exe -no-cnv -m "C:\Users\User\.cache\lm-studio\models\lmstudio-community\Qwen2.5-14B-Instruct-GGUF\Qwen2.5-14B-Instruct-Q4_K_M.gguf" -ngl 99 --seed 0 --temp 0 -p "<|im_start|>user
>> Tell me a 100 word story<|im_end|>
>> <|im_start|>assistant
>> "

Problem description & steps to reproduce

Build llama.cpp at the two commits above and run the same llama-cli command against a Q4_K_M model. Prompt processing speed is essentially unchanged (~103 tokens/s at both commits), but generation (eval) speed drops from ~42 tokens/s at f11cfdfd to ~36 tokens/s at adc5dd92; see the log output below.

First Bad Commit

adc5dd9
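
For anyone re-confirming this on other hardware: if the two commits are not adjacent in history, a standard git bisect between them should land on the same commit (the build and run steps at each bisect point are the ones from the command line above):

git bisect start
git bisect bad adc5dd92
git bisect good f11cfdfd
# at each bisect point: rebuild with -DGGML_VULKAN=ON, rerun the
# llama-cli command above, and mark the commit by its eval speed:
#   ~42 tokens/s -> git bisect good
#   ~36 tokens/s -> git bisect bad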

Relevant log output

f11cfdfd:
llama_perf_sampler_print:    sampling time =       8.70 ms /   108 runs   (    0.08 ms per token, 12410.94 tokens per second)
llama_perf_context_print:        load time =    5022.91 ms
llama_perf_context_print: prompt eval time =     164.06 ms /    17 tokens (    9.65 ms per token,   103.62 tokens per second)
llama_perf_context_print:        eval time =    2136.73 ms /    90 runs   (   23.74 ms per token,    42.12 tokens per second)
llama_perf_context_print:        total time =    2320.98 ms /   107 tokens

adc5dd92:
llama_perf_sampler_print:    sampling time =       8.74 ms /   108 runs   (    0.08 ms per token, 12356.98 tokens per second)
llama_perf_context_print:        load time =    5052.59 ms
llama_perf_context_print: prompt eval time =     164.99 ms /    17 tokens (    9.71 ms per token,   103.04 tokens per second)
llama_perf_context_print:        eval time =    2473.68 ms /    90 runs   (   27.49 ms per token,    36.38 tokens per second)
llama_perf_context_print:       total time =    2659.04 ms /   107 tokens

Additional information:

I tested a few other models and observed the same Q4_K_M degradation across several architectures, including Qwen2.5 7B, Qwen2.5 14B, and Command R v01. I also have an unconfirmed report of Phi 4 degradation. Smaller models such as Qwen2.5 0.5B did not show the degradation. In the logs above, per-token eval time rises from 23.74 ms to 27.49 ms (~16%), i.e. generation throughput drops by ~14% (42.12 → 36.38 tokens/s). A sweep over the affected models is sketched below.
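
A minimal PowerShell sweep, assuming Q4_K_M GGUFs for each model are available locally (the file names below are hypothetical placeholders, not paths from this report):

# hypothetical sweep over the models mentioned above; replace the
# entries with the actual Q4_K_M GGUF locations
$models = @(
    "Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "Qwen2.5-14B-Instruct-Q4_K_M.gguf",
    "c4ai-command-r-v01-Q4_K_M.gguf",
    "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf"
)
foreach ($m in $models) {
    .\build-f11cfdfd\bin\Release\llama-bench.exe -m $m -ngl 99 -p 0 -n 128
    .\build-adc5dd92\bin\Release\llama-bench.exe -m $m -ngl 99 -p 0 -n 128
}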
