Question: K/V Quantization q5_0 slower than q4_0 and q8_0 #21295

@winstonma

Git commit

I am running version b8625

Operating systems

Linux

GGML backends

Vulkan

Problem description & steps to reproduce

I am running Qwen3.5-0.8B-Q4_K_M.gguf with flash attention on and benchmarking the new KV cache quantization.

K/V quantization    Token generation (tokens/s)
f16                 91
q8_0                88
q4_0                88
q5_0                75

Should I expect the performance of q5_0 to sit between q4_0 and q8_0?
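
To put the gap in numbers, here is a small sketch that computes how much slower q5_0 is relative to each of the other cache types. It uses only the four throughput figures reported above; the function name `slowdown_percent` is just an illustrative helper, not part of llama.cpp.

```python
# Throughput figures copied from the measurements above (tokens per second).
tokens_per_second = {"f16": 91, "q8_0": 88, "q4_0": 88, "q5_0": 75}


def slowdown_percent(baseline: str, candidate: str) -> float:
    """Return how much slower `candidate` is than `baseline`, in percent."""
    base = tokens_per_second[baseline]
    return (base - tokens_per_second[candidate]) / base * 100.0


for baseline in ("f16", "q8_0", "q4_0"):
    pct = slowdown_percent(baseline, "q5_0")
    print(f"q5_0 vs {baseline}: {pct:.1f}% slower")
```

With these figures, q5_0 comes out roughly 15% slower than q4_0 and q8_0, while q8_0 and q4_0 are each only about 3% behind f16.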

To reproduce, set -ctk and -ctv to a different cache type for each run:

llama-cli -m ~/model/Qwen3.5-0.8B-Q4_K_M.gguf -ctk q8_0 -ctv q8_0 --reasoning off --flash-attn on
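
To keep the runs consistent, the same command can be generated for every cache type. The sketch below only builds and prints the command lines; it reuses the exact flags from the command above and does not launch anything. The helper name `build_command` is hypothetical, and it assumes `f16` is accepted as a value for `-ctk`/`-ctv` alongside the quantized types.

```python
import os
import shlex


def build_command(model_path: str, cache_type: str) -> list[str]:
    """Build the llama-cli argument list for one K/V cache type."""
    return [
        "llama-cli",
        "-m", os.path.expanduser(model_path),  # expand "~" so quoting is safe
        "-ctk", cache_type,
        "-ctv", cache_type,
        "--reasoning", "off",
        "--flash-attn", "on",
    ]


# Cache types compared in this issue.
for cache_type in ("f16", "q8_0", "q4_0", "q5_0"):
    cmd = build_command("~/model/Qwen3.5-0.8B-Q4_K_M.gguf", cache_type)
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, so the only thing that changes between runs is the cache type.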

First Bad Commit

No response

Compile command

I am using the default CachyOS build. The build flags are in the PKGBUILD.

Relevant log output

llama_context: Vulkan_Host  output buffer size =     0.95 MiB
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: dev = Vulkan0
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: dev = Vulkan0
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = Vulkan0
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: dev = Vulkan0
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: dev = Vulkan0
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = Vulkan0
llama_kv_cache:    Vulkan0 KV buffer size =     0.00 MiB
llama_kv_cache: size = 1056.00 MiB (262144 cells,   6 layers,  1/1 seqs), K (q5_0):  528.00 MiB, V (q5_0):  528.00 MiB
llama_kv_cache: attn_rot_k = 1
llama_kv_cache: attn_rot_v = 1
llama_memory_recurrent, layer   0: dev = Vulkan0
llama_memory_recurrent, layer   1: dev = Vulkan0
llama_memory_recurrent, layer   2: dev = Vulkan0
llama_memory_recurrent: layer   3: skipped
llama_memory_recurrent, layer   4: dev = Vulkan0
llama_memory_recurrent, layer   5: dev = Vulkan0
llama_memory_recurrent, layer   6: dev = Vulkan0
llama_memory_recurrent: layer   7: skipped
llama_memory_recurrent, layer   8: dev = Vulkan0
llama_memory_recurrent, layer   9: dev = Vulkan0
llama_memory_recurrent, layer  10: dev = Vulkan0
llama_memory_recurrent: layer  11: skipped
llama_memory_recurrent, layer  12: dev = Vulkan0
llama_memory_recurrent, layer  13: dev = Vulkan0
llama_memory_recurrent, layer  14: dev = Vulkan0
llama_memory_recurrent: layer  15: skipped
llama_memory_recurrent, layer  16: dev = Vulkan0
llama_memory_recurrent, layer  17: dev = Vulkan0
llama_memory_recurrent, layer  18: dev = Vulkan0
llama_memory_recurrent: layer  19: skipped
llama_memory_recurrent, layer  20: dev = Vulkan0
llama_memory_recurrent, layer  21: dev = Vulkan0
llama_memory_recurrent, layer  22: dev = Vulkan0
llama_memory_recurrent: layer  23: skipped
llama_memory_recurrent:    Vulkan0 RS buffer size =    19.27 MiB
llama_memory_recurrent: size =   19.27 MiB (     1 cells,  24 layers,  1 seqs), R (f32):    1.27 MiB, S (f32):   18.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
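
As a sanity check on the log, the reported KV cache size can be reproduced from the ggml block layouts: a 32-element block takes 18 bytes in q4_0, 22 bytes in q5_0, and 34 bytes in q8_0. The sketch below is a rough cross-check, not llama.cpp code. The cell and layer counts are read from the `llama_kv_cache: size = ...` line; the 512 elements per cell per layer is an assumption, inferred so that q5_0 matches the logged 528.00 MiB (the log does not print it).

```python
BLOCK_ELEMS = 32

# Bytes per 32 elements. Quantized types store an fp16 scale plus packed values;
# f16 is unquantized (2 bytes per element), expressed per 32 elements for uniformity.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_0": 22, "q4_0": 18}

# Read from the llama_kv_cache log line.
N_CELLS = 262144
N_LAYERS = 6

# Assumption: inferred from the logged 528.00 MiB for q5_0, not printed in the log.
ELEMS_PER_CELL = 512


def cache_mib(cache_type: str) -> float:
    """Estimated size in MiB of the K (or V) cache for one cache type."""
    blocks = N_CELLS * N_LAYERS * ELEMS_PER_CELL // BLOCK_ELEMS
    return blocks * BLOCK_BYTES[cache_type] / (1024 * 1024)


for cache_type in ("f16", "q8_0", "q5_0", "q4_0"):
    print(f"{cache_type}: {cache_mib(cache_type):.2f} MiB per K or V cache")
```

Under that assumption q5_0 sits between q4_0 and q8_0 in memory footprint, which is why the throughput ordering above is surprising.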
