Name and Version
version: 8660 (d006858)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
CUDA, Vulkan
Hardware
Vulkan1 (RTX 2000 Ada Generation Laptop GPU)
Models
ggml-org/gemma-4-E2B-it-GGUF, ggml-org/gemma-4-E4B-it-GGUF
Problem description & steps to reproduce
Command:
llama-server -hf ggml-org/gemma-4-E2B-it-GGUF -fa on --cache-reuse 256 --swa-full
Observed behavior:
On every request, even when a nearly identical prompt was processed in the previous request, the server logs:
slot update_slots: id 1 | task 314 | cache reuse is not supported - ignoring n_cache_reuse = 256
slot update_slots: id 1 | task 314 | n_tokens = 0, memory_seq_rm [0, end)
The prompt cache save/load infrastructure is working (the previous slot's state is saved, ~298 MiB for a 46K token prompt), but the similarity check returns sim = 0.000 and cache reuse is skipped entirely:
srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
This results in full prompt re-evaluation on every request (~46K tokens, ~96 seconds on the test hardware).
Practical implication:
Claude Code requests add 30K-40K tokens on top of the user message (system prompt, system tools, MCP servers). As a result, the user has to wait 60-90 seconds every time before gemma starts outputting the first tokens.
Root cause hypothesis:
Gemma 4 uses a Shared KV Cache architecture where the last num_kv_shared_layers layers reuse K/V tensors from the last non-shared layer rather than computing their own. This architectural property likely breaks the assumptions in the cache reuse / prefix matching code, causing it to explicitly bail out with "cache reuse is not supported."
Expected behavior:
Either cache reuse works correctly accounting for shared KV layers, or the error message explicitly names the shared KV cache architecture as the reason so users understand why.
First Bad Commit
No response
Relevant log output
llama-server -hf ggml-org/gemma-4-E2B-it-GGUF -fa on --cache-reuse 256 --swa-full
slot update_slots: id 1 | task 314 | cache reuse is not supported - ignoring n_cache_reuse = 256
Name and Version
version: 8660 (d006858)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
CUDA, Vulkan
Hardware
Vulkan1 (RTX 2000 Ada Generation Laptop GPU)
Models
ggml-org/gemma-4-E2B-it-GGUF, ggml-org/gemma-4-E4B-it-GGUF
Problem description & steps to reproduce
Command:
llama-server -hf ggml-org/gemma-4-E2B-it-GGUF -fa on --cache-reuse 256 --swa-full
Observed behavior:
On every request, even when a nearly identical prompt was processed in the previous request, the server logs:
slot update_slots: id 1 | task 314 | cache reuse is not supported - ignoring n_cache_reuse = 256
slot update_slots: id 1 | task 314 | n_tokens = 0, memory_seq_rm [0, end)
The prompt cache save/load infrastructure is working (the previous slot's state is saved, ~298 MiB for a 46K token prompt), but the similarity check returns sim = 0.000 and cache reuse is skipped entirely:
srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
This results in full prompt re-evaluation on every request (~46K tokens, ~96 seconds on the test hardware).
Practical implication:
Claude Code requests add 30K-40K tokens on top of the user message (system prompt, system tools, MCP servers). As a result, the user has to wait 60-90 seconds every time before gemma starts outputting the first tokens.
Root cause hypothesis:
Gemma 4 uses a Shared KV Cache architecture where the last num_kv_shared_layers layers reuse K/V tensors from the last non-shared layer rather than computing their own. This architectural property likely breaks the assumptions in the cache reuse / prefix matching code, causing it to explicitly bail out with "cache reuse is not supported."
Expected behavior:
Either cache reuse works correctly accounting for shared KV layers, or the error message explicitly names the shared KV cache architecture as the reason so users understand why.
First Bad Commit
No response
Relevant log output
llama-server -hf ggml-org/gemma-4-E2B-it-GGUF -fa on --cache-reuse 256 --swa-full
slot update_slots: id 1 | task 314 | cache reuse is not supported - ignoring n_cache_reuse = 256