cache reuse is not supported for Gemma 4 models despite -fa enabled and --swa-full #21468

@phuryn

Name and Version

version: 8660 (d006858)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA, Vulkan

Hardware

Vulkan1 (RTX 2000 Ada Generation Laptop GPU)

Models

ggml-org/gemma-4-E2B-it-GGUF, ggml-org/gemma-4-E4B-it-GGUF

Problem description & steps to reproduce

Command:
llama-server -hf ggml-org/gemma-4-E2B-it-GGUF -fa on --cache-reuse 256 --swa-full

Observed behavior:
On every request, even when a nearly identical prompt was processed in the previous request, the server logs:
slot update_slots: id 1 | task 314 | cache reuse is not supported - ignoring n_cache_reuse = 256
slot update_slots: id 1 | task 314 | n_tokens = 0, memory_seq_rm [0, end)
Prompt cache save/load itself works: the previous slot's state is saved (~298 MiB for a 46K-token prompt). However, the similarity check reports sim = 0.000, and cache reuse is skipped entirely:
srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
As a result, the full prompt is re-evaluated on every request (~46K tokens, taking ~96 seconds on the test hardware).
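For readers unfamiliar with the two numbers in that log line, here is a minimal sketch of how a prefix-overlap check between a cached prompt and a new prompt could be computed. The definitions used here (f_keep as the fraction of the cached tokens that would be kept, sim as the fraction of the new prompt covered by the common prefix) are an assumption made for illustration, not a copy of the server code; the function names are hypothetical, and the sketch does not reproduce the -1.000 value seen in the log.

```python
from typing import Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_stats(cached: Sequence[int], new: Sequence[int]) -> Tuple[int, float, float]:
    """Return (lcp, f_keep, sim) for a cached prompt and a new prompt.

    Assumed definitions (illustrative only):
      f_keep = lcp / len(cached)  -> share of the cache that stays valid
      sim    = lcp / len(new)     -> share of the new prompt already cached
    """
    lcp = common_prefix_len(cached, new)
    f_keep = lcp / len(cached) if cached else 0.0
    sim = lcp / len(new) if new else 0.0
    return lcp, f_keep, sim


if __name__ == "__main__":
    cached_prompt = list(range(100))            # stand-in token ids
    new_prompt = list(range(90)) + [999] * 10   # same first 90 tokens
    print(prefix_stats(cached_prompt, new_prompt))
```

Under these assumed definitions, two requests that share a long system prompt should give sim close to 1, so sim = 0.000 on a nearly identical prompt suggests the comparison never ran against the previous prompt.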

Practical implication:
Claude Code requests add 30K-40K tokens on top of the user message (system prompt, system tools, MCP servers). Without cache reuse, the user waits 60-90 seconds on every request before Gemma starts outputting the first tokens.
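As a rough consistency check, the 60-90 second wait can be derived from the figures reported above (~46K tokens in ~96 seconds). The sketch below assumes prompt processing time scales linearly with token count, which is only an approximation since attention cost grows faster than linearly; the function names are hypothetical.

```python
def prompt_throughput(n_tokens: int, seconds: float) -> float:
    """Prompt-processing speed in tokens per second."""
    return n_tokens / seconds


def estimated_wait(n_tokens: int, tokens_per_second: float) -> float:
    """Time to process n_tokens at a constant speed (linear-scaling assumption)."""
    return n_tokens / tokens_per_second


# Figures taken from the report: ~46K tokens took ~96 seconds.
tps = prompt_throughput(46_000, 96.0)

# Overhead range quoted for Claude Code requests: 30K-40K tokens.
wait_low = estimated_wait(30_000, tps)
wait_high = estimated_wait(40_000, tps)

if __name__ == "__main__":
    print(f"throughput: {tps:.0f} tokens/s")
    print(f"estimated wait: {wait_low:.0f}-{wait_high:.0f} s")
```

At roughly 479 tokens/s this gives about 63 to 83 seconds, which is in line with the observed 60-90 second delay.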

Root cause hypothesis:
Gemma 4 uses a shared KV cache architecture in which the last num_kv_shared_layers layers reuse K/V tensors from the last non-shared layer rather than computing their own. This architectural property likely breaks the assumptions in the cache reuse / prefix matching code, causing it to explicitly bail out with "cache reuse is not supported".
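The layer sharing described in this hypothesis can be pictured with a toy mapping from each layer to the layer whose K/V tensors it reads. This is a simplified sketch of the description above, not the model's or llama.cpp's actual logic: the function name and the layer counts in the example are hypothetical, and it ignores any distinction between sliding-window and full-attention layers.

```python
def kv_source_layer(layer: int, n_layers: int, n_kv_shared: int) -> int:
    """Index of the layer whose K/V tensors `layer` uses (toy model).

    The last `n_kv_shared` layers are assumed to read K/V from the last
    non-shared layer; all earlier layers use their own K/V.
    """
    if not 0 <= layer < n_layers:
        raise ValueError("layer out of range")
    if not 0 <= n_kv_shared < n_layers:
        raise ValueError("n_kv_shared must leave at least one non-shared layer")
    first_shared = n_layers - n_kv_shared
    return layer if layer < first_shared else first_shared - 1


if __name__ == "__main__":
    # Hypothetical sizes chosen only to keep the output short.
    n_layers, n_kv_shared = 10, 4
    print([kv_source_layer(i, n_layers, n_kv_shared) for i in range(n_layers)])
```

In this toy model only the non-shared layers hold K/V state of their own, so any code that assumes one independent K/V buffer per layer would need to account for the mapping.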

Expected behavior:
Either cache reuse works correctly and accounts for shared KV layers, or the log message explicitly names the shared KV cache architecture as the reason, so users understand why reuse is disabled.

First Bad Commit

No response

Relevant log output

llama-server -hf ggml-org/gemma-4-E2B-it-GGUF -fa on --cache-reuse 256 --swa-full

slot update_slots: id 1 | task 314 | cache reuse is not supported - ignoring n_cache_reuse = 256
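To confirm from a longer log whether every request hits this path, the warning can be extracted mechanically. The sketch below is a small helper written against the exact line format shown above; the helper and type names are hypothetical, and the pattern may need adjusting if the log format differs between builds.

```python
import re
from typing import NamedTuple, Optional


class CacheReuseWarning(NamedTuple):
    slot_id: int
    task_id: int
    n_cache_reuse: int


# Matches the warning exactly as it appears in the log excerpt above.
_PATTERN = re.compile(
    r"id\s+(\d+)\s*\|\s*task\s+(\d+)\s*\|\s*"
    r"cache reuse is not supported - ignoring n_cache_reuse = (\d+)"
)


def parse_cache_reuse_warning(line: str) -> Optional[CacheReuseWarning]:
    """Return the parsed warning, or None if the line is a different message."""
    m = _PATTERN.search(line)
    if m is None:
        return None
    slot_id, task_id, n_cache_reuse = (int(g) for g in m.groups())
    return CacheReuseWarning(slot_id, task_id, n_cache_reuse)


if __name__ == "__main__":
    line = (
        "slot update_slots: id 1 | task 314 | "
        "cache reuse is not supported - ignoring n_cache_reuse = 256"
    )
    print(parse_cache_reuse_warning(line))
```

Counting the non-None results over a full server log shows how many tasks had the --cache-reuse setting ignored.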
