Cache reuse is not supported for Gemma 4 models despite `-fa on` and `--swa-full` (#21468)

### Name and Version

version: 8660 (d00685831)
built with Clang 19.1.5 for Windows x86_64

### Operating systems

Windows

### GGML backends

CUDA, Vulkan

### Hardware

Vulkan1 (RTX 2000 Ada Generation Laptop GPU) 

### Models

ggml-org/gemma-4-E2B-it-GGUF, ggml-org/gemma-4-E4B-it-GGUF

### Problem description & steps to reproduce

**Command:**
`llama-server -hf ggml-org/gemma-4-E2B-it-GGUF -fa on --cache-reuse 256 --swa-full`

**Observed behavior:**
On every request, even when the previous request processed a nearly identical prompt, the server logs:
`slot update_slots: id  1 | task 314 | cache reuse is not supported - ignoring n_cache_reuse = 256`
`slot update_slots: id  1 | task 314 | n_tokens = 0, memory_seq_rm [0, end)`
The prompt cache save/load infrastructure works (the previous slot's state is saved, ~298 MiB for a 46K-token prompt), but the similarity check returns `sim = 0.000`, so cache reuse is skipped entirely:
`srv  load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000`
As a result, the full prompt is re-evaluated on every request (~46K tokens, ~96 seconds on the test hardware).
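
The `sim` and `f_keep` values in that log line can be read as prefix-overlap ratios. Below is a minimal sketch, assuming `sim` is the share of the new prompt covered by the longest common token prefix with the cached prompt, and `f_keep` is the share of the cached prompt that this prefix would retain. These definitions, the helper names, and the token ids are illustrative assumptions, not the server's actual code.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common prefix of two token-id sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_overlap(cached: Sequence[int], new: Sequence[int]) -> tuple[float, float]:
    """Return (sim, f_keep) under the assumed definitions in the lead-in."""
    lcp = common_prefix_len(cached, new)
    sim = lcp / len(new) if new else 0.0          # share of the new prompt already cached
    f_keep = lcp / len(cached) if cached else 0.0  # share of the cached prompt kept
    return sim, f_keep


# Hypothetical token ids: a long shared prefix followed by a different final message.
cached_prompt = list(range(1000)) + [7, 8, 9]
new_prompt = list(range(1000)) + [4, 5]
sim, f_keep = prefix_overlap(cached_prompt, new_prompt)
print(f"sim = {sim:.3f}, f_keep = {f_keep:.3f}")
```

Under these assumed definitions, two requests that share a 30K-40K token prefix would be expected to score close to 1.0, which is why `sim = 0.000` stands out in the log.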

**Practical implication:**
Claude Code requests add 30K-40K tokens on top of the user message (system prompt, system tools, MCP servers). Without cache reuse, the user waits 60-90 seconds on every request before Gemma starts emitting the first tokens.

**Root cause hypothesis:**
Gemma 4 uses a [Shared KV Cache architecture](https://huggingface.co/blog/gemma4#shared-kv-cache) in which the last `num_kv_shared_layers` layers reuse K/V tensors from the last non-shared layer rather than computing their own. This architectural property likely breaks the assumptions in the cache reuse / prefix matching code, causing it to bail out explicitly with "cache reuse is not supported".
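
The layer sharing described in this hypothesis can be sketched as a mapping from each layer to the layer whose K/V tensors it reads. This is a simplified illustration of the description above only: the function name and the layer counts are hypothetical, and a real implementation may apply further rules that are not modeled here.

```python
def kv_source_layers(n_layers: int, num_kv_shared_layers: int) -> list[int]:
    """For each layer index, return the index of the layer that owns its K/V tensors.

    Simplified model of the description in this report: the last
    `num_kv_shared_layers` layers read K/V from the last non-shared layer.
    """
    if not 0 <= num_kv_shared_layers < n_layers:
        raise ValueError("num_kv_shared_layers must be in [0, n_layers)")
    first_shared = n_layers - num_kv_shared_layers
    last_own = first_shared - 1  # last layer that computes its own K/V
    return [i if i < first_shared else last_own for i in range(n_layers)]


# Hypothetical sizes, not taken from any Gemma 4 config.
sources = kv_source_layers(n_layers=8, num_kv_shared_layers=3)
print(sources)  # [0, 1, 2, 3, 4, 4, 4, 4]
```

In this model only the layers that map to themselves hold K/V data, so code that assumes one independent K/V buffer per layer would need to account for the shared tail. Whether that is what triggers the bail-out remains a hypothesis.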

**Expected behavior:**
Either cache reuse works correctly, accounting for shared KV layers, or the message explicitly names the shared KV cache architecture as the reason, so users understand why reuse is disabled.

### First Bad Commit

_No response_

### Relevant log output

`llama-server -hf ggml-org/gemma-4-E2B-it-GGUF -fa on --cache-reuse 256 --swa-full`

`slot update_slots: id  1 | task 314 | cache reuse is not supported - ignoring n_cache_reuse = 256`
