Eval bug: Gemma4 attn_rot_k and v = 0

### Name and Version

docker run -p 8080:8080 --runtime=nvidia --gpus all -v /home/user/llm/models:/models ghcr.io/ggml-org/llama.cpp:server-cuda --version

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 29928 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
  Device 1: NVIDIA GeForce RTX 3050, compute capability 8.6, VMM: yes, VRAM: 5804 MiB
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
version: 8643 (f49e91787)
built with GNU 14.2.0 for Linux x86_64

### Operating systems

Linux (Ubuntu 22)

### GGML backends

CUDA

### Hardware

RTX 3090 and RTX 3050 (also tried with just the 3090)

### Models

Gemma 4 31B

### Problem description & steps to reproduce

I noticed that attention rotation for the KV cache is not applied with Gemma 4: the log reports `attn_rot_k = 0` and `attn_rot_v = 0` for both the non-SWA and SWA caches. With Qwen3.5 on the same system, it works.

Is that expected?

llama_kv_cache_iswa: creating non-SWA KV cache, size = 120064 cells
llama_kv_cache:      CUDA0 KV buffer size =  2901.94 MiB
llama_kv_cache:      CUDA1 KV buffer size =   322.44 MiB
llama_kv_cache: size = 3224.38 MiB (120064 cells,  10 layers,  1/1 seqs), K (q5_0): 1612.19 MiB, V (q5_0): 1612.19 MiB
**llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0**
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache:      CUDA0 KV buffer size =  1128.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    72.00 MiB
llama_kv_cache: size = 1200.00 MiB (  1536 cells,  50 layers,  1/1 seqs), K (f16):  600.00 MiB, V (f16):  600.00 MiB
**llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0**

**EDIT: After updating and setting both the SWA and non-SWA KV caches to the same quantization type, `attn_rot_k` and `attn_rot_v` still show 0.**

docker run -p 8080:8080 --runtime=nvidia --gpus all -v /home/user/llm/models:/models ghcr.io/ggml-org/llama.cpp:server-cuda \
 -m /models/Gemma4-31B/gemma-4-31B-it-UD-Q4_K_XL.gguf \
 --port 8080 --host 0.0.0.0 \
 --no-mmap --threads 8 --jinja \
 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on -kvu --ctx-size 100000 -np 1 \
 --temp 1 --top-p 0.95 --top-k 64
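
For anyone who wants to compare against their own setup, here is a minimal sketch that pulls the cache types and rotation flags out of the server log, one entry per KV cache. The helper name `summarize_kv_caches` is hypothetical (it is not part of llama.cpp), and the regular expressions assume the log format shown in this issue, which may differ in other builds.

```python
import re

# Sample copied from the log in this issue (buffer-size lines omitted).
LOG = """\
llama_kv_cache_iswa: creating non-SWA KV cache, size = 120064 cells
llama_kv_cache: size = 3224.38 MiB (120064 cells,  10 layers,  1/1 seqs), K (q5_0): 1612.19 MiB, V (q5_0): 1612.19 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache: size = 1200.00 MiB (  1536 cells,  50 layers,  1/1 seqs), K (f16):  600.00 MiB, V (f16):  600.00 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
"""

# Assumed line formats, based only on the output above.
CREATE_RE = re.compile(r"creating\s+(\S+) KV cache")
TYPES_RE = re.compile(r"K \((\w+)\):.*V \((\w+)\):")
ROT_RE = re.compile(r"(attn_rot_[kv]) = (\d+)")


def summarize_kv_caches(log_text):
    """Return one dict per KV cache found in the log, in order of appearance."""
    caches = []
    for line in log_text.splitlines():
        line = line.strip().strip("*")  # tolerate markdown bold markers
        m = CREATE_RE.search(line)
        if m:
            # A "creating ... KV cache" line starts a new cache entry.
            caches.append({"cache": m.group(1)})
            continue
        if not caches:
            continue  # ignore anything before the first cache is announced
        m = TYPES_RE.search(line)
        if m:
            caches[-1]["type_k"], caches[-1]["type_v"] = m.groups()
            continue
        m = ROT_RE.search(line)
        if m:
            caches[-1][m.group(1)] = int(m.group(2))
    return caches


for cache in summarize_kv_caches(LOG):
    print(cache)
```

Replacing `LOG` with the captured output of the `docker run` command above gives a quick side-by-side of the K/V cache types and the `attn_rot_k` / `attn_rot_v` values for each cache.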

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console

llama_kv_cache_iswa: creating non-SWA KV cache, size = 120064 cells
llama_kv_cache:      CUDA0 KV buffer size =  2901.94 MiB
llama_kv_cache:      CUDA1 KV buffer size =   322.44 MiB
llama_kv_cache: size = 3224.38 MiB (120064 cells,  10 layers,  1/1 seqs), K (q5_0): 1612.19 MiB, V (q5_0): 1612.19 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache:      CUDA0 KV buffer size =  1128.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    72.00 MiB
llama_kv_cache: size = 1200.00 MiB (  1536 cells,  50 layers,  1/1 seqs), K (f16):  600.00 MiB, V (f16):  600.00 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
```
</details>



