Eval bug: Gemma4 attn_rot_k and v = 0 #21394

@vektorprime

Name and Version

docker run -p 8080:8080 --runtime=nvidia --gpus all -v /home/user/llm/models:/models ghcr.io/ggml-org/llama.cpp:server-cuda --version

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 29928 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3050, compute capability 8.6, VMM: yes, VRAM: 5804 MiB
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
version: 8643 (f49e917)
built with GNU 14.2.0 for Linux x86_64

Operating systems

Linux (Ubuntu 22)

GGML backends

CUDA

Hardware

RTX 3090 and RTX 3050 (also tried with only the 3090)

Models

Gemma 4 31B

Problem description & steps to reproduce

I noticed that attention rotation for the KV cache does not get enabled with Gemma 4: the log reports attn_rot_k = 0 and attn_rot_v = 0 for both the SWA and non-SWA caches. On this same system it works with Qwen3.5.

Is that expected?

llama_kv_cache_iswa: creating non-SWA KV cache, size = 120064 cells
llama_kv_cache: CUDA0 KV buffer size = 2901.94 MiB
llama_kv_cache: CUDA1 KV buffer size = 322.44 MiB
llama_kv_cache: size = 3224.38 MiB (120064 cells, 10 layers, 1/1 seqs), K (q5_0): 1612.19 MiB, V (q5_0): 1612.19 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0

llama_kv_cache_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache: CUDA0 KV buffer size = 1128.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 72.00 MiB
llama_kv_cache: size = 1200.00 MiB ( 1536 cells, 50 layers, 1/1 seqs), K (f16): 600.00 MiB, V (f16): 600.00 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0

EDIT, PLEASE NOTE: After updating and setting both the SWA and non-SWA KV caches to the SAME quantization type, attn_rot_k and attn_rot_v are STILL showing 0.

docker run -p 8080:8080 --runtime=nvidia --gpus all -v /home/user/llm/models:/models ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Gemma4-31B/gemma-4-31B-it-UD-Q4_K_XL.gguf \
  --port 8080 --host 0.0.0.0 \
  --no-mmap --threads 8 --jinja \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on -kvu --ctx-size 100000 -np 1 \
  --temp 1 --top-p 0.95 --top-k 64

First Bad Commit

No response

Relevant log output

Logs
llama_kv_cache_iswa: creating non-SWA KV cache, size = 120064 cells
llama_kv_cache:      CUDA0 KV buffer size =  2901.94 MiB
llama_kv_cache:      CUDA1 KV buffer size =   322.44 MiB
llama_kv_cache: size = 3224.38 MiB (120064 cells,  10 layers,  1/1 seqs), K (q5_0): 1612.19 MiB, V (q5_0): 1612.19 MiB
**llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0**
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache:      CUDA0 KV buffer size =  1128.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    72.00 MiB
llama_kv_cache: size = 1200.00 MiB (  1536 cells,  50 layers,  1/1 seqs), K (f16):  600.00 MiB, V (f16):  600.00 MiB
**llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0**
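For anyone trying to compare runs (for example Gemma 4 against Qwen3.5, or different cache types), here is a rough sketch of a script that pulls the attn_rot_k / attn_rot_v values out of a saved server log and groups them by cache. The line formats are assumed from the log excerpt in this report only, and `summarize_attn_rot` is a hypothetical helper name, not part of llama.cpp. It only reports what the log says; it does not explain why the values are 0.

```python
import re

# Assumed format, taken from the log above: "llama_kv_cache: attn_rot_k = 0"
ATTN_ROT = re.compile(r"llama_kv_cache:\s*(attn_rot_[kv])\s*=\s*(\d+)")
# Assumed format: "llama_kv_cache_iswa: creating non-SWA KV cache, size = ..."
CACHE = re.compile(r"llama_kv_cache_iswa:\s*creating\s+(non-SWA|SWA)\s+KV cache")


def summarize_attn_rot(log_text):
    """Return {cache_name: {"attn_rot_k": int, "attn_rot_v": int}} from log text."""
    result = {}
    current = "unknown"  # used if a value appears before any "creating ... KV cache" line
    for line in log_text.splitlines():
        cache = CACHE.search(line)
        if cache:
            current = cache.group(1)
            result.setdefault(current, {})
            continue
        rot = ATTN_ROT.search(line)
        if rot:
            result.setdefault(current, {})[rot.group(1)] = int(rot.group(2))
    return result


# Sample lines copied from the log in this report.
SAMPLE = """\
llama_kv_cache_iswa: creating non-SWA KV cache, size = 120064 cells
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
"""

if __name__ == "__main__":
    for cache_name, flags in summarize_attn_rot(SAMPLE).items():
        print(cache_name, flags)
```

To use it on a real run, save the container output to a file and pass the file contents to `summarize_attn_rot` instead of `SAMPLE`.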
