Eval bug: Gemma 4 fails logprob comparison against itself.

### Name and Version

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 32767 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
  Device 1: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes, VRAM: 8191 MiB
load_backend: loaded CUDA backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-cuda.dll
load_backend: loaded RPC backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-rpc.dll
load_backend: loaded CPU backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-cpu-zen4.dll
version: 8681 (506200cf8)
built with Clang 19.1.5 for Windows x86_64


### Operating systems

Windows

### GGML backends

CUDA

### Hardware

RTX 3090 + AMD Ryzen 9 7900X in this setup. My second GPU, an RTX 2060 SUPER, was not used.

Tested with both a quantized model and BF16, with the same result.



### Models

gemma-4-26B-A4B. The primary quants were the launch-day BF16 from unsloth and the q4_k_xl from yesterday.

### Problem description & steps to reproduce

Run llama-server with default settings and connect to the provided demo. It checks the logprobs and highlights mismatches. Request parameters used:

  "max_tokens": 512,
  "temperature": 0,
  "top_p": 1,
  "repetition_penalty": 1,
  "presence_penalty": 0,
  "frequency_penalty": 0,
  "logprobs": 10,
  "stream": true,
  "seed": 42
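
For anyone who wants to reproduce this without the HTML demo, here is a minimal sketch of sending the same parameters from Python. The URL (llama-server's default port with the OpenAI-compatible `/v1/completions` route), the `prompt` field, and the helper names `build_payload` and `post_completion` are assumptions for illustration, not taken from the demo. Whether the server honors every field as written (for example `repetition_penalty`) is also not verified here.

```python
import json
import urllib.request


def build_payload(prompt: str) -> dict:
    """Build a request body with the sampling parameters from this report."""
    return {
        "prompt": prompt,  # assumption: a prompt with the chat template already applied
        "max_tokens": 512,
        "temperature": 0,
        "top_p": 1,
        "repetition_penalty": 1,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "logprobs": 10,
        "stream": False,  # the report uses streaming; disabled here to keep the sketch short
        "seed": 42,
    }


def post_completion(prompt: str,
                    url: str = "http://127.0.0.1:8080/v1/completions") -> dict:
    """POST the payload to a running server and return the parsed JSON reply.

    The default URL is an assumption; adjust it to your setup.
    """
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `post_completion("...")` twice with the same prompt and comparing the returned logprobs is the simplest form of the self-check.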

Models usually match their own probabilities just fine, but Gemma 4 does not.



I have been working on a setup that compares quantizations across domains to surface insights beyond KLD and PPL. While doing so, I noticed that Gemma 4 was extremely unstable.


I got Claude to quickly throw together a setup that applies the template and checks the probabilities in an isolated system.
Qwen 3.5 9B shows no difference, while Gemma 4 26B does. I also checked Ministral 8B and found no issues.

It does not seem to be even close to deterministic.
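
To make "mismatch" concrete, here is a rough sketch of how two runs could be compared token by token. This is not necessarily how selfcheck.html scores things. The function name `find_mismatches`, the tolerance of `1e-3`, and the sample values are all hypothetical and only illustrate the idea.

```python
import math
from typing import List, Tuple

TokenLogprob = Tuple[str, float]


def find_mismatches(first: List[TokenLogprob],
                    second: List[TokenLogprob],
                    atol: float = 1e-3) -> List[Tuple[int, str, float, float]]:
    """Return (position, token, logprob_a, logprob_b) for every position where
    the two runs disagree on the token or differ by more than atol in logprob."""
    mismatches = []
    for i, ((tok_a, lp_a), (tok_b, lp_b)) in enumerate(zip(first, second)):
        if tok_a != tok_b or not math.isclose(lp_a, lp_b, abs_tol=atol):
            mismatches.append((i, tok_a, lp_a, lp_b))
    return mismatches


# Made-up values: the second token drifts between the two runs.
generated = [("The", -0.01), (" cat", -0.52), (" sat", -1.10)]
rescored = [("The", -0.01), (" cat", -0.90), (" sat", -1.10)]
print(find_mismatches(generated, rescored))
```

With a deterministic backend the list should come back empty when a model is compared against itself, which is what Qwen and Ministral show here.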

### Gemma 4 26B:

<img width="856" height="996" alt="Image" src="https://github.com/user-attachments/assets/da66588b-0569-4063-a3ea-e1141b9edb4c" />

### Qwen 3.5 9B:

<img width="950" height="822" alt="Image" src="https://github.com/user-attachments/assets/30da31a2-fabc-490e-8fb0-5dc1cb1d8005" />

### Ministral 8B:

<img width="1131" height="943" alt="Image" src="https://github.com/user-attachments/assets/137dbfee-30cd-44b5-8efa-9baa5b9c2b17" />


After switching to CPU, all is fine again.

<img width="1102" height="934" alt="Image" src="https://github.com/user-attachments/assets/d18d35cd-50ae-40df-baeb-dc7eb5fcfaa7" />

Here is the self-check setup generated for this isolated test.

[selfcheck.html](https://github.com/user-attachments/files/26521560/selfcheck.html)

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>
