Eval bug: Gemma 4 fails logprob comparison against itself. #21532

@espen96

Description

Name and Version

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 32767 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 1: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes, VRAM: 8191 MiB
load_backend: loaded CUDA backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-cuda.dll
load_backend: loaded RPC backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-rpc.dll
load_backend: loaded CPU backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-cpu-zen4.dll
version: 8681 (506200c)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA

Hardware

RTX 3090 + AMD Ryzen 9 7900X in this setup. My second GPU, an RTX 2060 SUPER, was not used.

Tested with both a quantized model and BF16. Same result.

Models

gemma-4-26B-A4B. The primary quants were the launch-day BF16 from unsloth and the q4_k_xl from yesterday.

Problem description & steps to reproduce

Run llama-server with default settings and connect to the provided demo. It checks the logprobs and highlights mismatches. The request uses these settings:

"max_tokens": 512,
"temperature": 0,
"top_p": 1,
"repetition_penalty": 1,
"presence_penalty": 0,
"frequency_penalty": 0,
"logprobs": 10,
"stream": true,
"seed": 42

Models usually seem to match their own probabilities just fine, but Gemma 4 does not play ball.
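For readers who want to reproduce the comparison outside the demo page, here is a minimal sketch of the kind of check described above. It is not the code from selfcheck.html. It assumes each run has already been collected into a list of `(token, logprob)` pairs for the chosen token at each position; that layout, the function name `find_mismatches`, and the tolerance value are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout: one (token text, logprob) pair per generated position.
Step = Tuple[str, float]


def find_mismatches(run_a: List[Step], run_b: List[Step], tol: float = 1e-4) -> List[int]:
    """Return positions where the token differs or the logprob differs by more than tol."""
    mismatches = []
    for i, ((tok_a, lp_a), (tok_b, lp_b)) in enumerate(zip(run_a, run_b)):
        if tok_a != tok_b or abs(lp_a - lp_b) > tol:
            mismatches.append(i)
    # Positions that exist in only one run are counted as mismatches too.
    shorter, longer = sorted((len(run_a), len(run_b)))
    mismatches.extend(range(shorter, longer))
    return mismatches
```

With temperature 0 and a fixed seed, two runs of the same prompt would be expected to return an empty list here. Once the token sequences diverge, every later position is being compared under a different context, so the first reported index is the most meaningful one.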

I have been working on a setup that examines various aspects of quantization across domains, to surface insights beyond KLD and PPL. While doing that, I noticed that Gemma 4 was extremely unstable.

I got Claude to quickly throw together a setup that applies the template and checks the probabilities in an isolated system.
Qwen 3.5 9B shows no difference, while Gemma 4 26B does. I also checked Ministral 8B: no issues.

Gemma 4 does not seem to be even close to deterministic.
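One way to put a number on "not close to deterministic" is to repeat the same request several times and look at the per-position spread of the logprobs. The sketch below assumes each run has been reduced to a plain list of logprobs for the same scored tokens; the function name `logprob_spread` is hypothetical and this is not part of the attached self check.

```python
from typing import List


def logprob_spread(runs: List[List[float]]) -> List[float]:
    """Per position, max minus min logprob across runs (0.0 means all runs agree)."""
    # Only compare positions that every run actually reached.
    n = min(len(r) for r in runs)
    return [max(r[i] for r in runs) - min(r[i] for r in runs) for i in range(n)]


# Example with made-up numbers: position 0 is stable, position 1 drifts.
example = logprob_spread([[-0.5, -1.0], [-0.5, -1.25], [-0.5, -0.75]])
```

A model that reproduces its own probabilities would give values at or very near 0.0 everywhere. The numbers in the example are illustrative only and are not measurements from this report.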

Gemma 4 26B:

Image

Qwen 3.5 9B:

Image

Ministral 8B:

Image

After switching to CPU, all is fine again.

Image

Here is the self check setup generated for this isolated test.

selfcheck.html

First Bad Commit

No response

Relevant log output

Logs
