Name and Version
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 32767 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 1: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes, VRAM: 8191 MiB
load_backend: loaded CUDA backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-cuda.dll
load_backend: loaded RPC backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-rpc.dll
load_backend: loaded CPU backend from F:\llama-swap_123_windows_amd64\engines\llama.cpp verify\ggml-cpu-zen4.dll
version: 8681 (506200c)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
CUDA
Hardware
RTX 3090 + AMD Ryzen 9 7900X in this setup. My second GPU, an RTX 2060 SUPER, was not used.
Tested with both a quantized model and BF16, with the same result.
Models
gemma-4-26B-A4B. The primary quants were the launch-day BF16 from unsloth and the q4_k_xl from yesterday.
Problem description & steps to reproduce
Run llama-server with default settings and connect to the provided demo. It checks the logprobs and highlights mismatches. The request parameters are:
"max_tokens": 512,
"temperature": 0,
"top_p": 1,
"repetition_penalty": 1,
"presence_penalty": 0,
"frequency_penalty": 0,
"logprobs": 10,
"stream": true,
"seed": 42
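For reference, the comparison the demo performs can be sketched roughly as below. This is a minimal illustration, not the code from selfcheck.html: the function name `compare_runs`, the `(token, logprob)` pair layout, and the tolerance are my own assumptions, and fetching the two runs from the server is left out.

```python
import math
from typing import List, Tuple

# Hypothetical layout: one (token, logprob) pair per generated position,
# extracted from the server response by whatever client is in use.
TokenLogprob = Tuple[str, float]


def compare_runs(
    run_a: List[TokenLogprob],
    run_b: List[TokenLogprob],
    tol: float = 1e-6,
) -> List[Tuple[int, str, str, float]]:
    """Return (position, token_a, token_b, abs_logprob_diff) for every mismatch.

    A position counts as a mismatch when the sampled tokens differ or when
    their logprobs differ by more than `tol` (an assumed threshold).
    """
    mismatches = []
    for i, ((tok_a, lp_a), (tok_b, lp_b)) in enumerate(zip(run_a, run_b)):
        diff = abs(lp_a - lp_b)
        if tok_a != tok_b or diff > tol:
            mismatches.append((i, tok_a, tok_b, diff))
    # Runs of different length diverged somewhere; flag the first extra position.
    if len(run_a) != len(run_b):
        mismatches.append((min(len(run_a), len(run_b)), "<end>", "<end>", math.inf))
    return mismatches


if __name__ == "__main__":
    first = [("Hello", -0.01), (" world", -0.5)]
    second = [("Hello", -0.01), (" world", -0.75)]
    for pos, tok_a, tok_b, diff in compare_runs(first, second):
        print(f"position {pos}: {tok_a!r} vs {tok_b!r}, |delta logprob| = {diff}")
```

With temperature 0 and a fixed seed, two identical requests would be expected to return an empty list here.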
Models usually match their own probabilities just fine, but Gemma 4 does not.
I have been working on a setup that evaluates quantization across domains to surface insights beyond KLD and PPL. While doing so, I noticed that Gemma 4 was extremely unstable.
I got Claude to quickly throw together a setup that applies the template and checks the probabilities in isolation.
Qwen 3.5 9B shows no difference, while Gemma 4 26B does. I also checked Ministral 8B and found no issues.
Gemma 4's output is not even close to deterministic.
Gemma 4 26B:
Qwen 3.5 9B:
Ministral 8B:
After switching to CPU, everything is fine again.
Here is the self-check setup generated for this isolated test.
selfcheck.html
First Bad Commit
No response
Relevant log output
Logs