Name and Version
b8660 (working), b8661 (broken)
Operating systems
Windows
GGML backends
CUDA
Hardware
NVIDIA GeForce RTX 5060 Ti 16GB
Models
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf
Problem description & steps to reproduce
After updating from b8660 to b8661, Gemma 4 started returning empty strings ('') in place of Unicode characters (specifically chess piece symbols, and likely other multi-byte UTF-8 characters) on Windows. Rolling back to b8660 fixes it immediately. Same GGUF, same command, same hardware.
The startup logs point to the cause: b8661 loads the model with vocab type = BPE and n_merges = 514906, while b8660 loads the exact same GGUF with vocab type = SPM and n_merges = 0. The reported LF token changes as well (107 on b8661, 248 on b8660). Something in PR #21406 (or its interaction with #21343) appears to have changed the tokenizer routing on Windows, switching it from SPM to BPE mode. The BPE path then appears to silently drop multi-byte UTF-8 sequences on output.
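For reference, the symbols that come back empty are all three-byte UTF-8 sequences, which is consistent with the multi-byte hypothesis above. A minimal sketch to confirm this in Node.js or a browser console (the helper name `utf8Hex` is made up for illustration and is not part of llama.cpp):

```javascript
// Hypothetical helper (not a llama.cpp API): UTF-8 bytes of a string as hex.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// The three symbols that b8661 returns as '' in the example prompt.
for (const piece of ["♙", "♜", "♞"]) {
  console.log(piece, utf8Hex(piece));
}
```

Each symbol encodes to three bytes starting with `e2 99`, so none of them can be represented by a single-byte token.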
Steps to reproduce:
Example Prompt:
const PIECES = {
wP: '', wR: '♖', wN: '♘', wB: '♗', wQ: '♕', wK: '♔',
bP: '♟', bR: '', bN: '', bB: '♝', bQ: '♛', bK: '♚'
};
fill the empty pieces
b8661 output (broken, entries left empty):
const PIECES = {
wP: '', wR: '♖', wN: '♘', wB: '♗', wQ: '♕', wK: '♔',
bP: '♟', bR: '', bN: '', bB: '♝', bQ: '♛', bK: '♚'
};
b8660 output (working, entries filled):
const PIECES = {
wP: '♙', wR: '♖', wN: '♘', wB: '♗', wQ: '♕', wK: '♔',
bP: '♟', bR: '♜', bN: '♞', bB: '♝', bQ: '♛', bK: '♚'
};
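To check a build without reading the output by eye, something like the following sketch could be run on the model's reply. It assumes the reply keeps the `key: 'value'` shape of the prompt; `findEmptyPieces` is a hypothetical helper name, not part of any tool.

```javascript
// Hypothetical check: list the keys whose value is still an empty string literal.
function findEmptyPieces(output) {
  const empty = [];
  const pattern = /(\w+):\s*''/g; // e.g. matches "wP: ''" but not "wR: '♖'"
  let match;
  while ((match = pattern.exec(output)) !== null) {
    empty.push(match[1]);
  }
  return empty;
}

// Reply observed on b8661 (copied from the report above).
const brokenReply = `const PIECES = {
wP: '', wR: '♖', wN: '♘', wB: '♗', wQ: '♕', wK: '♔',
bP: '♟', bR: '', bN: '', bB: '♝', bQ: '♛', bK: '♚'
};`;

console.log(findEmptyPieces(brokenReply));
```

On the b8661 reply this lists `wP`, `bR` and `bN`. On the b8660 reply the list is empty.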
First Bad Commit
PR #21406 (b8661) — "llama: add custom newline split for Gemma 4"
Relevant log output
b8661 (broken, Windows):
print_info: vocab type = BPE
print_info: n_merges = 514906
print_info: LF token = 107 '<actual newline>'
b8660 (working, Windows and Linux):
print_info: vocab type = SPM
print_info: n_merges = 0
print_info: LF token = 248 '<0x0A>'
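The difference between the two logs can also be extracted mechanically, which may help when bisecting other builds. This is a sketch only: it assumes the log lines have the `print_info: key = value` shape quoted above, and `parsePrintInfo` and `diffFields` are hypothetical helper names.

```javascript
// Hypothetical helper: collect "print_info: key = value" lines into an object.
function parsePrintInfo(log) {
  const fields = {};
  for (const line of log.split(/\r?\n/)) {
    const m = line.match(/^print_info:\s*(.+?)\s*=\s*(.*)$/);
    if (m) fields[m[1]] = m[2].trim();
  }
  return fields;
}

// Hypothetical helper: sorted names of the fields whose values differ.
function diffFields(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys].filter((k) => a[k] !== b[k]).sort();
}

// Log excerpts copied from this report.
const logB8661 = `print_info: vocab type = BPE
print_info: n_merges = 514906
print_info: LF token = 107 '<actual newline>'`;
const logB8660 = `print_info: vocab type = SPM
print_info: n_merges = 0
print_info: LF token = 248 '<0x0A>'`;

console.log(diffFields(parsePrintInfo(logB8661), parsePrintInfo(logB8660)));
```

For the two excerpts above, all three fields (`LF token`, `n_merges`, `vocab type`) are reported as different.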