fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE by michalharakal · Pull Request #115 · SKaiNET-developers/SKaiNET-transformers

michalharakal · 2026-05-04T11:01:54Z

Closes #52.

Why

The local GGUFTokenizer.encodeBPE() is the SentencePiece greedy-by-score algorithm. It's the wrong algorithm for byte-level BPE (Qwen / GPT-2 / Mistral-Nemo): missing the bytes_to_unicode() mapping, missing merge-rank-based merging, and missing GPT-2 pretokenization regex. Per #52 the symptom is chat-template prompts encoding to broken token sequences, so Qwen3 generates CJK / URL-encoded / HTML-entity nonsense in tool-calling and chat modes.

The fix is already in upstream SKaiNET 0.23.1 — sk.ainet.io.tokenizer.QwenByteLevelBpeTokenizer implements the full seven-step HF / llama.cpp pipeline. This PR routes Qwen-family GGUFs there. Pinned SKaiNET version is unchanged.

What changes

New: UpstreamTokenizerAdapter in :llm-core (commonMain) — bridges upstream's sk.ainet.io.tokenizer.Tokenizer (nullable bos/eos, no single-token decode) into this repo's sk.ainet.apps.llm.Tokenizer interface (non-null bos/eos, single-token decode).
TokenizerFactory.fromGGUF(source) now returns Tokenizer (interface) instead of GGUFTokenizer (concrete). Inside, peeks tokenizer.ggml.model once and dispatches:
- gpt2 / bpe → upstream QwenByteLevelBpeTokenizer via adapter ✅ correct
- everything else → existing local GGUFTokenizer (SentencePiece path, unchanged)
GGUFTokenizer.fromStreamingFields — was private, now public so the factory can hand off pre-parsed fields without re-opening the source. Doc warns external callers to prefer the factory.
Two CLI callers migrated from GGUFTokenizer.fromRandomAccessSource(source) → TokenizerFactory.fromGGUF(source):
- :llm-runtime:kllama cli/Main.kt
- :llm-runtime:kllama java/KLlamaJava.kt (also: tokenizer.eosId → tokenizer.eosTokenId, since the variable type is now the interface)
- :llm-apps:skainet-cli already used the factory; unaffected.

Out of scope

The fromTokenizerJson path (SafeTensors tokenizer.json loading) has no Qwen-SafeTensors caller in this repo today — :llm-runtime:kgemma uses it for SentencePiece, which works correctly. Extending the upstream dispatch to fromTokenizerJson is a small follow-up if/when a Qwen SafeTensors path lands.

Test plan

4 new tests in UpstreamTokenizerAdapterTest: encode / decode delegation, single-token decode, null-bos/eos defaults + custom fallbacks.
No regressions: :llm-core:jvmTest, :llm-runtime:kllama:jvmTest, :llm-runtime:kgemma:jvmTest, :llm-apps:skainet-cli:build all pass.
CI green on PR.
Manual smoke (post-merge, requires real GGUF): run kllama-cli on Qwen2.5-0.5B-Instruct-Q8_0.gguf with --demo \"What is 2 + 2?\". Expected coherent answer or <tool_call> block instead of CJK / URL-encoded garbage from Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors) #52.

Refs the closed #46. Phase-4 readiness checklist (issue #114) still has the parity finding open as a separate concern, but #52 is no longer the byte-level-BPE blocker on that list.

🤖 Generated with Claude Code

Closes #52. The local `GGUFTokenizer.encodeBPE()` is the SentencePiece greedy-by-score algorithm and is wrong for byte-level BPE (Qwen, GPT-2, Mistral-Nemo). It's missing all three pieces byte-level BPE actually needs: - `bytes_to_unicode()` mapping (so each input byte becomes a printable Unicode char that the Qwen vocab is keyed on, not raw UTF-8) - merge-rank-based merging (Qwen's `tokenizer.ggml.merges` list is in priority order — rank 0 wins, not highest score) - GPT-2 pretokenization regex (prevents cross-word merges) Symptom (per #52): chat-template prompts encode to broken token sequences, so the model generates CJK / URL-encoded / HTML-entity nonsense instead of answers or `<tool_call>` blocks. Upstream `sk.ainet.io.tokenizer.QwenByteLevelBpeTokenizer` (in SKaiNET 0.23.1, already pinned) is the correct seven-step pipeline matching HF transformers and llama.cpp. This PR routes Qwen-family GGUFs to it without changing the SentencePiece path. Three additions, one local interface change: 1. `UpstreamTokenizerAdapter` (`:llm-core` commonMain) — bridges upstream's `sk.ainet.io.tokenizer.Tokenizer` (nullable bos/eos, no single-token decode) into this repo's `sk.ainet.apps.llm.Tokenizer` (non-null bos/eos, single-token decode). Internal class. 2. `TokenizerFactory.fromGGUF(source)` now returns `Tokenizer` (interface) instead of `GGUFTokenizer` (concrete). Inside, it peeks `tokenizer.ggml.model` once and dispatches: - `gpt2` / `bpe` → upstream `QwenByteLevelBpeTokenizer` via adapter - everything else → existing local `GGUFTokenizer` (SentencePiece) `GGUFTokenizer.fromStreamingFields` is exposed (was private) so the factory can hand off pre-parsed fields without re-opening the source. 3. Two CLI callers migrated from `GGUFTokenizer.fromRandomAccessSource` to `TokenizerFactory.fromGGUF`: - `:llm-runtime:kllama` cli/Main.kt - `:llm-runtime:kllama` java/KLlamaJava.kt (also: `tokenizer.eosId` → `tokenizer.eosTokenId`, since variable type is now the interface) `:llm-apps:skainet-cli` already used the factory and is unaffected. The SafeTensors-via-`tokenizer.json` path (`fromTokenizerJson`) currently has no Qwen-SafeTensors caller in this repo — kgemma uses it for SentencePiece. Extending the upstream dispatch to `fromTokenizerJson` is a small follow-up if/when a Qwen SafeTensors path lands. Tests: 4 new in `UpstreamTokenizerAdapterTest` (encode/decode delegation, single-token decode, null-bos/eos fallbacks). All `:llm-core`, `:llm-runtime:kllama`, `:llm-runtime:kgemma`, `:llm-apps:skainet-cli` test suites pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

michalharakal merged commit 7705d14 into develop May 4, 2026
2 checks passed

michalharakal deleted the fix/qwen-byte-level-bpe branch May 5, 2026 08:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE#115

fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE#115
michalharakal merged 1 commit into
developfrom
fix/qwen-byte-level-bpe

michalharakal commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

michalharakal commented May 4, 2026

Why

What changes

Out of scope

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant