fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE#115
Merged
Conversation
Closes #52. The local `GGUFTokenizer.encodeBPE()` is the SentencePiece greedy-by-score algorithm and is wrong for byte-level BPE (Qwen, GPT-2, Mistral-Nemo). It's missing all three pieces byte-level BPE actually needs: - `bytes_to_unicode()` mapping (so each input byte becomes a printable Unicode char that the Qwen vocab is keyed on, not raw UTF-8) - merge-rank-based merging (Qwen's `tokenizer.ggml.merges` list is in priority order — rank 0 wins, not highest score) - GPT-2 pretokenization regex (prevents cross-word merges) Symptom (per #52): chat-template prompts encode to broken token sequences, so the model generates CJK / URL-encoded / HTML-entity nonsense instead of answers or `<tool_call>` blocks. Upstream `sk.ainet.io.tokenizer.QwenByteLevelBpeTokenizer` (in SKaiNET 0.23.1, already pinned) is the correct seven-step pipeline matching HF transformers and llama.cpp. This PR routes Qwen-family GGUFs to it without changing the SentencePiece path. Three additions, one local interface change: 1. `UpstreamTokenizerAdapter` (`:llm-core` commonMain) — bridges upstream's `sk.ainet.io.tokenizer.Tokenizer` (nullable bos/eos, no single-token decode) into this repo's `sk.ainet.apps.llm.Tokenizer` (non-null bos/eos, single-token decode). Internal class. 2. `TokenizerFactory.fromGGUF(source)` now returns `Tokenizer` (interface) instead of `GGUFTokenizer` (concrete). Inside, it peeks `tokenizer.ggml.model` once and dispatches: - `gpt2` / `bpe` → upstream `QwenByteLevelBpeTokenizer` via adapter - everything else → existing local `GGUFTokenizer` (SentencePiece) `GGUFTokenizer.fromStreamingFields` is exposed (was private) so the factory can hand off pre-parsed fields without re-opening the source. 3. Two CLI callers migrated from `GGUFTokenizer.fromRandomAccessSource` to `TokenizerFactory.fromGGUF`: - `:llm-runtime:kllama` cli/Main.kt - `:llm-runtime:kllama` java/KLlamaJava.kt (also: `tokenizer.eosId` → `tokenizer.eosTokenId`, since variable type is now the interface) `:llm-apps:skainet-cli` already used the factory and is unaffected. The SafeTensors-via-`tokenizer.json` path (`fromTokenizerJson`) currently has no Qwen-SafeTensors caller in this repo — kgemma uses it for SentencePiece. Extending the upstream dispatch to `fromTokenizerJson` is a small follow-up if/when a Qwen SafeTensors path lands. Tests: 4 new in `UpstreamTokenizerAdapterTest` (encode/decode delegation, single-token decode, null-bos/eos fallbacks). All `:llm-core`, `:llm-runtime:kllama`, `:llm-runtime:kgemma`, `:llm-apps:skainet-cli` test suites pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 4, 2026
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #52.
Why
The local
GGUFTokenizer.encodeBPE()is the SentencePiece greedy-by-score algorithm. It's the wrong algorithm for byte-level BPE (Qwen / GPT-2 / Mistral-Nemo): missing thebytes_to_unicode()mapping, missing merge-rank-based merging, and missing GPT-2 pretokenization regex. Per #52 the symptom is chat-template prompts encoding to broken token sequences, so Qwen3 generates CJK / URL-encoded / HTML-entity nonsense in tool-calling and chat modes.The fix is already in upstream SKaiNET 0.23.1 —
sk.ainet.io.tokenizer.QwenByteLevelBpeTokenizerimplements the full seven-step HF / llama.cpp pipeline. This PR routes Qwen-family GGUFs there. Pinned SKaiNET version is unchanged.What changes
UpstreamTokenizerAdapterin:llm-core(commonMain) — bridges upstream'ssk.ainet.io.tokenizer.Tokenizer(nullable bos/eos, no single-token decode) into this repo'ssk.ainet.apps.llm.Tokenizerinterface (non-null bos/eos, single-token decode).TokenizerFactory.fromGGUF(source)now returnsTokenizer(interface) instead ofGGUFTokenizer(concrete). Inside, peekstokenizer.ggml.modelonce and dispatches:gpt2/bpe→ upstreamQwenByteLevelBpeTokenizervia adapter ✅ correctGGUFTokenizer(SentencePiece path, unchanged)GGUFTokenizer.fromStreamingFields— was private, nowpublicso the factory can hand off pre-parsed fields without re-opening the source. Doc warns external callers to prefer the factory.GGUFTokenizer.fromRandomAccessSource(source)→TokenizerFactory.fromGGUF(source)::llm-runtime:kllamacli/Main.kt:llm-runtime:kllamajava/KLlamaJava.kt(also:tokenizer.eosId→tokenizer.eosTokenId, since the variable type is now the interface):llm-apps:skainet-clialready used the factory; unaffected.Out of scope
The
fromTokenizerJsonpath (SafeTensors tokenizer.json loading) has no Qwen-SafeTensors caller in this repo today —:llm-runtime:kgemmauses it for SentencePiece, which works correctly. Extending the upstream dispatch tofromTokenizerJsonis a small follow-up if/when a Qwen SafeTensors path lands.Test plan
UpstreamTokenizerAdapterTest: encode / decode delegation, single-token decode, null-bos/eos defaults + custom fallbacks.:llm-core:jvmTest,:llm-runtime:kllama:jvmTest,:llm-runtime:kgemma:jvmTest,:llm-apps:skainet-cli:buildall pass.kllama-clionQwen2.5-0.5B-Instruct-Q8_0.ggufwith--demo \"What is 2 + 2?\". Expected coherent answer or<tool_call>block instead of CJK / URL-encoded garbage from Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors) #52.Refs the closed #46. Phase-4 readiness checklist (issue #114) still has the parity finding open as a separate concern, but #52 is no longer the byte-level-BPE blocker on that list.
🤖 Generated with Claude Code