Skip to content

fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE#115

Merged
michalharakal merged 1 commit into
developfrom
fix/qwen-byte-level-bpe
May 4, 2026
Merged

fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE#115
michalharakal merged 1 commit into
developfrom
fix/qwen-byte-level-bpe

Conversation

@michalharakal
Copy link
Copy Markdown
Contributor

Closes #52.

Why

The local GGUFTokenizer.encodeBPE() is the SentencePiece greedy-by-score algorithm. It's the wrong algorithm for byte-level BPE (Qwen / GPT-2 / Mistral-Nemo): missing the bytes_to_unicode() mapping, missing merge-rank-based merging, and missing GPT-2 pretokenization regex. Per #52 the symptom is chat-template prompts encoding to broken token sequences, so Qwen3 generates CJK / URL-encoded / HTML-entity nonsense in tool-calling and chat modes.

The fix is already in upstream SKaiNET 0.23.1sk.ainet.io.tokenizer.QwenByteLevelBpeTokenizer implements the full seven-step HF / llama.cpp pipeline. This PR routes Qwen-family GGUFs there. Pinned SKaiNET version is unchanged.

What changes

  • New: UpstreamTokenizerAdapter in :llm-core (commonMain) — bridges upstream's sk.ainet.io.tokenizer.Tokenizer (nullable bos/eos, no single-token decode) into this repo's sk.ainet.apps.llm.Tokenizer interface (non-null bos/eos, single-token decode).
  • TokenizerFactory.fromGGUF(source) now returns Tokenizer (interface) instead of GGUFTokenizer (concrete). Inside, peeks tokenizer.ggml.model once and dispatches:
    • gpt2 / bpe → upstream QwenByteLevelBpeTokenizer via adapter ✅ correct
    • everything else → existing local GGUFTokenizer (SentencePiece path, unchanged)
  • GGUFTokenizer.fromStreamingFields — was private, now public so the factory can hand off pre-parsed fields without re-opening the source. Doc warns external callers to prefer the factory.
  • Two CLI callers migrated from GGUFTokenizer.fromRandomAccessSource(source)TokenizerFactory.fromGGUF(source):
    • :llm-runtime:kllama cli/Main.kt
    • :llm-runtime:kllama java/KLlamaJava.kt (also: tokenizer.eosIdtokenizer.eosTokenId, since the variable type is now the interface)
    • :llm-apps:skainet-cli already used the factory; unaffected.

Out of scope

The fromTokenizerJson path (SafeTensors tokenizer.json loading) has no Qwen-SafeTensors caller in this repo today — :llm-runtime:kgemma uses it for SentencePiece, which works correctly. Extending the upstream dispatch to fromTokenizerJson is a small follow-up if/when a Qwen SafeTensors path lands.

Test plan

  • 4 new tests in UpstreamTokenizerAdapterTest: encode / decode delegation, single-token decode, null-bos/eos defaults + custom fallbacks.
  • No regressions: :llm-core:jvmTest, :llm-runtime:kllama:jvmTest, :llm-runtime:kgemma:jvmTest, :llm-apps:skainet-cli:build all pass.
  • CI green on PR.
  • Manual smoke (post-merge, requires real GGUF): run kllama-cli on Qwen2.5-0.5B-Instruct-Q8_0.gguf with --demo \"What is 2 + 2?\". Expected coherent answer or <tool_call> block instead of CJK / URL-encoded garbage from Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors) #52.

Refs the closed #46. Phase-4 readiness checklist (issue #114) still has the parity finding open as a separate concern, but #52 is no longer the byte-level-BPE blocker on that list.

🤖 Generated with Claude Code

Closes #52.

The local `GGUFTokenizer.encodeBPE()` is the SentencePiece greedy-by-score
algorithm and is wrong for byte-level BPE (Qwen, GPT-2, Mistral-Nemo).
It's missing all three pieces byte-level BPE actually needs:
- `bytes_to_unicode()` mapping (so each input byte becomes a printable
  Unicode char that the Qwen vocab is keyed on, not raw UTF-8)
- merge-rank-based merging (Qwen's `tokenizer.ggml.merges` list is in
  priority order — rank 0 wins, not highest score)
- GPT-2 pretokenization regex (prevents cross-word merges)

Symptom (per #52): chat-template prompts encode to broken token sequences,
so the model generates CJK / URL-encoded / HTML-entity nonsense instead of
answers or `<tool_call>` blocks.

Upstream `sk.ainet.io.tokenizer.QwenByteLevelBpeTokenizer` (in SKaiNET
0.23.1, already pinned) is the correct seven-step pipeline matching HF
transformers and llama.cpp. This PR routes Qwen-family GGUFs to it
without changing the SentencePiece path.

Three additions, one local interface change:

1. `UpstreamTokenizerAdapter` (`:llm-core` commonMain) — bridges upstream's
   `sk.ainet.io.tokenizer.Tokenizer` (nullable bos/eos, no single-token
   decode) into this repo's `sk.ainet.apps.llm.Tokenizer` (non-null bos/eos,
   single-token decode). Internal class.

2. `TokenizerFactory.fromGGUF(source)` now returns `Tokenizer` (interface)
   instead of `GGUFTokenizer` (concrete). Inside, it peeks
   `tokenizer.ggml.model` once and dispatches:
     - `gpt2` / `bpe` → upstream `QwenByteLevelBpeTokenizer` via adapter
     - everything else → existing local `GGUFTokenizer` (SentencePiece)
   `GGUFTokenizer.fromStreamingFields` is exposed (was private) so the
   factory can hand off pre-parsed fields without re-opening the source.

3. Two CLI callers migrated from `GGUFTokenizer.fromRandomAccessSource` to
   `TokenizerFactory.fromGGUF`:
     - `:llm-runtime:kllama` cli/Main.kt
     - `:llm-runtime:kllama` java/KLlamaJava.kt (also: `tokenizer.eosId`
       → `tokenizer.eosTokenId`, since variable type is now the interface)
   `:llm-apps:skainet-cli` already used the factory and is unaffected.

The SafeTensors-via-`tokenizer.json` path (`fromTokenizerJson`) currently
has no Qwen-SafeTensors caller in this repo — kgemma uses it for
SentencePiece. Extending the upstream dispatch to `fromTokenizerJson` is
a small follow-up if/when a Qwen SafeTensors path lands.

Tests: 4 new in `UpstreamTokenizerAdapterTest` (encode/decode delegation,
single-token decode, null-bos/eos fallbacks). All `:llm-core`,
`:llm-runtime:kllama`, `:llm-runtime:kgemma`, `:llm-apps:skainet-cli`
test suites pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors) Wire QwenNetworkLoader into CLI for proper Qwen3 inference

1 participant