Auralis Tokenizer (Helix v2)

A 200,000-token SentencePiece tokenizer (Unigram) built for the Auralis / Helix German-primary language model. It is trained for German, English, and code in one vocabulary, with byte-fallback (zero unknowns) and a built-in chat template.

If you are building a German or German/English LLM and want a ready, low-fertility 200k tokenizer instead of training your own, take this one.

Why it exists

German is expensive for English-tuned tokenizers: long compound words get shredded into many tokens. A vocabulary trained with German as a first-class language encodes German text more densely, which effectively fits more text into each context window and makes inference per sentence cheaper and faster. This tokenizer was trained on ~15.5 GB of mixed, cleaned corpus (German + English + code) to get that density without giving up English or code.

Measured fertility (lower = denser; from `tokenizer/quality_report.md`)

Language	Fertility	Unknown rate
German	133.8 tokens / 100 words	0.000000
English	123.0 tokens / 100 words	0.000000
Code	313.6 tokens / KB	0.000000

Chat-template round-trip: byte-exact. Zero unknown tokens (byte-fallback).
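
For orientation, here is a minimal sketch of how the two fertility metrics in the table can be computed. It is not the code behind `tokenizer/quality_report.md`: the word definition (whitespace split), the KB size (1024 bytes), and the helper names `tokens_per_100_words`, `tokens_per_kb`, and `toy_encode` are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Any function that maps text to a sequence of token ids.
Encoder = Callable[[str], Sequence[int]]


def tokens_per_100_words(encode: Encoder, text: str) -> float:
    """Prose fertility: tokens emitted per 100 whitespace-separated words.

    Assumption: a "word" is a whitespace-delimited chunk; the real report
    may count words differently.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return 100.0 * len(encode(text)) / len(words)


def tokens_per_kb(encode: Encoder, text: str, bytes_per_kb: int = 1024) -> float:
    """Code fertility: tokens emitted per KB of UTF-8 encoded text.

    Assumption: 1 KB = 1024 bytes; pass bytes_per_kb=1000 for decimal KB.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text is empty")
    return bytes_per_kb * len(encode(text)) / n_bytes


def toy_encode(text: str) -> list[int]:
    """Stand-in encoder (one token per 4 characters) so the sketch runs
    without the model file. Replace it with sp.EncodeAsIds in practice."""
    return [0] * ((len(text) + 3) // 4)


sample = "Die Katze sitzt auf der Matte"
print(tokens_per_100_words(toy_encode, sample))
print(tokens_per_kb(toy_encode, "def f(x):\n    return x + 1\n"))
```

With the real tokenizer loaded, passing `sp.EncodeAsIds` as `encode` gives numbers in the same units as the table, though not necessarily the same values, since the evaluation texts differ.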

What's in here

tokenizer/
  helix_v2_tokenizer.model     # the SentencePiece model — load this
  helix_v2_tokenizer.vocab     # human-readable vocab (piece \t score)
  quality_report.md            # fertility + round-trip report
  training_manifest.yaml       # full training record (args, sizes, timing)
configs/tokenizer/helix_v2.yaml          # training config
scripts/tokenizer/train_tokenizer.py     # reproduce the tokenizer
scripts/eval/tokenizer_fertility.py      # benchmark fertility vs tiktoken / Llama-3

Usage

import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="tokenizer/helix_v2_tokenizer.model")

ids = sp.EncodeAsIds("Die Donaudampfschifffahrtsgesellschaft fährt heute nicht.")
print(len(ids), ids)
print(sp.DecodeIds(ids))            # byte-exact round-trip
print(sp.GetPieceSize())            # 200000
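
If you want to verify the round-trip claim on your own data, a small helper like the following may be useful. It is a sketch, not part of this repository: `find_round_trip_failures` and `load_tokenizer` are hypothetical names, and the only SentencePiece calls used are the ones shown above (`EncodeAsIds`, `DecodeIds`).

```python
def find_round_trip_failures(sp, texts):
    """Return the texts that do not survive encode -> decode unchanged.

    `sp` is any object with EncodeAsIds(text) and DecodeIds(ids), for
    example a loaded sentencepiece.SentencePieceProcessor.
    """
    failures = []
    for text in texts:
        ids = sp.EncodeAsIds(text)
        if sp.DecodeIds(ids) != text:
            failures.append(text)
    return failures


def load_tokenizer(model_path="tokenizer/helix_v2_tokenizer.model"):
    """Load the model; sentencepiece is imported lazily so the helper above
    can be used and tested without it."""
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=model_path)


# Example (requires sentencepiece and the model file):
#   sp = load_tokenizer()
#   samples = ["Straße, Äpfel, Übung", "def f(x):\n\treturn x  # tab + spaces"]
#   print(find_round_trip_failures(sp, samples))   # an empty list means all passed
```

Samples with tabs, repeated spaces, and umlauts are the interesting cases, since those are where normalization or whitespace handling would show up as a mismatch.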

Special tokens (chat template)

The first pieces in the vocabulary are reserved for control tokens and the chat template:

<pad> <unk> <s> </s>
<|system|> <|user|> <|assistant|> <|end|>
<think> </think>

(IDs 0-3 are pad/unk/bos/eos; the chat-template pieces follow. See tokenizer/training_manifest.yaml for the full list and exact IDs.)
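
The IDs can also be read straight from the loaded model instead of the manifest. The sketch below assumes the pieces are spelled exactly as listed above; `CHAT_PIECES` and `special_token_ids` are hypothetical names, and it relies on SentencePiece's `PieceToId` returning the unknown id for a piece that is not in the vocabulary.

```python
# Pieces as listed in this README; adjust if the manifest differs.
CHAT_PIECES = [
    "<|system|>", "<|user|>", "<|assistant|>", "<|end|>",
    "<think>", "</think>",
]


def special_token_ids(sp, pieces=CHAT_PIECES):
    """Map each special piece to its id.

    `sp` is a loaded sentencepiece.SentencePieceProcessor (or any object
    with unk_id() and PieceToId()). A piece that resolves to the unknown
    id is treated as missing from the vocabulary.
    """
    unk = sp.unk_id()
    ids = {}
    for piece in pieces:
        piece_id = sp.PieceToId(piece)
        if piece_id == unk:
            raise KeyError(f"piece not in vocabulary: {piece!r}")
        ids[piece] = piece_id
    return ids


# Example (requires sentencepiece and the model file):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer/helix_v2_tokenizer.model")
#   print(special_token_ids(sp))
```

Failing loudly on a missing piece is deliberate: a silently unknown chat marker would otherwise be encoded through byte-fallback and break the template.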

Reproduce it

pip install -r requirements.txt
# point the config/CLI at your own cleaned corpus, then:
python scripts/tokenizer/train_tokenizer.py --help

The tokenizer was trained as a SentencePiece Unigram model with character_coverage=0.99995, byte_fallback=true, normalization_rule_name=identity, and remove_extra_whitespaces=false (whitespace is preserved, which matters for code).
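
As a rough guide, the settings above map onto the SentencePiece Python trainer as sketched below. This is not a copy of `scripts/tokenizer/train_tokenizer.py`: the function names, the id assignments taken from the special-tokens section, and the choice to register the chat pieces as user-defined symbols are assumptions, and the real config in `configs/tokenizer/helix_v2.yaml` may set further options.

```python
def build_trainer_kwargs(corpus_path, model_prefix):
    """Collect trainer arguments matching the settings stated in this README.

    Assumption: anything not listed in the README is left at the
    SentencePiece default.
    """
    return dict(
        input=corpus_path,                  # plain-text corpus, one sentence per line
        model_prefix=model_prefix,          # writes <prefix>.model and <prefix>.vocab
        model_type="unigram",
        vocab_size=200_000,
        character_coverage=0.99995,
        byte_fallback=True,                 # unseen characters fall back to byte pieces
        normalization_rule_name="identity", # no Unicode normalization
        remove_extra_whitespaces=False,     # keep runs of spaces (matters for code)
        pad_id=0, unk_id=1, bos_id=2, eos_id=3,
        user_defined_symbols=[
            "<|system|>", "<|user|>", "<|assistant|>", "<|end|>",
            "<think>", "</think>",
        ],
    )


def train(corpus_path, model_prefix):
    """Run training; sentencepiece is imported lazily so the kwargs can be
    inspected without it installed."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(corpus_path, model_prefix))


# Example: train("corpus.txt", "my_tokenizer")
```

Keeping the arguments in a plain dict makes it easy to diff them against `tokenizer/training_manifest.yaml` before starting a long training run.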

Benchmark fertility against other tokenizers

python scripts/eval/tokenizer_fertility.py --tokenizer tokenizer/helix_v2_tokenizer.model
# optional comparisons load if installed: tiktoken (o200k/cl100k), Llama-3 via transformers
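
The idea behind the optional comparisons can be sketched as follows. This is an illustration rather than the contents of `scripts/eval/tokenizer_fertility.py`: `load_optional_encoders` and `compare_fertility` are hypothetical names, words are counted by whitespace split, and only the tiktoken baselines are shown (the first call may download the encoding files).

```python
def load_optional_encoders():
    """Return {name: encode_fn} for the tiktoken baselines, if installed.

    encode_ordinary is used so that text which happens to contain a
    special-token string is encoded as plain text instead of raising.
    """
    encoders = {}
    try:
        import tiktoken
    except ImportError:
        return encoders  # baseline not installed; skip it
    for name in ("o200k_base", "cl100k_base"):
        encoders[name] = tiktoken.get_encoding(name).encode_ordinary
    return encoders


def compare_fertility(encoders, text):
    """Return (name, tokens per 100 words) pairs, densest tokenizer first."""
    n_words = len(text.split())
    if n_words == 0:
        raise ValueError("text contains no words")
    rows = [
        (name, 100.0 * len(encode(text)) / n_words)
        for name, encode in encoders.items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Example (requires sentencepiece and the model file):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer/helix_v2_tokenizer.model")
#   encoders = {"helix_v2": sp.EncodeAsIds, **load_optional_encoders()}
#   for name, fertility in compare_fertility(encoders, open("sample_de.txt").read()):
#       print(f"{name}\t{fertility:.1f}")
```

Here `sample_de.txt` is a placeholder for your own evaluation text; comparisons are only meaningful when every tokenizer sees the same text.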

Requirements

Python 3.10+, sentencepiece (load/train), PyYAML (training config)
Optional for the fertility benchmark: tiktoken, transformers

License

Apache-2.0. The tokenizer model is released for reuse; it does not contain the training corpus. Built as part of the Auralis / Helix project.
