
[Feat] eliminate hard-coded record special tokens to be compatible with some models with custom vocab #3616

Open

Jaffe2718 wants to merge 1 commit into ggml-org:master from Jaffe2718:master

Conversation

@Jaffe2718

Fix #3392

{
    int32_t n_vocab = 0;
    read_safe(loader, n_vocab);
    int32_t n_common_vocab = 0;
Author
Some models, such as ggml-tiny.en.bin, include special tokens such as <|endoftext|> in the vocab, which is obviously incorrect, so these tokens should be excluded.

Comment on lines +1630 to +1648
n_common_vocab -= n_special_token; // subtract the special tokens

vocab.n_vocab = model.hparams.n_vocab; // all tokens, including special tokens

vocab.token_eot = n_common_vocab; // <|endoftext|> 50256 for en, 50257 for multilingual, others for custom model
vocab.token_sot = n_common_vocab + 1; // <|startoftranscribe|>
// [n_common_vocab + 2, vocab.n_vocab - 1507) are language tokens
// num_language = vocab.token_translate - vocab.token_sot - 1 = vocab.n_vocab - n_common_vocab - 1509
vocab.token_translate = vocab.n_vocab - 1507; // <|translate|>
vocab.token_transcribe = vocab.n_vocab - 1506; // <|transcribe|>
vocab.token_solm = vocab.n_vocab - 1505; // <|startoflm|>
vocab.token_prev = vocab.n_vocab - 1504; // <|startofprev|>
vocab.token_nosp = vocab.n_vocab - 1503; // <|nospeech|>
vocab.token_not = vocab.n_vocab - 1502; // <|notimestamps|>
vocab.token_beg = vocab.n_vocab - 1501; // timestamps from <|0.00|> to <|30.00|>, 1501 tokens

if (n_common_vocab < model.hparams.n_vocab) {
    WHISPER_LOG_INFO("%s: adding %d extra tokens\n", __func__, model.hparams.n_vocab - n_common_vocab);
    for (int i = n_common_vocab; i < model.hparams.n_vocab; i++) {
Author

I know this modifies the algorithm native whisper uses to define the vocab, but native whisper is also hardcoded. Some Whisper models with a modified vocab size, like whisper-ja-anime-v0.3 on HuggingFace, are not compatible with whisper.cpp because the index goes out of bounds when loading their vocab.



Development

Successfully merging this pull request may close these issues.

Segment fault on custom model

1 participant