[Feat] eliminate hard-coded record special tokens to be compatible with some models with custom vocab #3616
Open
Jaffe2718 wants to merge 1 commit into ggml-org:master
Conversation
…ls with different vocab size
Jaffe2718 commented on Jan 21, 2026
```cpp
{
    int32_t n_vocab = 0;
    read_safe(loader, n_vocab);
    int32_t n_common_vocab = 0;
```
Author
Some models, such as ggml-tiny.en.bin, include special tokens such as <|endoftext|> in the vocab, which is obviously incorrect, so these tokens should be excluded.
Jaffe2718 commented on Jan 21, 2026
Comment on lines +1630 to +1648
```cpp
n_common_vocab -= n_special_token; // subtract the special tokens

vocab.n_vocab = model.hparams.n_vocab; // all tokens, including special tokens

vocab.token_eot = n_common_vocab;     // <|endoftext|>: 50256 for en, 50257 for multilingual, others for custom models
vocab.token_sot = n_common_vocab + 1; // <|startoftranscript|>
// [n_common_vocab + 2, vocab.n_vocab - 1507) are language tokens
// num_language = vocab.token_translate - vocab.token_sot - 1 = vocab.n_vocab - n_common_vocab - 1509
vocab.token_translate  = vocab.n_vocab - 1507; // <|translate|>
vocab.token_transcribe = vocab.n_vocab - 1506; // <|transcribe|>
vocab.token_solm       = vocab.n_vocab - 1505; // <|startoflm|>
vocab.token_prev       = vocab.n_vocab - 1504; // <|startofprev|>
vocab.token_nosp       = vocab.n_vocab - 1503; // <|nospeech|>
vocab.token_not        = vocab.n_vocab - 1502; // <|notimestamps|>
vocab.token_beg        = vocab.n_vocab - 1501; // timestamps from <|0.00|> to <|30.00|>, 1501 tokens

if (n_common_vocab < model.hparams.n_vocab) {
    WHISPER_LOG_INFO("%s: adding %d extra tokens\n", __func__, model.hparams.n_vocab - n_common_vocab);
    for (int i = n_common_vocab; i < model.hparams.n_vocab; i++) {
```
Author
I know this modifies the algorithm that native Whisper uses to define the vocab, but native Whisper is also hardcoded. Some Whisper models with a modified vocab size, like whisper-ja-anime-v0.3 on Hugging Face, are not compatible with whisper.cpp because the index goes out of bounds when loading the vocab of these models.
Fix #3392