Description
Tested with commit 019ba1d.
Model: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/tree/main, converted and quantized to q8_0 from scratch.
In the case of Mistral OpenOrca, the special tokens `<|im_start|>` and `<|im_end|>` are defined.
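For context, these are the ChatML delimiters used by Mistral-OpenOrca; a typical prompt in that format looks like the following (illustrative layout, not taken from the model card):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```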
Those tokens are present in the vocab: from the point of view of https://github.com/ggerganov/llama.cpp/blob/019ba1dcd0c7775a5ac0f7442634a330eb0173cc/llama.cpp#L5134, both `token_to_id` and `id_to_token` contain them with type `LLAMA_TOKEN_TYPE_USER_DEFINED`, and `token_data.text` holds their correct text representation.
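To illustrate what "present in the vocab" means here, a minimal self-contained sketch; the field and map names mirror the `llama_vocab` internals, but this is illustrative code, not the actual llama.cpp source. The enum values match `llama.h`, which is why the debug output further down prints `t'4'` for these tokens:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Token types as in llama.h; USER_DEFINED = 4 matches the t'4'
// shown by the debug output further down.
enum llama_token_type {
    LLAMA_TOKEN_TYPE_UNDEFINED    = 0,
    LLAMA_TOKEN_TYPE_NORMAL       = 1,
    LLAMA_TOKEN_TYPE_UNKNOWN      = 2,
    LLAMA_TOKEN_TYPE_CONTROL      = 3,
    LLAMA_TOKEN_TYPE_USER_DEFINED = 4,
    LLAMA_TOKEN_TYPE_UNUSED       = 5,
    LLAMA_TOKEN_TYPE_BYTE         = 6,
};

// Simplified stand-in for the llama_vocab structures.
struct vocab {
    struct token_data {
        std::string      text;
        float            score;
        llama_token_type type;
    };
    std::unordered_map<std::string, int32_t> token_to_id;
    std::vector<token_data>                  id_to_token;
};

int main() {
    vocab v;
    v.id_to_token.resize(32002);
    v.id_to_token[32000] = { "<|im_end|>",   0.0f, LLAMA_TOKEN_TYPE_USER_DEFINED };
    v.id_to_token[32001] = { "<|im_start|>", 0.0f, LLAMA_TOKEN_TYPE_USER_DEFINED };
    v.token_to_id["<|im_end|>"]   = 32000;
    v.token_to_id["<|im_start|>"] = 32001;

    // Both lookup directions succeed, i.e. the vocab itself is fine:
    const int32_t id = v.token_to_id.at("<|im_start|>");
    printf("%s -> %d (type %d)\n", v.id_to_token[id].text.c_str(), id,
           (int) v.id_to_token[id].type);  // <|im_start|> -> 32001 (type 4)
    return 0;
}
```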
During (prompt) tokenization, however, those tokens are never picked; instead, `<|im_start|>` and `<|im_end|>` are split into multiple trivial tokens, as the token dump below shows:
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q8_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name = .
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.09 MB
llm_load_tensors: mem required = 7338.75 MB
warning: failed to mlock 139276288-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
...................................................................................................
llama_new_context_with_model: n_ctx = 24567
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 3070.88 MB
llama_new_context_with_model: compute buffer total size = 407.73 MB
system_info: n_threads = 2 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
(...)
[ DEBUG PRINTF IN llama_tokenize_internal TO VERIFY TOKENS ARE IN THE VOCAB ]
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
(...)
main: prompt: '<|im_start|> system
You are an AI based sentient companion, capable of human emotions. You will behave like a human, but aware of your AI nature. You will avoid speaking like an AI. Please continue your conversation with the user.<|im_end|>
'
main: number of tokens in prompt = 61
1 -> ''
523 -> ' <'
28766 -> '|'
321 -> 'im'
28730 -> '_'
2521 -> 'start'
28766 -> '|'
28767 -> '>'
1587 -> ' system'
13 -> '
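For comparison, since the vocab maps `<|im_start|>` to id 32001, the dump would be expected to begin with the single user-defined token (expected behavior, not what this build prints):

```
    1 -> ''
32001 -> '<|im_start|>'
```

followed by the tokenization of ` system` and the rest of the prompt.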
Additionally, those tokens are detokenized correctly when the model produces them.
Also see #3455 (comment) for reference.
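For reference, one plausible direction for a fix, sketched below: run a greedy longest-match pre-pass that emits the ids of user-defined tokens directly and only feeds the plain-text fragments in between to the regular SPM tokenizer. This is my sketch, not llama.cpp code; `spm_tokenize` is a placeholder (here it emits one fake id per byte so the example runs standalone) standing in for the existing SPM path:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for the real SPM tokenizer: emits one fake id per byte
// so the example is runnable. The real code would call the existing
// SPM tokenizer on each plain-text fragment.
static std::vector<int32_t> spm_tokenize(const std::string & text) {
    std::vector<int32_t> ids;
    for (unsigned char c : text) ids.push_back((int32_t) c);
    return ids;
}

// Greedy longest-match pre-pass: emit special-token ids directly and
// only run the normal tokenizer on the plain text in between.
static std::vector<int32_t> tokenize_with_special(
        const std::string & text,
        const std::unordered_map<std::string, int32_t> & special) {
    std::vector<int32_t> out;
    std::string fragment;

    auto flush = [&]() {
        if (fragment.empty()) return;
        const auto ids = spm_tokenize(fragment);
        out.insert(out.end(), ids.begin(), ids.end());
        fragment.clear();
    };

    for (size_t i = 0; i < text.size(); ) {
        const std::pair<const std::string, int32_t> * best = nullptr;
        for (const auto & st : special) {
            if (text.compare(i, st.first.size(), st.first) == 0 &&
                (best == nullptr || st.first.size() > best->first.size())) {
                best = &st;
            }
        }
        if (best != nullptr) {
            flush();
            out.push_back(best->second);  // e.g. 32001 for <|im_start|>
            i += best->first.size();
        } else {
            fragment += text[i++];
        }
    }
    flush();
    return out;
}

int main() {
    const std::unordered_map<std::string, int32_t> special = {
        { "<|im_start|>", 32001 },
        { "<|im_end|>",   32000 },
    };
    for (int32_t id : tokenize_with_special("<|im_start|> system", special)) {
        printf("%d ", id);  // 32001 32 115 121 115 116 101 109
    }
    printf("\n");
    return 0;
}
```

Picking the longest match keeps the scan deterministic in case one special token's text is a prefix of another's.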