Tokenizer not picking the right tokens ( mistral openorca ) #3475

Closed
@staviq

Description

Tested with commit 019ba1d.

Model: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/tree/main, converted and quantized to q8_0 from scratch.

Mistral OpenOrca defines two special tokens: <|im_start|> and <|im_end|>.

Both tokens are present in the vocab as seen from https://github.com/ggerganov/llama.cpp/blob/019ba1dcd0c7775a5ac0f7442634a330eb0173cc/llama.cpp#L5134: token_to_id and id_to_token contain them with type LLAMA_TOKEN_TYPE_USER_DEFINED, and token_data.text holds the correct text representation of each.
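To make the vocab check concrete, here is a minimal sketch of the kind of lookup described above. The types `token_type`, `token_data`, and `mock_vocab`, and the helpers `find_user_defined` and `make_example_vocab`, are simplified stand-ins invented for illustration. They are not the actual llama.cpp definitions, and the ids in the example vocab are arbitrary. The value 4 for the user-defined type is meant to match the t'4' shown in the debug output further down.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for illustration only; not the real llama.cpp types.
enum token_type : int { TOKEN_TYPE_NORMAL = 1, TOKEN_TYPE_USER_DEFINED = 4 };

struct token_data {
    std::string text;
    float       score;
    token_type  type;
};

struct mock_vocab {
    std::unordered_map<std::string, int32_t> token_to_id;
    std::vector<token_data>                  id_to_token;
};

// Returns the id of `text` if it exists in the vocab as a user-defined
// token, or -1 otherwise.
int32_t find_user_defined(const mock_vocab & vocab, const std::string & text) {
    const auto it = vocab.token_to_id.find(text);
    if (it == vocab.token_to_id.end()) {
        return -1;
    }
    const int32_t id = it->second;
    if (id < 0 || static_cast<size_t>(id) >= vocab.id_to_token.size()) {
        return -1;
    }
    return vocab.id_to_token[id].type == TOKEN_TYPE_USER_DEFINED ? id : -1;
}

// Builds a tiny example vocab (hypothetical ids 0..2).
mock_vocab make_example_vocab() {
    mock_vocab v;
    const std::vector<token_data> tokens = {
        {"start",        0.0f, TOKEN_TYPE_NORMAL},
        {"<|im_end|>",   0.0f, TOKEN_TYPE_USER_DEFINED},
        {"<|im_start|>", 0.0f, TOKEN_TYPE_USER_DEFINED},
    };
    for (const auto & t : tokens) {
        v.token_to_id[t.text] = static_cast<int32_t>(v.id_to_token.size());
        v.id_to_token.push_back(t);
    }
    // Both directions of the mapping should cover the same set of tokens.
    assert(v.token_to_id.size() == v.id_to_token.size());
    return v;
}
```

A lookup like this succeeding for both special tokens is what shows that the vocab itself is intact, so the problem has to be in how the prompt text is split.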

During prompt tokenization, however, these tokens are never picked. Instead, <|im_start|> and <|im_end|> are split into multiple trivial tokens:

llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW) 
llm_load_print_meta: general.name   = .
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: mem required  = 7338.75 MB
warning: failed to mlock 139276288-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
...................................................................................................
llama_new_context_with_model: n_ctx      = 24567
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 3070.88 MB
llama_new_context_with_model: compute buffer total size = 407.73 MB

system_info: n_threads = 2 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
(...)
[ DEBUG PRINTF IN llama_tokenize_internal TO VERIFY TOKENS ARE IN THE VOCAB ]
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
(...)

main: prompt: '<|im_start|> system
You are an AI based sentient companion, capable of human emotions. You will behave like a human, but aware of your AI nature. You will avoid speaking like an AI. Please continue your conversation with the user.<|im_end|>
'
main: number of tokens in prompt = 61
     1 -> ''
   523 -> ' <'
 28766 -> '|'
   321 -> 'im'
 28730 -> '_'
  2521 -> 'start'
 28766 -> '|'
 28767 -> '>'
  1587 -> ' system'
    13 -> '

Additionally, those tokens are detokenized correctly when the model produces them.
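One possible way to get the expected behavior, sketched here only as an illustration and not as what llama.cpp does or should do, is to split the prompt on the special token texts before handing the remaining pieces to the regular tokenizer. The `fragment` struct and `split_on_specials` function are hypothetical names. The `specials` map is assumed to be filled from the user-defined vocab entries (ids 32000 and 32001 in the log above).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One piece of the prompt: either raw text that still needs regular
// tokenization (id == -1), or an already-resolved special token id.
struct fragment {
    std::string text;
    int32_t     id;
};

// Splits `prompt` on every occurrence of a special token text.
// `specials` maps token text -> token id (hypothetical input).
std::vector<fragment> split_on_specials(
        const std::string & prompt,
        const std::unordered_map<std::string, int32_t> & specials) {
    std::vector<fragment> out;
    size_t pos = 0;
    while (pos < prompt.size()) {
        // Find the earliest special token at or after `pos`;
        // if two start at the same offset, prefer the longer one.
        size_t              best_at   = std::string::npos;
        const std::string * best_text = nullptr;
        int32_t             best_id   = -1;
        for (const auto & kv : specials) {
            assert(!kv.first.empty()); // an empty text would never advance `pos`
            const size_t at = prompt.find(kv.first, pos);
            if (at == std::string::npos) {
                continue;
            }
            if (at < best_at ||
                (at == best_at && kv.first.size() > best_text->size())) {
                best_at   = at;
                best_text = &kv.first;
                best_id   = kv.second;
            }
        }
        if (best_text == nullptr) {
            // No more special tokens: the rest is plain text.
            out.push_back({prompt.substr(pos), -1});
            break;
        }
        if (best_at > pos) {
            out.push_back({prompt.substr(pos, best_at - pos), -1});
        }
        out.push_back({*best_text, best_id});
        pos = best_at + best_text->size();
    }
    return out;
}
```

Fragments with id == -1 would then go through the normal tokenizer, while the others are emitted directly as single tokens. Whether raw prompt text should always be allowed to match special tokens is a separate design question that this sketch does not address.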

Also see #3455 (comment) for reference.