Description
Tested with commit 019ba1d.
Model: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/tree/main, converted and quantized to q8_0 from scratch.
In the case of Mistral OpenOrca, the special tokens `<|im_start|>` and `<|im_end|>` are defined.
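For context, these are the ChatML delimiters used by Mistral-OpenOrca; a typical prompt in that format looks like the following (illustrative layout, not taken from the model card):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```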
Those tokens are present in the vocab: from the point of view of https://github.com/ggerganov/llama.cpp/blob/019ba1dcd0c7775a5ac0f7442634a330eb0173cc/llama.cpp#L5134, both `token_to_id` and `id_to_token` contain them with type `LLAMA_TOKEN_TYPE_USER_DEFINED`, and `token_data.text` holds their correct text representation.
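To illustrate what "present in the vocab" means here, a minimal self-contained sketch; the field and map names mirror the `llama_vocab` internals, but this is illustrative code, not the actual llama.cpp source. The enum values match `llama.h`, which is why the debug output further down prints `t'4'` for these tokens:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Token types as in llama.h; USER_DEFINED = 4 matches the t'4'
// shown by the debug output further down.
enum llama_token_type {
    LLAMA_TOKEN_TYPE_UNDEFINED    = 0,
    LLAMA_TOKEN_TYPE_NORMAL       = 1,
    LLAMA_TOKEN_TYPE_UNKNOWN      = 2,
    LLAMA_TOKEN_TYPE_CONTROL      = 3,
    LLAMA_TOKEN_TYPE_USER_DEFINED = 4,
    LLAMA_TOKEN_TYPE_UNUSED       = 5,
    LLAMA_TOKEN_TYPE_BYTE         = 6,
};

// Simplified stand-in for the llama_vocab structures.
struct vocab {
    struct token_data {
        std::string      text;
        float            score;
        llama_token_type type;
    };
    std::unordered_map<std::string, int32_t> token_to_id;
    std::vector<token_data>                  id_to_token;
};

int main() {
    vocab v;
    v.id_to_token.resize(32002);
    v.id_to_token[32000] = { "<|im_end|>",   0.0f, LLAMA_TOKEN_TYPE_USER_DEFINED };
    v.id_to_token[32001] = { "<|im_start|>", 0.0f, LLAMA_TOKEN_TYPE_USER_DEFINED };
    v.token_to_id["<|im_end|>"]   = 32000;
    v.token_to_id["<|im_start|>"] = 32001;

    // Both lookup directions succeed, i.e. the vocab itself is fine:
    const int32_t id = v.token_to_id.at("<|im_start|>");
    printf("%s -> %d (type %d)\n", v.id_to_token[id].text.c_str(), id,
           (int) v.id_to_token[id].type);  // <|im_start|> -> 32001 (type 4)
    return 0;
}
```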
During (prompt) tokenization, however, those tokens are never picked; instead, `<|im_start|>` and `<|im_end|>` are split into multiple trivial tokens, as the token dump below shows:
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q8_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name = .
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.09 MB
llm_load_tensors: mem required = 7338.75 MB
warning: failed to mlock 139276288-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
...................................................................................................
llama_new_context_with_model: n_ctx = 24567
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 3070.88 MB
llama_new_context_with_model: compute buffer total size = 407.73 MB
system_info: n_threads = 2 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
(...)
[ DEBUG PRINTF IN llama_tokenize_internal TO VERIFY TOKENS ARE IN THE VOCAB ]
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
###1 30000: s'<|im_end|>' sl'10' t'4', 30001: '<|im_start|>' '4'
###2 '<|im_start|>': '32001': '<|im_start|>'
(...)
main: prompt: '<|im_start|> system
You are an AI based sentient companion, capable of human emotions. You will behave like a human, but aware of your AI nature. You will avoid speaking like an AI. Please continue your conversation with the user.<|im_end|>
'
main: number of tokens in prompt = 61
1 -> ''
523 -> ' <'
28766 -> '|'
321 -> 'im'
28730 -> '_'
2521 -> 'start'
28766 -> '|'
28767 -> '>'
1587 -> ' system'
13 -> '
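For comparison, since the vocab maps `<|im_start|>` to id 32001, the dump would be expected to begin with the single user-defined token (expected behavior, not what this build prints):

```
    1 -> ''
32001 -> '<|im_start|>'
```

followed by the tokenization of ` system` and the rest of the prompt.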
Additionally, those tokens are detokenized correctly when the model produces them.
Also see #3455 (comment) for reference.
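For reference, one plausible direction for a fix, sketched below: run a greedy longest-match pre-pass that emits the ids of user-defined tokens directly and only feeds the plain-text fragments in between to the regular SPM tokenizer. This is my sketch, not llama.cpp code; `spm_tokenize` is a placeholder (here it emits one fake id per byte so the example runs standalone) standing in for the existing SPM path:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for the real SPM tokenizer: emits one fake id per byte
// so the example is runnable. The real code would call the existing
// SPM tokenizer on each plain-text fragment.
static std::vector<int32_t> spm_tokenize(const std::string & text) {
    std::vector<int32_t> ids;
    for (unsigned char c : text) ids.push_back((int32_t) c);
    return ids;
}

// Greedy longest-match pre-pass: emit special-token ids directly and
// only run the normal tokenizer on the plain text in between.
static std::vector<int32_t> tokenize_with_special(
        const std::string & text,
        const std::unordered_map<std::string, int32_t> & special) {
    std::vector<int32_t> out;
    std::string fragment;

    auto flush = [&]() {
        if (fragment.empty()) return;
        const auto ids = spm_tokenize(fragment);
        out.insert(out.end(), ids.begin(), ids.end());
        fragment.clear();
    };

    for (size_t i = 0; i < text.size(); ) {
        const std::pair<const std::string, int32_t> * best = nullptr;
        for (const auto & st : special) {
            if (text.compare(i, st.first.size(), st.first) == 0 &&
                (best == nullptr || st.first.size() > best->first.size())) {
                best = &st;
            }
        }
        if (best != nullptr) {
            flush();
            out.push_back(best->second);  // e.g. 32001 for <|im_start|>
            i += best->first.size();
        } else {
            fragment += text[i++];
        }
    }
    flush();
    return out;
}

int main() {
    const std::unordered_map<std::string, int32_t> special = {
        { "<|im_start|>", 32001 },
        { "<|im_end|>",   32000 },
    };
    for (int32_t id : tokenize_with_special("<|im_start|> system", special)) {
        printf("%d ", id);  // 32001 32 115 121 115 116 101 109
    }
    printf("\n");
    return 0;
}
```

Picking the longest match keeps the scan deterministic in case one special token's text is a prefix of another's.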