
Prompt interrupted before continuation for Unicode UTF-8 emojis #63

Closed
@loretoparisi


I have found that when the prompt contains a Unicode UTF-8 emoji character such as

Unicode Character “👍” (U+1F44D)

the prompt breaks up.
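
For reference, this character lies outside the Basic Multilingual Plane, so UTF-8 encodes it as four bytes rather than one. A quick check, using Python purely for illustration (it is not part of the project):

```python
# U+1F44D (thumbs up) is a single code point but four UTF-8 bytes.
thumbs_up = "\U0001F44D"
encoded = thumbs_up.encode("utf-8")

# Code point, byte count, and the individual bytes in hex.
print(f"U+{ord(thumbs_up):04X}", len(encoded), encoded.hex(" "))
# prints: U+1F44D 4 f0 9f 91 8d
```

Any code that walks the prompt byte by byte therefore sees four bytes here, none of which is a printable ASCII character.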

I'm reading a sample prompt from a text file:

cat prompt

Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:

Looking at the logs, I can see that the tokenizer in fact stops at the (U+1F44D) character:

(base)$ p=$(cat prompt); ./main -m ./models/13B/ggml-model-q4_0.bin -p $p -t 4 -n 512
main: seed = 1678656464
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from './models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from './models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

main: prompt: 'Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:'
main: number of tokens in prompt = 36
     1 -> ''
 27418 -> 'Tw'
  3905 -> 'ee'
 29873 -> 't'
 29901 -> ':'
   376 -> ' "'
 29902 -> 'I'
 26277 -> ' hate'
   372 -> ' it'
   746 -> ' when'
   590 -> ' my'
  9008 -> ' phone'
 16988 -> ' battery'
  2977 -> ' dies'
  1213 -> '."'
    13 -> '
'
  2008 -> 'Se'
   593 -> 'nt'
  2073 -> 'iment'
 29901 -> ':'
 12610 -> ' Neg'
  1230 -> 'ative'
    13 -> '
'
  2277 -> '##'
 29937 -> '#'
    13 -> '
'
 27418 -> 'Tw'
  3905 -> 'ee'
 29873 -> 't'
 29901 -> ':'
   376 -> ' "'
  3421 -> 'My'
  2462 -> ' day'
   756 -> ' has'
  1063 -> ' been'
 29871 -> ' '

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 10 times better than yesterday. Now I have to sleep again..."
Sentiment: Neutral
###
Twitter is not about talking; Twitter is a social network for listening and responding instantly, as the tweets of Steve Jobs demonstrate well in Figure A-2 (page ). Just be sure you can interpret the information accurately. If the sentiment isn't clearly positive or negative—as^C

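The token list ends at the space just before the emoji (36 tokens), so the rest of the prompt is dropped, resulting in a broken input prompt.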
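
To illustrate the kind of behaviour I suspect, here is a toy sketch in Python. Everything in it is hypothetical: the vocabulary, both functions, and the `<0xNN>` byte-token naming are made up for illustration and are not taken from the llama.cpp code. It only shows how a greedy matcher that gives up on an unknown character truncates the prompt, while one that falls back to UTF-8 byte tokens keeps going. Whether this is the actual cause here is an assumption on my part.

```python
# Toy illustration only: vocabulary and functions are hypothetical,
# NOT the real llama.cpp tokenizer.
TOY_VOCAB = ["My", " day", " has", " been", " ", '"']


def longest_match(text, pos, vocab):
    """Return the longest vocabulary entry starting at pos, or None."""
    return max((t for t in vocab if text.startswith(t, pos)), key=len, default=None)


def tokenize_stop_on_unknown(text, vocab):
    """Greedy longest-match that gives up at the first unmatched character."""
    tokens, pos = [], 0
    while pos < len(text):
        match = longest_match(text, pos, vocab)
        if match is None:
            break  # the remainder of the text is silently dropped
        tokens.append(match)
        pos += len(match)
    return tokens


def tokenize_with_byte_fallback(text, vocab):
    """Same matcher, but unmatched characters become UTF-8 byte tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = longest_match(text, pos, vocab)
        if match is None:
            # Emit one placeholder token per UTF-8 byte of the character.
            tokens.extend(f"<0x{b:02X}>" for b in text[pos].encode("utf-8"))
            pos += 1
        else:
            tokens.append(match)
            pos += len(match)
    return tokens


text = 'My day has been \U0001F44D"'
print(tokenize_stop_on_unknown(text, TOY_VOCAB))
print(tokenize_with_byte_fallback(text, TOY_VOCAB))
```

The first call stops after the lone `' '` token, which mirrors the log above; the second continues through the four bytes of the emoji and reaches the closing quote.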


Labels: bug, duplicate, enhancement
