This issue follows on from the discussions we had at the end of @strutive07's PR, which added support for tokenizer.json: #3633
Summary
Llama and Mistral models converted to GGUF from `tokenizer.json` exhibit an issue with newlines, printing `<0x0A>` instead of `\n`. The issue does not exist when `tokenizer.model` is used for the same model.
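For background, `<0x0A>` is a SentencePiece byte-fallback piece: a vocabulary entry whose text names the single byte it stands for (0x0A is the newline byte). A detokenizer is expected to map such pieces back to raw bytes, roughly as in this minimal sketch (illustrative only, not llama.cpp's actual code):

```python
import re

def render_piece(piece: str) -> str:
    """Render a SentencePiece vocabulary piece as output text."""
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
    if m:
        # Byte-fallback piece: emit the raw byte it names.
        # (Real decoders accumulate bytes and UTF-8 decode them;
        # chr() is enough for a single-byte ASCII case like newline.)
        return chr(int(m.group(1), 16))
    # Ordinary piece: turn SentencePiece's U+2581 word-boundary marker into a space.
    return piece.replace("\u2581", " ")

print(repr(render_piece("<0x0A>")))  # "'\n'" -- the bug prints the piece literally instead
```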
This is a problem for some new fine-tunes which do not include `tokenizer.model`. Sometimes that is simply a mistake, and the base model's file can be used. But in some cases the models have extended or changed the vocab in `tokenizer.json`, and a new SPM model would need to be created. (Something I've not yet been able to figure out how to do.)
Steps to reproduce
- Download any Llama or Mistral 7B repo which contains both `tokenizer.model` and `tokenizer.json`:

```
pip3 install --upgrade 'huggingface-hub>=0.18'     # if not installed
huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir test-mistral --local-dir-use-symlinks False
```
- Run `convert.py` on it, and verify that the output is as expected. Because `tokenizer.model` is present, it is used in preference to `tokenizer.json`, and no issue occurs (a sketch of this preference follows the output below):

```
$ ls -al /workspace/test-mistral/tokenizer.model
-rw-rw-r-- 1 quant quant 482K Dec 24 17:14 /workspace/test-mistral/tokenizer.model
$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/with-tokenizer.model.fp16.gguf
$ ./main -m /workspace/test-mistral/with-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
 A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the
```
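For clarity, the preference works roughly like the sketch below (a hypothetical mirror of `convert.py`'s behavior, not its actual code; the directory path is the one used above):

```python
from pathlib import Path

def pick_vocab_file(model_dir: str) -> Path:
    # Hypothetical mirror of convert.py's file preference: the
    # SentencePiece model wins whenever both tokenizer files exist,
    # so the tokenizer.json code path only runs when tokenizer.model
    # is absent (or renamed, as in the next step).
    d = Path(model_dir)
    if (d / "tokenizer.model").exists():
        return d / "tokenizer.model"
    if (d / "tokenizer.json").exists():
        return d / "tokenizer.json"
    raise FileNotFoundError(f"no tokenizer.model or tokenizer.json in {d}")

print(pick_vocab_file("/workspace/test-mistral"))  # -> .../tokenizer.model
```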
- Remove `tokenizer.model` to force `tokenizer.json` to be used, and re-run `convert.py`:

```
$ mv /workspace/test-mistral/tokenizer.model /workspace/test-mistral/dead.tokenizer.model
$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/no-tokenizer.model.fp16.gguf
```
- Test inference and note that `\n` is now represented as `<0x0A>` in the output (a check against `tokenizer.json` itself follows below):

```
$ ./main -m /workspace/test-mistral/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
 A haiku example is 5-7-5 syllables.<0x0A><0x0A>The first line has five syllables, the second line has seven syllables and the
```
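The piece text itself is correct; `tokenizer.json` knows that ID 13 is a byte-fallback token, as a quick check with the `tokenizers` library shows (run against the file downloaded above):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("/workspace/test-mistral/tokenizer.json")
print(tok.id_to_token(13))     # '<0x0A>' -- the byte-fallback piece for newline
print(repr(tok.decode([13])))  # "'\n'"   -- the ByteFallback decoder maps it back
```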
Testing the same input with Hugging Face transformers does not show the issue (note the two 13 tokens, which decode back to the two newlines):
```
In [1]: import os
   ...: from transformers import AutoTokenizer
   ...: print ("tokenizer.model exists:", os.path.exists("/workspace/test-mistral/tokenizer.model"))
   ...: tokenizer = AutoTokenizer.from_pretrained("/workspace/test-mistral/")
   ...: encoded = tokenizer(""" A haiku example is 5-7-5 syllables.
   ...:
   ...: The first line has five syllables, the second line has seven syllables and the""")
   ...: print(f"Tokens: {encoded.input_ids}")
   ...: print(f"Decoded again: '{tokenizer.decode(encoded.input_ids)}'")
tokenizer.model exists: False
Tokens: [1, 28705, 330, 3631, 23550, 2757, 349, 28705, 28782, 28733, 28787, 28733, 28782, 5747, 584, 2561, 28723, 13, 13, 1014, 907, 1407, 659, 3359, 5747, 584, 2561, 28725, 272, 1676, 1407, 659, 6671, 5747, 584, 2561, 304, 272]
Decoded again: '<s>  A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the'

In [2]: tokenizer.decode(13)
Out[2]: '\n'
```
Llama example
```
$ huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir test-llama2 --local-dir-use-symlinks False
$ mv test-llama2/tokenizer.model test-llama2/dead.tokenizer.model
$ python3 ./convert.py /workspace/test-llama2 --outtype f16 --outfile /workspace/test-llama2/no-tokenizer.model.fp16.gguf
$ ./main -m /workspace/test-llama2/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
A haiku example is 5 syllables, 7 syllables, and 5 syllables.<0x0A>A haiku is a traditional form of Japanese poetry that
```
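For reference, one guess at where the two conversion paths diverge: GGUF stores a per-token type, and llama.cpp only renders `<0xNN>` pieces as raw bytes when they are flagged as byte tokens. The sketch below shows the kind of classification a `tokenizer.json`-based converter would need to apply (it uses the `gguf` Python package's `TokenType` enum; this is a hypothesis, not a confirmed diagnosis):

```python
import re
import gguf  # the gguf Python package shipped with llama.cpp

def classify_piece(piece: str) -> gguf.TokenType:
    # Hypothetical classification for tokens read from tokenizer.json:
    # byte-fallback pieces must be flagged BYTE, otherwise llama.cpp
    # treats '<0x0A>' as ordinary text and prints it literally.
    if re.fullmatch(r"<0x[0-9A-Fa-f]{2}>", piece):
        return gguf.TokenType.BYTE
    if piece in ("<unk>", "<s>", "</s>"):
        return gguf.TokenType.CONTROL
    return gguf.TokenType.NORMAL

print(classify_piece("<0x0A>"))  # TokenType.BYTE
```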