This issue follows on from the discussions we had at the end of @strutive07's PR, which added support for tokenizer.json: #3633
Summary
Llama and Mistral models converted to GGUF from `tokenizer.json` exhibit an issue with newlines, printing `<0x0A>` instead of `\n`. The issue does not exist when `tokenizer.model` is used for the same model.
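For background, `<0x0A>` is a SentencePiece byte-fallback piece: a vocabulary entry whose text names the single byte it stands for (0x0A is the newline byte). A detokenizer is expected to map such pieces back to raw bytes, roughly as in this minimal sketch (illustrative only, not llama.cpp's actual code):

```python
import re

def render_piece(piece: str) -> str:
    """Render a SentencePiece vocabulary piece as output text."""
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
    if m:
        # Byte-fallback piece: emit the raw byte it names.
        # (Real decoders accumulate bytes and UTF-8 decode them;
        # chr() is enough for a single-byte ASCII case like newline.)
        return chr(int(m.group(1), 16))
    # Ordinary piece: turn SentencePiece's U+2581 word-boundary marker into a space.
    return piece.replace("\u2581", " ")

print(repr(render_piece("<0x0A>")))  # "'\n'" -- the bug prints the piece literally instead
```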
This is a problem for some new fine-tunes which do not include `tokenizer.model`. Sometimes that is simply a mistake, and the base model's file can be used. But in some cases the models have extended or changed the vocab in `tokenizer.json`, and a new SPM model would need to be created. (Something I've not yet been able to figure out how to do.)
Steps to reproduce
- Download any Llama or Mistral 7B repo which contains both `tokenizer.model` and `tokenizer.json`:

```
pip3 install --upgrade 'huggingface-hub>=0.18'     # if not installed
huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir test-mistral --local-dir-use-symlinks False
```
- Run `convert.py` on it, and verify that the output is as expected. Because `tokenizer.model` is present, it is used in preference to `tokenizer.json`, and no issue occurs (a sketch of this preference follows the output below):

```
$ ls -al /workspace/test-mistral/tokenizer.model
-rw-rw-r-- 1 quant quant 482K Dec 24 17:14 /workspace/test-mistral/tokenizer.model
$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/with-tokenizer.model.fp16.gguf
$ ./main -m /workspace/test-mistral/with-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
 A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the
```
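For clarity, the preference works roughly like the sketch below (a hypothetical mirror of `convert.py`'s behavior, not its actual code; the directory path is the one used above):

```python
from pathlib import Path

def pick_vocab_file(model_dir: str) -> Path:
    # Hypothetical mirror of convert.py's file preference: the
    # SentencePiece model wins whenever both tokenizer files exist,
    # so the tokenizer.json code path only runs when tokenizer.model
    # is absent (or renamed, as in the next step).
    d = Path(model_dir)
    if (d / "tokenizer.model").exists():
        return d / "tokenizer.model"
    if (d / "tokenizer.json").exists():
        return d / "tokenizer.json"
    raise FileNotFoundError(f"no tokenizer.model or tokenizer.json in {d}")

print(pick_vocab_file("/workspace/test-mistral"))  # -> .../tokenizer.model
```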
- Remove `tokenizer.model` to force `tokenizer.json` to be used, and re-run `convert.py`:

```
$ mv /workspace/test-mistral/tokenizer.model /workspace/test-mistral/dead.tokenizer.model
$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/no-tokenizer.model.fp16.gguf
```
- Test inference and note that `\n` is now represented as `<0x0A>` in the output (a check against `tokenizer.json` itself follows below):

```
$ ./main -m /workspace/test-mistral/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
 A haiku example is 5-7-5 syllables.<0x0A><0x0A>The first line has five syllables, the second line has seven syllables and the
```
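The piece text itself is correct; `tokenizer.json` knows that ID 13 is a byte-fallback token, as a quick check with the `tokenizers` library shows (run against the file downloaded above):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("/workspace/test-mistral/tokenizer.json")
print(tok.id_to_token(13))     # '<0x0A>' -- the byte-fallback piece for newline
print(repr(tok.decode([13])))  # "'\n'"   -- the ByteFallback decoder maps it back
```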
Testing the same input with Hugging Face transformers does not show the issue (note the two 13 tokens, which decode back to the two newlines):
```
In [1]: import os
   ...: from transformers import AutoTokenizer
   ...: print ("tokenizer.model exists:", os.path.exists("/workspace/test-mistral/tokenizer.model"))
   ...: tokenizer = AutoTokenizer.from_pretrained("/workspace/test-mistral/")
   ...: encoded = tokenizer(""" A haiku example is 5-7-5 syllables.
   ...:
   ...: The first line has five syllables, the second line has seven syllables and the""")
   ...: print(f"Tokens: {encoded.input_ids}")
   ...: print(f"Decoded again: '{tokenizer.decode(encoded.input_ids)}'")
tokenizer.model exists: False
Tokens: [1, 28705, 330, 3631, 23550, 2757, 349, 28705, 28782, 28733, 28787, 28733, 28782, 5747, 584, 2561, 28723, 13, 13, 1014, 907, 1407, 659, 3359, 5747, 584, 2561, 28725, 272, 1676, 1407, 659, 6671, 5747, 584, 2561, 304, 272]
Decoded again: '<s>  A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the'

In [2]: tokenizer.decode(13)
Out[2]: '\n'
```
Llama example
```
$ huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir test-llama2 --local-dir-use-symlinks False
$ mv test-llama2/tokenizer.model test-llama2/dead.tokenizer.model
$ python3 ./convert.py /workspace/test-llama2 --outtype f16 --outfile /workspace/test-llama2/no-tokenizer.model.fp16.gguf
$ ./main -m /workspace/test-llama2/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
A haiku example is 5 syllables, 7 syllables, and 5 syllables.<0x0A>A haiku is a traditional form of Japanese poetry that
```
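For reference, one guess at where the two conversion paths diverge: GGUF stores a per-token type, and llama.cpp only renders `<0xNN>` pieces as raw bytes when they are flagged as byte tokens. The sketch below shows the kind of classification a `tokenizer.json`-based converter would need to apply (it uses the `gguf` Python package's `TokenType` enum; this is a hypothesis, not a confirmed diagnosis):

```python
import re
import gguf  # the gguf Python package shipped with llama.cpp

def classify_piece(piece: str) -> gguf.TokenType:
    # Hypothetical classification for tokens read from tokenizer.json:
    # byte-fallback pieces must be flagged BYTE, otherwise llama.cpp
    # treats '<0x0A>' as ordinary text and prints it literally.
    if re.fullmatch(r"<0x[0-9A-Fa-f]{2}>", piece):
        return gguf.TokenType.BYTE
    if piece in ("<unk>", "<s>", "</s>"):
        return gguf.TokenType.CONTROL
    return gguf.TokenType.NORMAL

print(classify_piece("<0x0A>"))  # TokenType.BYTE
```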