
#5821 breaks convert.py for models with additional vocabulary, resulting in garbage output #6238

Closed
@froggeric

Description


#5821 changed how convert.py handles HF models with an additional vocabulary (i.e. models that need the --pad-vocab option), and the resulting conversion is unusable in llama.cpp and other clients such as LM Studio. I have tested this on multiple models, and others have reported the same issue. One example of such a model is WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5. Here is the error output:

./llama.cpp/main -m WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5/ggml-model-f16.gguf -n 128
(...)
**Error**
libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
zsh: abort      ./llama.cpp/main -m WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5/ggml-model-f16.gguf -n 128
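For context, --pad-vocab exists because some HF models declare a larger vocab_size in config.json than the tokenizer actually supplies, and the converter fills the gap with filler tokens; a lookup that then misses an expected token id aborts hard, as in the unordered_map::at exception above. A minimal sketch of the padding step, with the pad_vocab name and the exact <dummyNN> filler format as assumptions rather than verbatim convert.py code:

```python
def pad_vocab(tokens: list[str], target_size: int) -> list[str]:
    """Append filler tokens until the vocab reaches target_size.

    Sketch of the --pad-vocab idea: config.json may declare a larger
    vocab_size than the tokenizer provides, so the converter pads the
    gap with dummy entries (filler naming here is hypothetical).
    """
    assert target_size >= len(tokens), "cannot shrink a vocab by padding"
    return tokens + [f"<dummy{i:05}>" for i in range(len(tokens), target_size)]
```

If the padded conversion then disagrees with what the runtime expects, any token-id lookup into the mismatched table fails, which is consistent with the std::out_of_range abort shown above.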

This happens whether --pad-vocab is used alone or as --pad-vocab --vocab-type hfft during conversion. The other vocab types, bpe and spm, result in a failed conversion.

With --pad-vocab --vocab-type bpe:

Found vocab files: {'spm': None, 'bpe': None, 'hfft': PosixPath('WhiteRabbitNeo-33B-v1.5/tokenizer.json')}
Traceback (most recent call last):
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1479, in <module>
    main()
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1447, in main
    vocab, special_vocab = vocab_factory.load_vocab(args.vocab_type.split(","), model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1323, in load_vocab
    vocab_type, path = self._select_file(vocab_types)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1310, in _select_file
    raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
FileNotFoundError: Could not find any of ['vocab.json']

With --pad-vocab --vocab-type spm:

Found vocab files: {'spm': None, 'bpe': None, 'hfft': PosixPath('WhiteRabbitNeo-33B-v1.5/tokenizer.json')}
Traceback (most recent call last):
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1479, in <module>
    main()
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1447, in main
    vocab, special_vocab = vocab_factory.load_vocab(args.vocab_type.split(","), model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1323, in load_vocab
    vocab_type, path = self._select_file(vocab_types)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1310, in _select_file
    raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
FileNotFoundError: Could not find any of ['tokenizer.model']
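Judging from the two tracebacks, the new vocab factory maps each --vocab-type to a single fixed filename and raises when that file is absent, which explains why bpe and spm abort for a model that ships only tokenizer.json. A minimal sketch of that selection logic (the function and constant names are illustrative, not copied from convert.py):

```python
from pathlib import Path

# Filename each vocab type is looked up under, per the tracebacks above:
# spm wants tokenizer.model, bpe wants vocab.json, hfft wants tokenizer.json.
VOCAB_FILES = {"spm": "tokenizer.model", "bpe": "vocab.json", "hfft": "tokenizer.json"}

def select_vocab_file(model_dir: Path, vocab_types: list[str]) -> tuple[str, Path]:
    """Return the first (vocab_type, path) whose file exists in model_dir."""
    for vt in vocab_types:
        candidate = model_dir / VOCAB_FILES[vt]
        if candidate.exists():
            return vt, candidate
    # Mirrors the FileNotFoundError seen in both failing runs above.
    raise FileNotFoundError(f"Could not find any of {[VOCAB_FILES[vt] for vt in vocab_types]}")
```

A model directory containing only tokenizer.json therefore satisfies hfft but makes both --vocab-type bpe and --vocab-type spm raise, matching the two failures shown.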

Previously, the required options were --pad-vocab --vocab-type bpe, and everything worked as expected. I have tested previous versions, and all of them up to this change work. The last working version of convert.py is aa23412.

The changes that break it are in 4d4d236.
