Description
#5821 changed how models with an added vocabulary are converted from HF: when the `--pad-vocab` option is needed, the resulting conversion is unusable in llama.cpp and in other clients such as LM Studio. I have tested this on multiple models, and others have reported the same. One example of such a model is WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5. Here is the error output:
```sh
./llama.cpp/main -m WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5/ggml-model-f16.gguf -n 128
(...)
```

**Error**

```
libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
zsh: abort      ./llama.cpp/main -m WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5/ggml-model-f16.gguf -n 128
```
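The exact conversion invocation is not shown above; a typical one, assuming the stock `convert.py` entry point and its default output path, looks like this:

```sh
# Hypothetical repro: convert the HF checkpoint with vocab padding enabled.
# The output defaults to ggml-model-f16.gguf inside the model directory.
python llama.cpp/convert.py WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5 --outtype f16 --pad-vocab
```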
This happens both when `--pad-vocab` is used alone and when `--pad-vocab --vocab-type hfft` is used during conversion. The other vocab types, `bpe` and `spm`, make the conversion itself fail.

With `--pad-vocab --vocab-type bpe`:
```
Found vocab files: {'spm': None, 'bpe': None, 'hfft': PosixPath('WhiteRabbitNeo-33B-v1.5/tokenizer.json')}
Traceback (most recent call last):
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1479, in <module>
    main()
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1447, in main
    vocab, special_vocab = vocab_factory.load_vocab(args.vocab_type.split(","), model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1323, in load_vocab
    vocab_type, path = self._select_file(vocab_types)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1310, in _select_file
    raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
FileNotFoundError: Could not find any of ['vocab.json']
```
With `--pad-vocab --vocab-type spm`:
```
Found vocab files: {'spm': None, 'bpe': None, 'hfft': PosixPath('WhiteRabbitNeo-33B-v1.5/tokenizer.json')}
Traceback (most recent call last):
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1479, in <module>
    main()
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1447, in main
    vocab, special_vocab = vocab_factory.load_vocab(args.vocab_type.split(","), model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1323, in load_vocab
    vocab_type, path = self._select_file(vocab_types)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1310, in _select_file
    raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
FileNotFoundError: Could not find any of ['tokenizer.model']
```
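Both failures come from the same place: `_select_file` now looks only for the one file associated with the requested vocab type, while this model ships only tokenizer.json. A simplified sketch of the selection logic as implied by the traceback (names are taken from the traceback; the exact `_FILES` mapping beyond `'vocab.json'` and `'tokenizer.model'` is an assumption):

```python
from pathlib import Path

# Assumed mapping; the traceback only confirms 'vocab.json' for bpe
# and 'tokenizer.model' for spm.
_FILES = {"spm": "tokenizer.model", "bpe": "vocab.json", "hfft": "tokenizer.json"}

def _select_file(model_dir: Path, vocab_types: list[str]) -> tuple[str, Path]:
    # Return the first vocab file that exists for the requested types.
    for vt in vocab_types:
        path = model_dir / _FILES[vt]
        if path.exists():
            return vt, path
    # With vocab_types=['bpe'] or ['spm'], this is the error seen above.
    raise FileNotFoundError(f"Could not find any of {[_FILES[vt] for vt in vocab_types]}")
```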
Previously, the required options were `--pad-vocab --vocab-type bpe`, and everything worked as expected. I have tested earlier versions, and all of them up to this change work. The last working version of `convert.py` is aa23412; the changes that break it are in 4d4d236.
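Until this is fixed, one stopgap (assuming a local git checkout of llama.cpp) is to restore `convert.py` from the last working commit and convert with the old options:

```sh
# Stopgap sketch: restore convert.py from the last known-good commit (aa23412)
cd llama.cpp
git checkout aa23412 -- convert.py
python convert.py /path/to/WhiteRabbitNeo-33B-v1.5 --outtype f16 --pad-vocab --vocab-type bpe
```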