
#5821 breaks convert.py for models with additional vocabulary, resulting in garbage output #6238

Closed
@froggeric

Description


#5821 changed how convert.py handles HF models with an additional vocabulary (i.e. models that need the --pad-vocab option), and the resulting conversion is unusable in llama.cpp and other clients such as LM Studio. I have tested this on multiple models, and others have reported the same issue. One example of such a model is WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5. Here is the error output:

./llama.cpp/main -m WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5/ggml-model-f16.gguf -n 128
(...)
**Error**
libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
zsh: abort      ./llama.cpp/main -m WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5/ggml-model-f16.gguf -n 128
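For context, --pad-vocab exists because some HF models declare a larger vocab_size in config.json than the tokenizer actually supplies, and the converter fills the gap with filler tokens; a lookup that then misses an expected token id aborts hard, as in the unordered_map::at exception above. A minimal sketch of the padding step, with the pad_vocab name and the exact <dummyNN> filler format as assumptions rather than verbatim convert.py code:

```python
def pad_vocab(tokens: list[str], target_size: int) -> list[str]:
    """Append filler tokens until the vocab reaches target_size.

    Sketch of the --pad-vocab idea: config.json may declare a larger
    vocab_size than the tokenizer provides, so the converter pads the
    gap with dummy entries (filler naming here is hypothetical).
    """
    assert target_size >= len(tokens), "cannot shrink a vocab by padding"
    return tokens + [f"<dummy{i:05}>" for i in range(len(tokens), target_size)]
```

If the padded conversion then disagrees with what the runtime expects, any token-id lookup into the mismatched table fails, which is consistent with the std::out_of_range abort shown above.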

This happens whether --pad-vocab is used alone or as --pad-vocab --vocab-type hfft during conversion. The other vocab types, bpe and spm, result in a failed conversion.

With --pad-vocab --vocab-type bpe:

Found vocab files: {'spm': None, 'bpe': None, 'hfft': PosixPath('WhiteRabbitNeo-33B-v1.5/tokenizer.json')}
Traceback (most recent call last):
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1479, in <module>
    main()
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1447, in main
    vocab, special_vocab = vocab_factory.load_vocab(args.vocab_type.split(","), model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1323, in load_vocab
    vocab_type, path = self._select_file(vocab_types)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1310, in _select_file
    raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
FileNotFoundError: Could not find any of ['vocab.json']

With --pad-vocab --vocab-type spm:

Found vocab files: {'spm': None, 'bpe': None, 'hfft': PosixPath('WhiteRabbitNeo-33B-v1.5/tokenizer.json')}
Traceback (most recent call last):
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1479, in <module>
    main()
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1447, in main
    vocab, special_vocab = vocab_factory.load_vocab(args.vocab_type.split(","), model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1323, in load_vocab
    vocab_type, path = self._select_file(vocab_types)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/ssd/ai/llm-dev/llama.cpp/convert-4d4d236.py", line 1310, in _select_file
    raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
FileNotFoundError: Could not find any of ['tokenizer.model']
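Judging from the two tracebacks, the new vocab factory maps each --vocab-type to a single fixed filename and raises when that file is absent, which explains why bpe and spm abort for a model that ships only tokenizer.json. A minimal sketch of that selection logic (the function and constant names are illustrative, not copied from convert.py):

```python
from pathlib import Path

# Filename each vocab type is looked up under, per the tracebacks above:
# spm wants tokenizer.model, bpe wants vocab.json, hfft wants tokenizer.json.
VOCAB_FILES = {"spm": "tokenizer.model", "bpe": "vocab.json", "hfft": "tokenizer.json"}

def select_vocab_file(model_dir: Path, vocab_types: list[str]) -> tuple[str, Path]:
    """Return the first (vocab_type, path) whose file exists in model_dir."""
    for vt in vocab_types:
        candidate = model_dir / VOCAB_FILES[vt]
        if candidate.exists():
            return vt, candidate
    # Mirrors the FileNotFoundError seen in both failing runs above.
    raise FileNotFoundError(f"Could not find any of {[VOCAB_FILES[vt] for vt in vocab_types]}")
```

A model directory containing only tokenizer.json therefore satisfies hfft but makes both --vocab-type bpe and --vocab-type spm raise, matching the two failures shown.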

Previously, the required options were --pad-vocab --vocab-type bpe, and everything worked as expected. I have tested previous versions, and all of them up to this change work. The last working version of convert.py is aa23412.

The changes that break it are in 4d4d236.
