Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
Provide support for phi-2. Running the following yields an error:
python -c "
from huggingface_hub import snapshot_download;
snapshot_download(repo_id='microsoft/phi-2', local_dir='phi-2', local_dir_use_symlinks=False)
"
python convert-hf-to-gguf.py phi-2/ --outtype f16
Error:
```
Traceback (most recent call last):
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 75, in set_vocab
    self._set_vocab_gpt2()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
```
Phi-2 uses `CodeGenTokenizer`, which is a BPE tokenizer.
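For reference, here is a quick way to confirm which tokenizer class phi-2 resolves to; a minimal sketch, assuming `transformers` is installed (the exact class name and whether `trust_remote_code` is needed may vary between versions):

```python
# Minimal check of the tokenizer class that microsoft/phi-2 resolves to.
# Assumes `transformers` is installed; whether trust_remote_code is required
# depends on the transformers version.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
print(type(tokenizer).__name__)  # expected: CodeGenTokenizer or CodeGenTokenizerFast
print(tokenizer.is_fast)         # True when backed by the fast BPE implementation
```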
I'm not sure if it is as simple as adding the following line here:

```python
{ "name": "phi-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2" },
```
Edit: I tried that; this is the generated hash check:
```python
if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    # ref: https://huggingface.co/microsoft/phi-2
    res = "phi-2"
```