llama : improve infill support and special token detection #9798
Conversation
(force-pushed from 74c539e to 61a66f2)
{ LLM_KV_TOKENIZER_FIM_PRE_ID, "tokenizer.ggml.fim_pre_token_id" },
{ LLM_KV_TOKENIZER_FIM_SUF_ID, "tokenizer.ggml.fim_suf_token_id" },
{ LLM_KV_TOKENIZER_FIM_MID_ID, "tokenizer.ggml.fim_mid_token_id" },
Changing the names of the metadata keys for these does not seem like a great idea: the tokens themselves are still the same and there are many GGUFs out there that use them, so changing this will only cause headaches...
I thought it would be OK because the convert script does not export FIM tokens for any model. Initially the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened and there is little hope that it will happen in the future.
The auto-detection of the FIM (and other special) tokens, although not perfect in theory, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.
Still, let me know if you think I'm missing something.
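(For illustration, a minimal sketch of what such text-piece based detection could look like; the helper, struct and token strings below are assumptions for the example, not the actual libllama code, and explicit GGUF KVs would presumably still take precedence when present.)

```cpp
// Sketch: identify FIM tokens by matching their text pieces while scanning the vocab.
#include <cstdint>
#include <string>
#include <vector>

struct fim_tokens {
    int32_t pre = -1;
    int32_t suf = -1;
    int32_t mid = -1;
};

// The token strings are assumptions based on common FIM vocabularies
// (CodeLlama-style <PRE>/<SUF>/<MID>, StarCoder/CodeGemma-style <|fim_*|>).
static fim_tokens detect_fim_tokens(const std::vector<std::string> & vocab) {
    fim_tokens res;
    for (int32_t id = 0; id < (int32_t) vocab.size(); ++id) {
        const std::string & piece = vocab[id];
        if (res.pre < 0 && (piece == "<PRE>" || piece == "<|fim_prefix|>")) { res.pre = id; }
        if (res.suf < 0 && (piece == "<SUF>" || piece == "<|fim_suffix|>")) { res.suf = id; }
        if (res.mid < 0 && (piece == "<MID>" || piece == "<|fim_middle|>")) { res.mid = id; }
    }
    return res; // ids stay -1 when the vocabulary has no matching piece
}
```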
> I thought it would be OK because the convert script does not export FIM tokens for any model. Initially the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened and there is little hope that it will happen in the future.
That's incorrect: the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.
> The auto-detection of the FIM (and other special) tokens, although not perfect in theory, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.
While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them: quite a few models either perform poorly or not at all (DeepSeek v1 instruct models, for example, will spew endless garbage) if you attempt to use them. In these cases it's usually much better to just drop the suffix, feed the model the prefix text and let it generate the rest (this is in fact what llama-cpp-python will do if there are no prefix/suffix tokens).
> Still, let me know if you think I'm missing something.
Fill-in-Middle is a tricky subject; even if you ignore the whole PSM/SPM aspect, just look at Yi-Coder-9B. Unfortunately, I don't think we can just assume it's possible to handle this automatically.
> While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them
That's fine; with this change we are not forcing the user to use (or not use) the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to the code completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.
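(As an example of that application-side decision, a sketch that gates the FIM path on all three core tokens being identified, assuming the new llama_token_fim_* accessors take the model pointer and return a negative id when the token is unknown:)

```cpp
#include "llama.h"

// Use FIM-style prompting only when the prefix/suffix/middle tokens are all known;
// otherwise fall back to plain prefix completion.
static bool model_supports_fim(const struct llama_model * model) {
    return llama_token_fim_pre(model) >= 0 &&
           llama_token_fim_suf(model) >= 0 &&
           llama_token_fim_mid(model) >= 0;
}
```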
> quite a few models either perform poorly or not at all (DeepSeek v1 instruct models, for example, will spew endless garbage)
Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.
> Fill-in-Middle is a tricky subject
Yes, I agree. My idea for these changes is the following (in the context of llama-server):

- If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
- If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error.
IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether there are models that do not support it.
> That's incorrect: the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.
I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is it not the case?
> That's fine; with this change we are not forcing the user to use (or not use) the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to the code completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.
Well, it shifts the burden (which may or may not be fine, but application maintainers will need to be made aware and adapt to this) from the model metadata defining which tokens are acceptable to use, to the user/application.
> quite a few models either perform poorly or not at all (DeepSeek v1 instruct models, for example, will spew endless garbage)
>
> Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.
The instruct model? IIRC I tested this and could only get the 33B one to give somewhat reasonable results. It could just have been other fine-tunes, however (like OpenCodeInterpreter)...
> Fill-in-Middle is a tricky subject
>
> Yes, I agree. My idea for these changes is the following (in the context of llama-server):
>
> - If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
> - If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error.
This approach does indeed seem reasonable; however, /infill is not in the OpenAI spec, so I don't think any other server implementations do it this way (and you can't get GitHub Copilot to use /infill)?
> IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether there are models that do not support it.
For sure, I use it daily; it is somewhat of a pet peeve of mine when GGUFs are made without the necessary metadata, so anything that can make this work more smoothly is definitely welcome! :)
> That's incorrect: the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.
>
> I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is it not the case?
Yes, though, ironically not DeepSeek because they use different tokens that you have not added. :)
With 3681540 we now keep the old PREFIX, SUFFIX and MIDDLE KVs as deprecated, so backwards compatibility should be OK.
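(A minimal sketch of that fallback on the reading side, assuming the old key names such as tokenizer.ggml.prefix_token_id and using a plain map as a stand-in for the parsed GGUF metadata:)

```cpp
#include <cstdint>
#include <map>
#include <string>

// Stand-in for parsed GGUF metadata: key name -> token id.
using gguf_kv_ids = std::map<std::string, int32_t>;

// Prefer the new fim_pre key, fall back to the deprecated prefix key so that
// older GGUFs keep working; return -1 when neither is present.
static int32_t fim_pre_id(const gguf_kv_ids & kv) {
    auto it = kv.find("tokenizer.ggml.fim_pre_token_id");
    if (it == kv.end()) {
        it = kv.find("tokenizer.ggml.prefix_token_id"); // deprecated KV
    }
    return it != kv.end() ? it->second : -1;
}
```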
> This approach does indeed seem reasonable; however, /infill is not in the OpenAI spec, so I don't think any other server implementations do it this way (and you can't get GitHub Copilot to use /infill)?
AFAIK, all servers that are GitHub Copilot compatible rely on proxying the requests from the Codex completion API. Proxying to the /infill endpoint should be trivial.
> Yes, though, ironically not DeepSeek because they use different tokens that you have not added
Should be added now.
> AFAIK, all servers that are GitHub Copilot compatible rely on proxying the requests from the Codex completion API. Proxying to the /infill endpoint should be trivial.
Not quite: /infill doesn't do streaming (and you would have to rewrite the parameters).
It actually supports streaming; the README is incorrect. I've fixed that in the upcoming #9787.
{ LLM_KV_TOKENIZER_FIM_PAD_ID, "tokenizer.ggml.fim_pad_token_id" },
{ LLM_KV_TOKENIZER_FIM_REP_ID, "tokenizer.ggml.fim_rep_token_id" },
{ LLM_KV_TOKENIZER_FIM_SEP_ID, "tokenizer.ggml.fim_sep_token_id" },
These need to be added to gguf-py/gguf/constants.py too.
(force-pushed from 32da4a2 to 3681540)
I've updated GGUF Editor with the new metadata and marked the old ones as deprecated.
…#9798)
* llama : improve infill support ggml-ci
* llama : add more FIM token strings ggml-ci
* server : update prompt on slot restore (ggerganov#9800)
* gguf : deprecate old FIM token KVs
Separating some of the changes from #9787 in this separate PR for clarity:

- libllama now auto-detects FIM-based tokens based on their text pieces
- Add support for the PAD, REPO and SEP FIM tokens
- Deprecate the llama_token_prefix, llama_token_suffix and llama_token_middle API in favor of the new llama_token_fim_ API
- llama-server now correctly tokenizes prefix and suffix prompts by not parsing special tokens (/infill endpoint)

API Changes
Deprecate
Add
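A hedged sketch of the declarations behind the Deprecate/Add lists above, with signatures assumed to mirror the existing model-pointer token getters (see llama.h for the authoritative definitions):

```cpp
// Deprecated (kept for backwards compatibility):
//   llama_token_prefix(), llama_token_suffix(), llama_token_middle()
//
// New accessors, as they might appear in llama.h:
LLAMA_API llama_token llama_token_fim_pre(const struct llama_model * model);
LLAMA_API llama_token llama_token_fim_suf(const struct llama_model * model);
LLAMA_API llama_token llama_token_fim_mid(const struct llama_model * model);
LLAMA_API llama_token llama_token_fim_pad(const struct llama_model * model);
LLAMA_API llama_token llama_token_fim_rep(const struct llama_model * model);
LLAMA_API llama_token llama_token_fim_sep(const struct llama_model * model);
```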
Example