llama : improve infill support and special token detection #9798

Merged
ggerganov merged 4 commits into master from gg/infill-0 on Oct 12, 2024

Conversation

ggerganov (Owner) commented Oct 9, 2024

Separating some of the changes from #9787 into this PR for clarity:

  • libllama now auto-detects FIM tokens based on their text pieces
  • add PAD, REPO and SEP FIM tokens
  • deprecate old llama_token_prefix, llama_token_suffix and llama_token_middle API in favor of new llama_token_fim_ API
  • llama-server now correctly tokenizes prefix and suffix prompts by not parsing special tokens (/infill endpoint)

API Changes

  • Deprecate

    llama_token llama_token_prefix(const struct llama_model * model);
    llama_token llama_token_middle(const struct llama_model * model);
    llama_token llama_token_suffix(const struct llama_model * model);
  • Add

    llama_token llama_token_fim_pre(const struct llama_model * model);
    llama_token llama_token_fim_suf(const struct llama_model * model);
    llama_token llama_token_fim_mid(const struct llama_model * model);
    llama_token llama_token_fim_pad(const struct llama_model * model);
    llama_token llama_token_fim_rep(const struct llama_model * model);
    llama_token llama_token_fim_sep(const struct llama_model * model);
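
A minimal usage sketch (illustrative only, not part of this PR) of how call sites migrate from the deprecated accessors to the new ones. It assumes a negative token id means the token is not defined for the loaded model:

    // Sketch: the deprecated accessors map 1:1 onto the first three new ones;
    // PAD/REP/SEP have no deprecated counterpart.
    // Assumption: a negative id means the model does not define that token.
    #include "llama.h"
    #include <cstdio>

    static void print_fim_tokens(const struct llama_model * model) {
        const llama_token fim_pre = llama_token_fim_pre(model); // was llama_token_prefix
        const llama_token fim_suf = llama_token_fim_suf(model); // was llama_token_suffix
        const llama_token fim_mid = llama_token_fim_mid(model); // was llama_token_middle
        const llama_token fim_pad = llama_token_fim_pad(model); // new
        const llama_token fim_rep = llama_token_fim_rep(model); // new
        const llama_token fim_sep = llama_token_fim_sep(model); // new

        printf("FIM tokens: pre=%d suf=%d mid=%d pad=%d rep=%d sep=%d\n",
               fim_pre, fim_suf, fim_mid, fim_pad, fim_rep, fim_sep);
    }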

Example

# serve FIM-compatible model
./llama-server \
    -m ./models/qwen2.5-7b-coder/ggml-model-q8_0.gguf \
    --port 8012 --ctx-size 32768 -ngl 99 -np 1 -fa

# sample FIM request
curl \
    --silent --no-buffer --request POST --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_prefix": "#include <cstdio>\n\nint main() {\n    printf(", "input_suffix": ");\n    return 0;\n}\n", "temperature": 0.1, "n_predict": 64, "stream": false, "top_k": 5, "prompt": ""}' | jq
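
To make the last bullet in the summary above concrete, here is a rough sketch (not the actual llama-server code; the helpers are written only for illustration) of how an /infill-style prompt is typically assembled: the prefix and suffix text are tokenized with parse_special = false, so literal text in the user's code can never be interpreted as special tokens, and the FIM tokens are added explicitly around them.

    // Sketch only: PSM-ordered infill prompt built from the new FIM accessors.
    #include "llama.h"
    #include <string>
    #include <vector>

    // tokenize without adding BOS/EOS and without parsing special tokens
    static std::vector<llama_token> tokenize_plain(const struct llama_model * model, const std::string & text) {
        std::vector<llama_token> tokens(text.size() + 8); // rough upper bound on token count
        int n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                               tokens.data(), (int32_t) tokens.size(),
                               /*add_special=*/false, /*parse_special=*/false);
        if (n < 0) {
            // buffer was too small; -n is the required size
            tokens.resize(-n);
            n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                               tokens.data(), (int32_t) tokens.size(),
                               /*add_special=*/false, /*parse_special=*/false);
        }
        tokens.resize(n > 0 ? n : 0);
        return tokens;
    }

    static std::vector<llama_token> build_infill_prompt(const struct llama_model * model,
                                                        const std::string & input_prefix,
                                                        const std::string & input_suffix) {
        std::vector<llama_token> out;
        out.push_back(llama_token_fim_pre(model));
        const auto pre = tokenize_plain(model, input_prefix);
        out.insert(out.end(), pre.begin(), pre.end());
        out.push_back(llama_token_fim_suf(model));
        const auto suf = tokenize_plain(model, input_suffix);
        out.insert(out.end(), suf.begin(), suf.end());
        out.push_back(llama_token_fim_mid(model)); // generation continues from here
        return out;
    }

Note that this shows the PSM ordering; as the discussion below points out, some models expect SPM ordering instead.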

ggerganov force-pushed the gg/infill-0 branch 2 times, most recently from 74c539e to 61a66f2, on October 9, 2024
ggerganov mentioned this pull request Oct 9, 2024
Comment on lines +477 to +480
{ LLM_KV_TOKENIZER_FIM_PRE_ID, "tokenizer.ggml.fim_pre_token_id" },
{ LLM_KV_TOKENIZER_FIM_SUF_ID, "tokenizer.ggml.fim_suf_token_id" },
{ LLM_KV_TOKENIZER_FIM_MID_ID, "tokenizer.ggml.fim_mid_token_id" },
CISC (Contributor):
Changing the names of the metadata keys for these does not seem like a great idea; the tokens themselves are still the same and there are many GGUFs out there that use them, so changing this will only cause headaches...

ggerganov (Owner, Author):

I thought it would be OK because the convert script does not export FIM tokens for any model. Initially the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened and there is little hope that it will in the future.

The auto-detection of the FIM (and other special) tokens, although in theory not perfect, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.
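
For illustration, the piece-based detection amounts to roughly the following (a simplified, self-contained sketch; the real implementation lives in the vocab-loading code, and the piece strings listed are abbreviated examples, not the full list):

    // Sketch: scan the vocabulary pieces and match known text forms of a FIM role.
    #include <cstdint>
    #include <string>
    #include <vector>

    static int32_t find_fim_token(const std::vector<std::string> & vocab_pieces,   // token id -> text piece
                                  const std::vector<std::string> & known_pieces) { // pieces used for this FIM role
        for (int32_t id = 0; id < (int32_t) vocab_pieces.size(); ++id) {
            for (const auto & piece : known_pieces) {
                if (vocab_pieces[id] == piece) {
                    return id;
                }
            }
        }
        return -1; // not found
    }

    // e.g. detecting the FIM prefix token (abbreviated, illustrative piece list):
    //   int32_t fim_pre = find_fim_token(pieces, { "<|fim_prefix|>", "<fim-prefix>", "<PRE>" });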

Still, let me know if you think I'm missing something.

CISC (Contributor):

> I thought it would be OK because the convert script does not export FIM tokens for any model. Initially the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened and there is little hope that it will in the future.

That's incorrect; the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.

> The auto-detection of the FIM (and other special) tokens, although in theory not perfect, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.

While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them: quite a few models either perform poorly or not at all if you attempt to (DeepSeek v1 instruct models, for example, will spew endless garbage). In those cases it's usually much better to just drop the suffix, feed the model the prefix text, and let it generate the rest (this is in fact what llama-cpp-python does when there are no prefix/suffix tokens).

> Still, let me know if you think I'm missing something.

Fill-in-Middle is a tricky subject; even if you ignore the whole PSM/SPM aspect, just look at Yi-Coder-9B. Unfortunately, I don't think we can just assume it's possible to handle this automatically.

ggerganov (Owner, Author):

> While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them

That's fine; with this change we are not forcing the user to use the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to code completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.

> quite a few models either perform poorly or not at all if you attempt to (DeepSeek v1 instruct models, for example, will spew endless garbage)

Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.

> Fill-in-Middle is a tricky subject

Yes, I agree. My idea for these changes is the following (in the context of llama-server):

  • If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
  • If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error (see the sketch below).

IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether some models do not support it.
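
A rough sketch of the check described in the second bullet above (illustrative, not the actual server code; it assumes a negative id means the token is not defined for the model):

    // Sketch: the /infill path refuses the request unless the model exposes the
    // FIM tokens needed to build a prefix/suffix prompt.
    #include "llama.h"

    static bool model_supports_infill(const struct llama_model * model) {
        return llama_token_fim_pre(model) >= 0 &&
               llama_token_fim_suf(model) >= 0 &&
               llama_token_fim_mid(model) >= 0;
    }

    // in the /infill request handler (pseudocode):
    //   if (!model_supports_infill(model)) { /* return an error to the client */ }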

> That's incorrect; the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.

I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is that not the case?

CISC (Contributor):

> That's fine; with this change we are not forcing the user to use the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to code completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.

Well, it shifts the burden of deciding which tokens are acceptable to use from the model metadata to the user/application (which may or may not be fine, but application maintainers will need to be made aware of this and adapt).

> > quite a few models either perform poorly or not at all if you attempt to (DeepSeek v1 instruct models, for example, will spew endless garbage)
>
> Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.

The instruct model? IIRC I tested this and could only get the 33B one to give somewhat reasonable results. It could just have been other fine-tunes however (like OpenCodeInterpreter)...

> > Fill-in-Middle is a tricky subject
>
> Yes, I agree. My idea for these changes is the following (in the context of llama-server):
>
>   • If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
>   • If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error.

This approach does indeed seem reasonable; however, /infill is not in the OpenAI spec, so I don't think any other server implementations do it this way (and you can't get GitHub Copilot to use /infill)?

> IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether some models do not support it.

For sure, I use it daily. It is somewhat of a pet peeve of mine when GGUFs are made without the necessary metadata, so anything that can make this work more smoothly is definitely welcome! :)

> > That's incorrect; the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.
>
> I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is that not the case?

Yes, though ironically not DeepSeek, because they use different tokens that you have not added. :)

ggerganov (Owner, Author):

With 3681540 we now keep the old PREFIX, SUFFIX and MIDDLE KVs as deprecated, so backwards compatibility should be OK.

> This approach does indeed seem reasonable; however, /infill is not in the OpenAI spec, so I don't think any other server implementations do it this way (and you can't get GitHub Copilot to use /infill)?

AFAIK, all servers that are GitHub Copilot compatible rely on proxying the requests from the Codex completion API. Proxying to the /infill endpoint should be trivial.

> Yes, though ironically not DeepSeek, because they use different tokens that you have not added

Should be added now.

CISC (Contributor):

> AFAIK, all servers that are GitHub Copilot compatible rely on proxying the requests from the Codex completion API. Proxying to the /infill endpoint should be trivial.

Not quite; /infill doesn't do streaming (and you would have to rewrite the parameters).

ggerganov (Owner, Author):

It actually supports streaming; the README is incorrect. I've fixed that in the upcoming #9787.

Comment on lines +480 to +483
{ LLM_KV_TOKENIZER_FIM_PAD_ID, "tokenizer.ggml.fim_pad_token_id" },
{ LLM_KV_TOKENIZER_FIM_REP_ID, "tokenizer.ggml.fim_rep_token_id" },
{ LLM_KV_TOKENIZER_FIM_SEP_ID, "tokenizer.ggml.fim_sep_token_id" },
CISC (Contributor):

These need to be added to gguf-py/gguf/constants.py too.

github-actions bot added the python (python script changes) label on Oct 11, 2024
ggerganov changed the title from "llama : improve infill support" to "llama : improve infill support and special token detection" on Oct 12, 2024
ggerganov merged commit 11ac980 into master on Oct 12, 2024
56 checks passed
ggerganov deleted the gg/infill-0 branch on October 12, 2024 05:21
CISC (Contributor) commented Oct 12, 2024

I've updated GGUF Editor with the new metadata and marked the old ones as deprecated.

drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
…#9798)

* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (ggerganov#9800)

* gguf : deprecate old FIM token KVs
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels: examples, python (python script changes), server