llama : improve infill support and special token detection #9798

Merged
ggerganov merged 4 commits into master from gg/infill-0 on Oct 12, 2024

Conversation

ggerganov (Owner) commented Oct 9, 2024

Separating some of the changes from #9787 into this PR for clarity:

  • libllama now auto-detects FIM tokens based on their text pieces
  • add PAD, REPO and SEP FIM tokens
  • deprecate old llama_token_prefix, llama_token_suffix and llama_token_middle API in favor of new llama_token_fim_ API
  • llama-server now correctly tokenizes prefix and suffix prompts by not parsing special tokens (/infill endpoint)

API Changes

  • Deprecate

    llama_token llama_token_prefix(const struct llama_model * model);
    llama_token llama_token_middle(const struct llama_model * model);
    llama_token llama_token_suffix(const struct llama_model * model);
  • Add

    llama_token llama_token_fim_pre(const struct llama_model * model);
    llama_token llama_token_fim_suf(const struct llama_model * model);
    llama_token llama_token_fim_mid(const struct llama_model * model);
    llama_token llama_token_fim_pad(const struct llama_model * model);
    llama_token llama_token_fim_rep(const struct llama_model * model);
    llama_token llama_token_fim_sep(const struct llama_model * model);
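
A minimal usage sketch (illustrative only, not part of this PR) of how call sites migrate from the deprecated accessors to the new ones. It assumes a negative token id means the token is not defined for the loaded model:

    // Sketch: the deprecated accessors map 1:1 onto the first three new ones;
    // PAD/REP/SEP have no deprecated counterpart.
    // Assumption: a negative id means the model does not define that token.
    #include "llama.h"
    #include <cstdio>

    static void print_fim_tokens(const struct llama_model * model) {
        const llama_token fim_pre = llama_token_fim_pre(model); // was llama_token_prefix
        const llama_token fim_suf = llama_token_fim_suf(model); // was llama_token_suffix
        const llama_token fim_mid = llama_token_fim_mid(model); // was llama_token_middle
        const llama_token fim_pad = llama_token_fim_pad(model); // new
        const llama_token fim_rep = llama_token_fim_rep(model); // new
        const llama_token fim_sep = llama_token_fim_sep(model); // new

        printf("FIM tokens: pre=%d suf=%d mid=%d pad=%d rep=%d sep=%d\n",
               fim_pre, fim_suf, fim_mid, fim_pad, fim_rep, fim_sep);
    }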

Example

# serve FIM-compatible model
./llama-server \
    -m ./models/qwen2.5-7b-coder/ggml-model-q8_0.gguf \
    --port 8012 --ctx-size 32768 -ngl 99 -np 1 -fa

# sample FIM request
curl \
    --silent --no-buffer --request POST --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_prefix": "#include <cstdio>\n\nint main() {\n    printf(", "input_suffix": ");\n    return 0;\n}\n", "temperature": 0.1, "n_predict": 64, "stream": false, "top_k": 5, "prompt": ""}' | jq
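
To make the last bullet in the summary above concrete, here is a rough sketch (not the actual llama-server code; the helpers are written only for illustration) of how an /infill-style prompt is typically assembled: the prefix and suffix text are tokenized with parse_special = false, so literal text in the user's code can never be interpreted as special tokens, and the FIM tokens are added explicitly around them.

    // Sketch only: PSM-ordered infill prompt built from the new FIM accessors.
    #include "llama.h"
    #include <string>
    #include <vector>

    // tokenize without adding BOS/EOS and without parsing special tokens
    static std::vector<llama_token> tokenize_plain(const struct llama_model * model, const std::string & text) {
        std::vector<llama_token> tokens(text.size() + 8); // rough upper bound on token count
        int n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                               tokens.data(), (int32_t) tokens.size(),
                               /*add_special=*/false, /*parse_special=*/false);
        if (n < 0) {
            // buffer was too small; -n is the required size
            tokens.resize(-n);
            n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                               tokens.data(), (int32_t) tokens.size(),
                               /*add_special=*/false, /*parse_special=*/false);
        }
        tokens.resize(n > 0 ? n : 0);
        return tokens;
    }

    static std::vector<llama_token> build_infill_prompt(const struct llama_model * model,
                                                        const std::string & input_prefix,
                                                        const std::string & input_suffix) {
        std::vector<llama_token> out;
        out.push_back(llama_token_fim_pre(model));
        const auto pre = tokenize_plain(model, input_prefix);
        out.insert(out.end(), pre.begin(), pre.end());
        out.push_back(llama_token_fim_suf(model));
        const auto suf = tokenize_plain(model, input_suffix);
        out.insert(out.end(), suf.begin(), suf.end());
        out.push_back(llama_token_fim_mid(model)); // generation continues from here
        return out;
    }

Note that this shows the PSM ordering; as the discussion below points out, some models expect SPM ordering instead.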

ggerganov force-pushed the gg/infill-0 branch 2 times, most recently from 74c539e to 61a66f2, on October 9, 2024
ggerganov mentioned this pull request Oct 9, 2024
Comment on lines +477 to +480
{ LLM_KV_TOKENIZER_FIM_PRE_ID, "tokenizer.ggml.fim_pre_token_id" },
{ LLM_KV_TOKENIZER_FIM_SUF_ID, "tokenizer.ggml.fim_suf_token_id" },
{ LLM_KV_TOKENIZER_FIM_MID_ID, "tokenizer.ggml.fim_mid_token_id" },
CISC (Contributor):
Changing the names of the metadata keys for these does not seem like a great idea; the tokens themselves are still the same and there are many GGUFs out there that use them, so changing this will only cause headaches...

ggerganov (Owner, Author):

I thought it would be OK because the convert script does not export FIM tokens for any model. Initially the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened and there is little hope that it will in the future.

The auto-detection of the FIM (and other special) tokens, although in theory not perfect, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.
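
For illustration, the piece-based detection amounts to roughly the following (a simplified, self-contained sketch; the real implementation lives in the vocab-loading code, and the piece strings listed are abbreviated examples, not the full list):

    // Sketch: scan the vocabulary pieces and match known text forms of a FIM role.
    #include <cstdint>
    #include <string>
    #include <vector>

    static int32_t find_fim_token(const std::vector<std::string> & vocab_pieces,   // token id -> text piece
                                  const std::vector<std::string> & known_pieces) { // pieces used for this FIM role
        for (int32_t id = 0; id < (int32_t) vocab_pieces.size(); ++id) {
            for (const auto & piece : known_pieces) {
                if (vocab_pieces[id] == piece) {
                    return id;
                }
            }
        }
        return -1; // not found
    }

    // e.g. detecting the FIM prefix token (abbreviated, illustrative piece list):
    //   int32_t fim_pre = find_fim_token(pieces, { "<|fim_prefix|>", "<fim-prefix>", "<PRE>" });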

Still, let me know if you think I'm missing something.

CISC (Contributor):

> I thought it would be OK because the convert script does not export FIM tokens for any model. Initially the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened and there is little hope that it will in the future.

That's incorrect; the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.

> The auto-detection of the FIM (and other special) tokens, although in theory not perfect, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.

While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them: quite a few models either perform poorly or not at all if you attempt to (DeepSeek v1 instruct models, for example, will spew endless garbage). In those cases it's usually much better to just drop the suffix, feed the model the prefix text, and let it generate the rest (this is in fact what llama-cpp-python does when there are no prefix/suffix tokens).

> Still, let me know if you think I'm missing something.

Fill-in-Middle is a tricky subject; even if you ignore the whole PSM/SPM aspect, just look at Yi-Coder-9B. Unfortunately, I don't think we can just assume it's possible to handle this automatically.

ggerganov (Owner, Author):

> While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them

That's fine; with this change we are not forcing the user to use the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to code completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.

> quite a few models either perform poorly or not at all if you attempt to (DeepSeek v1 instruct models, for example, will spew endless garbage)

Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.

> Fill-in-Middle is a tricky subject

Yes, I agree. My idea for these changes is the following (in the context of llama-server):

  • If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
  • If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error (see the sketch below).

IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether some models do not support it.
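
A rough sketch of the check described in the second bullet above (illustrative, not the actual server code; it assumes a negative id means the token is not defined for the model):

    // Sketch: the /infill path refuses the request unless the model exposes the
    // FIM tokens needed to build a prefix/suffix prompt.
    #include "llama.h"

    static bool model_supports_infill(const struct llama_model * model) {
        return llama_token_fim_pre(model) >= 0 &&
               llama_token_fim_suf(model) >= 0 &&
               llama_token_fim_mid(model) >= 0;
    }

    // in the /infill request handler (pseudocode):
    //   if (!model_supports_infill(model)) { /* return an error to the client */ }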

> That's incorrect; the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.

I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is that not the case?

CISC (Contributor):

> That's fine; with this change we are not forcing the user to use the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to code completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.

Well, it shifts the burden of deciding which tokens are acceptable to use from the model metadata to the user/application (which may or may not be fine, but application maintainers will need to be made aware of this and adapt).

> > quite a few models either perform poorly or not at all if you attempt to (DeepSeek v1 instruct models, for example, will spew endless garbage)
>
> Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.

The instruct model? IIRC I tested this and could only get the 33B one to give somewhat reasonable results. It could just have been other fine-tunes however (like OpenCodeInterpreter)...

> > Fill-in-Middle is a tricky subject
>
> Yes, I agree. My idea for these changes is the following (in the context of llama-server):
>
>   • If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
>   • If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error.

This approach does indeed seem reasonable; however, /infill is not in the OpenAI spec, so I don't think any other server implementations do it this way (and you can't get GitHub Copilot to use /infill)?

> IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether some models do not support it.

For sure, I use it daily. It is somewhat of a pet peeve of mine when GGUFs are made without the necessary metadata, so anything that can make this work more smoothly is definitely welcome! :)

> > That's incorrect; the conversion script has been adding these for Refact, CodeGemma and CodeLlama based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.
>
> I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is that not the case?

Yes, though ironically not DeepSeek, because they use different tokens that you have not added. :)

ggerganov (Owner, Author):

With 3681540 we now keep the old PREFIX, SUFFIX and MIDDLE KVs as deprecated, so backwards compatibility should be OK.

> This approach does indeed seem reasonable; however, /infill is not in the OpenAI spec, so I don't think any other server implementations do it this way (and you can't get GitHub Copilot to use /infill)?

AFAIK, all servers that are GitHub Copilot compatible rely on proxying the requests from the Codex completion API. Proxying to the /infill endpoint should be trivial.

> Yes, though ironically not DeepSeek, because they use different tokens that you have not added

Should be added now.

CISC (Contributor):

> AFAIK, all servers that are GitHub Copilot compatible rely on proxying the requests from the Codex completion API. Proxying to the /infill endpoint should be trivial.

Not quite; /infill doesn't do streaming (and you would have to rewrite the parameters).

ggerganov (Owner, Author):

It actually supports streaming; the README is incorrect. I've fixed that in the upcoming #9787.

Comment on lines +480 to +483
{ LLM_KV_TOKENIZER_FIM_PAD_ID, "tokenizer.ggml.fim_pad_token_id" },
{ LLM_KV_TOKENIZER_FIM_REP_ID, "tokenizer.ggml.fim_rep_token_id" },
{ LLM_KV_TOKENIZER_FIM_SEP_ID, "tokenizer.ggml.fim_sep_token_id" },
CISC (Contributor):

These need to be added to gguf-py/gguf/constants.py too.

github-actions bot added the python (python script changes) label on Oct 11, 2024
ggerganov changed the title from "llama : improve infill support" to "llama : improve infill support and special token detection" on Oct 12, 2024
ggerganov merged commit 11ac980 into master on Oct 12, 2024
56 checks passed
ggerganov deleted the gg/infill-0 branch on October 12, 2024 05:21
CISC (Contributor) commented Oct 12, 2024

I've updated GGUF Editor with the new metadata and marked the old ones as deprecated.

drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
…#9798)

* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (ggerganov#9800)

* gguf : deprecate old FIM token KVs
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels: examples, python (python script changes), server