Integration of Jinja2 Templating #875
Conversation
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
- Simplify the `llama2_template` in `llama_jinja_format.py` by removing unnecessary line breaks for readability without affecting functionality.
- Update the `ChatFormatterInterface` constructor to accept a more generic `Optional[object]` type for the template parameter, enhancing flexibility.
- Introduce a `template` property to `ChatFormatterInterface` for standardized access to the template string.
- Replace the `MetaSingleton` metaclass with `Singleton` for the `ChatFormatterFactory` to streamline the singleton implementation.

These changes enhance code readability, maintain usability, and ensure consistency in the chat formatter's design pattern usage.
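For illustration, here is a minimal sketch of the shape these changes describe. The names `Singleton`, `ChatFormatterInterface`, and `ChatFormatterFactory` come from the commit message above, but the bodies and signatures below are assumptions, not the PR's actual code.

```python
from typing import Dict, Optional


class Singleton(type):
    # Assumed implementation: a metaclass that caches one instance per class.
    _instances: Dict[type, object] = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class ChatFormatterInterface:
    # Constructor accepts a generic Optional[object] template, per the commit message.
    def __init__(self, template: Optional[object] = None):
        self._template = template

    @property
    def template(self) -> Optional[object]:
        # Standardized access to the template.
        return self._template


class ChatFormatterFactory(metaclass=Singleton):
    # Hypothetical registry; the real factory may differ.
    def __init__(self) -> None:
        self._formatters: Dict[str, ChatFormatterInterface] = {}

    def register(self, name: str, formatter: ChatFormatterInterface) -> None:
        self._formatters[name] = formatter

    def get(self, name: str) -> ChatFormatterInterface:
        return self._formatters[name]
```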
This looks very clean/simple, and flexible enough to handle everything. And if we allow the user to include a template "as a chat format", in addition to registering names, then it can accommodate things we haven't thought of.
force-pushed from 8c93cf8 to cc0fe43
Hey @teleprint-me this looks great! I think with the merging of ggerganov/llama.cpp#4125 we can now use your approach to automagically get a chat format without having to rely only on presets (assuming the chat formats are included for new quantized models).
I left it separated to avoid conflicts with any changes you implemented. How would you like to handle it?
@teleprint-me next steps I see:
I think this order makes the most sense and avoids breaking backwards compatibility.
Sounds good! I'll do my best to get around to it over the weekend. I haven't had as much free time lately, but this is a high priority for me. I'll see what I can do and keep you in the loop. Let me know if anything changes in the meantime.
@teleprint-me thank you so much, I think this will be very helpful for a lot of people, let me know if you need any extra help! In terms of conflicts there shouldn't be any; I'm just working on some performance features for batching / speculative decoding, and that should all be independent of the chat format work.
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
…ormers compatibility Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
It's good to see this supporting multiple system messages, that's one limitation of the current templates. Some clients insert system messages at positions other than the start to provide instructions when using the chat completion API (e.g. SillyTavern, and the Herika Skyrim mod). Will this PR replace the existing templates? I fixed the system role mapping for one existing format locally but don't want to create unnecessary conflicts. I definitely think Jinja templates are a good way to go. Function templates might also be worth considering for the future, at least for what is passed to the model. The current system for that is fairly complicated.
If I understood correctly, no, it will not replace the currently existing templates, seeing as @abetlen is looking to avoid breaking backwards compatibility with the current API. This makes integration a bit more complicated, which is why I'm taking my time with it. That, and I'm in the middle of a bunch of personal projects, so I only have a limited amount of time to spend on each of them, and that isn't including work. How can I get the metadata from the model's GGUF file so I can extract the chat template if it exists? I didn't see it in the spec... maybe I missed it?
@teleprint-me FWIW I think we can replace the implementation of some of the existing chat formats with this simpler Jinja2 approach; it shouldn't break anything on the API. The multi-modal models and the function calling models don't quite fit into this approach because they're not just using simple prompts, but we can tackle those later.

If a model supports it you should just be able to call:

```python
buflen = 2048  # should be enough, unsure
buf = (ctypes.c_char * buflen)(0)
llama_cpp.llama_model_meta_val_str(llama.model, b"tokenizer.chat_template", buf, ctypes.sizeof(buf))
```

Unfortunately I just checked and it doesn't look to be too popular yet; in theory https://huggingface.co/TheBloke/OpenHermes-2-Mistral-7B-GGUF should support it, but I believe the GGUF files were generated with a llama.cpp version before the PR was merged. One way to test would be if someone re-quantized from the base model, which supports a …
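For completeness, here is a self-contained sketch of the call above. It is an assumption-laden illustration, not code from this PR: it presumes a `llama_cpp.Llama` instance (the `model_path` is a placeholder) and that `llama_model_meta_val_str` returns the value's length, or a negative number when the key is absent.

```python
import ctypes

import llama_cpp
from llama_cpp import Llama

llama = Llama(model_path="path/to/model.gguf")  # placeholder path (assumption)

buflen = 2048  # hopefully large enough for most chat templates
buf = (ctypes.c_char * buflen)()
n = llama_cpp.llama_model_meta_val_str(
    llama.model, b"tokenizer.chat_template", buf, ctypes.sizeof(buf)
)
if n >= 0:
    print(buf.value.decode("utf-8"))  # the raw Jinja2 chat template
else:
    print("tokenizer.chat_template is not present in this GGUF")
```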
Exploring
- Changed attribute name from `self._renderer` to `self._environment`
Hey guys, thanks a lot for working on this 🙏 While doing my investigation for this feature in https://github.com/imartinez/privateGPT I also came to read the parsing of the metadata of the GGUF files, as well as its creation, in the … While reading their implementation, I found out that … Looking at the usages of … Is there a Discord or something like this for …? Do not hesitate to come and say hi at privateGPT's one: … (channel …)
    messages: List[Dict[str, str]],
    **kwargs: Any,
) -> ChatFormatterResponse:
    formatted_sequence = self._environment.render(messages=messages, **kwargs)
During my exploration of existing chat_template values, I found out that, usually, they use functions such as raise_exception. It looks like there might be some elegant solutions to define such a method, leveraging the Jinja env (see https://stackoverflow.com/a/29262304).
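To make that concrete, here is a minimal sketch (not from this PR) of exposing a `raise_exception` helper through the Jinja2 environment, along the lines of the Stack Overflow answer linked above; the template body is purely illustrative:

```python
from jinja2 import Environment


def raise_exception(message: str) -> None:
    # Mirrors what chat templates typically expect raise_exception to do.
    raise ValueError(message)


env = Environment()
env.globals["raise_exception"] = raise_exception  # callable from any template

template = env.from_string(
    "{% if messages | length == 0 %}"
    "{{ raise_exception('At least one message is required') }}"
    "{% endif %}"
    "{{ messages[-1]['content'] }}"
)
print(template.render(messages=[{"role": "user", "content": "Hello!"}]))  # -> Hello!
```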
Otherwise, I guess you can take heavy inspiration from HF's transformers implementation (c.f. the usage guide: https://huggingface.co/docs/transformers/main/chat_templating) of AutoTokenizer.from_pretrained("xxx/model-name").apply_chat_template(chat, tokenize=False).

Examples of chat_templates:
And here is a nice entry-point line in transformers to follow to see how they are rendering this Jinja template (I basically did a Ctrl + F to find it): https://github.com/huggingface/transformers/blob/74a3cebfa51b539bfcfa79b33686cc090b7074e8/src/transformers/tokenization_utils_base.py#L1600
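For reference, the documented transformers usage looks roughly like the following; the model name here is just an example (any model whose tokenizer ships a chat_template works):

```python
from transformers import AutoTokenizer

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "Can you explain chat templates?"},
]

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
# Render the conversation with the model's own Jinja2 chat_template.
print(tokenizer.apply_chat_template(chat, tokenize=False))
```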
I feel like I'm holding this PR hostage and I haven't had time to dig into it. It technically needs to be integrated, so I'm going to open it up for review so you can do what you need to do. Let me know if you need anything. The chat templates should be integrated into the latest GGUFs. I've been testing Mixtral and they show up during the conversion process when the …
Thought I'd add this here because I was experimenting with the latest llama.cpp updates. The new … The new … The following is an example of how to do it at a high level.

```python
"""
main.py - example file to experiment with extracting the chat template from the model's metadata
"""
from __future__ import annotations

from gguf import GGUFReader, Keys


def get_chat_template(model_file: str) -> str:
    reader = GGUFReader(model_file)
    # Access the 'chat_template' field directly using its key
    chat_template_field = reader.fields[Keys.Tokenizer.CHAT_TEMPLATE]
    # Extract the chat template string from the field
    chat_template_memmap = chat_template_field.parts[-1]
    chat_template_string = chat_template_memmap.tobytes().decode("utf-8")
    return chat_template_string


def main() -> None:
    # this is just an exercise to determine how it might be done in practice
    model_file = "models/mistralai/Mixtral-8x7B-Instruct-v0.1/Mixtral-8x7B-Instruct-v0.1-q4_0.gguf"
    chat_template = get_chat_template(model_file)
    print(chat_template)


if __name__ == "__main__":
    main()
```

Which results in:

```
00:49:13 | ~/Local/llama_cpp_client
(.venv) git:(main | Δ) λ python main.py
{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
```

So, …
Hey @teleprint-me thanks for all the work on this, I've merged it in now. I still need to do a little bit to adapt this to the …
Refactor Chat Templating to Utilize Jinja2
Overview
This pull request introduces a significant refactor of the chat templating system within the `llama-cpp-python` project. The primary objective is to simplify template management, enhance flexibility, and minimize dependencies by leveraging Jinja2's templating engine.

Changes Introduced
Code Snippet
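(The PR description's original snippet is not reproduced here. The following is a hedged sketch only, assuming the `llama2_template` string and the formatter classes named in the commits above; the bodies and signatures are illustrative assumptions, not the PR's exact code.)

```python
from jinja2 import Environment

# Illustrative template (assumption); the real llama2_template lives in llama_jinja_format.py.
llama2_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]"
    "{% else %}{{ message['content'] }}"
    "{% endif %}"
    "{% endfor %}"
)


class Llama2Formatter:
    """Hypothetical formatter: renders chat messages with a Jinja2 template."""

    def __init__(self, template: str = llama2_template):
        self._environment = Environment().from_string(template)

    def __call__(self, messages, **kwargs) -> str:
        return self._environment.render(messages=messages, **kwargs)


formatter = Llama2Formatter()
print(formatter([{"role": "user", "content": "Hello, world!"}]))
```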
Benefits
Discussion Points
Conclusion
This update aims to strike an ideal balance between the sophistication of the templating features and the maintenance simplicity desired by the project's contributors and users. I look forward to the community's input on this proposal.