Integration of Jinja2 Templating #875
Conversation
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
- Simplify the `llama2_template` in `llama_jinja_format.py` by removing unnecessary line breaks for readability without affecting functionality.
- Update the `ChatFormatterInterface` constructor to accept a more generic `Optional[object]` type for the template parameter, enhancing flexibility.
- Introduce a `template` property to `ChatFormatterInterface` for standardized access to the template string.
- Replace the `MetaSingleton` metaclass with `Singleton` for the `ChatFormatterFactory` to streamline the singleton implementation.

These changes enhance code readability, maintain usability, and ensure consistency in the chat formatter's design pattern usage.
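For illustration, here is a minimal sketch of the shape these changes describe. The names `Singleton`, `ChatFormatterInterface`, and `ChatFormatterFactory` come from the commit message above, but the bodies and signatures below are assumptions, not the PR's actual code.

```python
from typing import Dict, Optional


class Singleton(type):
    # Assumed implementation: a metaclass that caches one instance per class.
    _instances: Dict[type, object] = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class ChatFormatterInterface:
    # Constructor accepts a generic Optional[object] template, per the commit message.
    def __init__(self, template: Optional[object] = None):
        self._template = template

    @property
    def template(self) -> Optional[object]:
        # Standardized access to the template.
        return self._template


class ChatFormatterFactory(metaclass=Singleton):
    # Hypothetical registry; the real factory may differ.
    def __init__(self) -> None:
        self._formatters: Dict[str, ChatFormatterInterface] = {}

    def register(self, name: str, formatter: ChatFormatterInterface) -> None:
        self._formatters[name] = formatter

    def get(self, name: str) -> ChatFormatterInterface:
        return self._formatters[name]
```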
This looks very clean/simple, and flexible enough to handle everything. And if we allow the user to include a template "as a chat format", in addition to registering names, then it can accommodate things we haven't thought of.
force-pushed from 8c93cf8 to cc0fe43
Hey @teleprint-me this looks great! I think with the merging of ggerganov/llama.cpp#4125 we can now use your approach to automagically get a chat format without having to rely only on presets (assuming the chat formats are included for new quantized models).
I left it separated to avoid conflicts with any changes you implemented. How would you like to handle it?
@teleprint-me next steps I see:
I think this order makes the most sense and avoids breaking backwards compatibility.
Sounds good! I'll do my best to get around to it over the weekend. I haven't had as much free time lately, but this is a high priority for me. I'll see what I can do and keep you in the loop. Let me know if anything changes in the meantime.
@teleprint-me thank you so much, I think this will be very helpful for a lot of people, let me know if you need any extra help! In terms of conflicts there shouldn't be any; I'm just working on some performance features for batching / speculative decoding, and that should all be independent of the chat format work.
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
…ormers compatibility Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
It's good to see this supporting multiple system messages, that's one limitation of the current templates. Some clients insert system messages at positions other than the start to provide instructions when using the chat completion API (e.g. SillyTavern, and the Herika Skyrim mod). Will this PR replace the existing templates? I fixed the system role mapping for one existing format locally but don't want to create unnecessary conflicts. I definitely think Jinja templates are a good way to go. Function templates might also be worth considering for the future, at least for what is passed to the model. The current system for that is fairly complicated.
If I understood correctly, no, it will not replace the currently existing templates, seeing as @abetlen is looking to avoid breaking backwards compatibility with the current API. This makes integration a bit more complicated, which is why I'm taking my time with it. That, and I'm in the middle of a bunch of personal projects, so I only have a limited amount of time to spend on each of them, and that isn't including work. How can I get the metadata from the model's GGUF file so I can extract the chat template if it exists? I didn't see it in the spec... maybe I missed it?
@teleprint-me FWIW I think we can replace the implementation of some of the existing chat formats with this simpler Jinja2 approach; it shouldn't break anything on the API. The multi-modal models and the function calling models don't quite fit into this approach because they're not just using simple prompts, but we can tackle those later.

If a model supports it you should just be able to call:

```python
buflen = 2048  # should be enough, unsure
buf = (ctypes.c_char * buflen)(0)
llama_cpp.llama_model_meta_val_str(llama.model, b"tokenizer.chat_template", buf, ctypes.sizeof(buf))
```

Unfortunately I just checked and it doesn't look to be too popular yet; in theory https://huggingface.co/TheBloke/OpenHermes-2-Mistral-7B-GGUF should support it, but I believe the GGUF files were generated with a llama.cpp version before the PR was merged. One way to test would be if someone re-quantized from the base model, which supports a …
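For completeness, here is a self-contained sketch of the call above. It is an assumption-laden illustration, not code from this PR: it presumes a `llama_cpp.Llama` instance (the `model_path` is a placeholder) and that `llama_model_meta_val_str` returns the value's length, or a negative number when the key is absent.

```python
import ctypes

import llama_cpp
from llama_cpp import Llama

llama = Llama(model_path="path/to/model.gguf")  # placeholder path (assumption)

buflen = 2048  # hopefully large enough for most chat templates
buf = (ctypes.c_char * buflen)()
n = llama_cpp.llama_model_meta_val_str(
    llama.model, b"tokenizer.chat_template", buf, ctypes.sizeof(buf)
)
if n >= 0:
    print(buf.value.decode("utf-8"))  # the raw Jinja2 chat template
else:
    print("tokenizer.chat_template is not present in this GGUF")
```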
Exploring
- Changed attribute name from `self._renderer` to `self._environment`
Hey guys, thanks a lot for working on this 🙏 While doing my investigation for this feature in https://github.com/imartinez/privateGPT I also came to read the parsing of the metadata of the GGUF files, as well as its creation, in the … While reading their implementation, I found out that … Looking at the usages of … Is there a Discord or something like this for …? Do not hesitate to come and say hi at privateGPT's one: … (channel …)
    messages: List[Dict[str, str]],
    **kwargs: Any,
) -> ChatFormatterResponse:
    formatted_sequence = self._environment.render(messages=messages, **kwargs)
During my exploration of existing chat_template values, I found out that, usually, they use functions such as raise_exception. It looks like there might be some elegant solutions to define such a method, leveraging the Jinja env (see https://stackoverflow.com/a/29262304).
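To make that concrete, here is a minimal sketch (not from this PR) of exposing a `raise_exception` helper through the Jinja2 environment, along the lines of the Stack Overflow answer linked above; the template body is purely illustrative:

```python
from jinja2 import Environment


def raise_exception(message: str) -> None:
    # Mirrors what chat templates typically expect raise_exception to do.
    raise ValueError(message)


env = Environment()
env.globals["raise_exception"] = raise_exception  # callable from any template

template = env.from_string(
    "{% if messages | length == 0 %}"
    "{{ raise_exception('At least one message is required') }}"
    "{% endif %}"
    "{{ messages[-1]['content'] }}"
)
print(template.render(messages=[{"role": "user", "content": "Hello!"}]))  # -> Hello!
```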
Otherwise, I guess you can take heavy inspiration from HF's transformers implementation (c.f. the usage guide: https://huggingface.co/docs/transformers/main/chat_templating) of AutoTokenizer.from_pretrained("xxx/model-name").apply_chat_template(chat, tokenize=False).

Examples of chat_templates:
And here is a nice entry-point line in transformers to follow to see how they are rendering this Jinja template (I basically did a Ctrl + F to find it): https://github.com/huggingface/transformers/blob/74a3cebfa51b539bfcfa79b33686cc090b7074e8/src/transformers/tokenization_utils_base.py#L1600
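For reference, the documented transformers usage looks roughly like the following; the model name here is just an example (any model whose tokenizer ships a chat_template works):

```python
from transformers import AutoTokenizer

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "Can you explain chat templates?"},
]

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
# Render the conversation with the model's own Jinja2 chat_template.
print(tokenizer.apply_chat_template(chat, tokenize=False))
```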
I feel like I'm holding this PR hostage and I haven't had time to dig into it. It technically needs to be integrated, so I'm going to open it up for review so you can do what you need to do. Let me know if you need anything. The chat templates should be integrated into the latest GGUFs. I've been testing Mixtral and they show up during the conversion process when the …
Thought I'd add this here because I was experimenting with the latest llama.cpp updates. The new … The new … The following is an example of how to do it at a high level.

```python
"""
main.py - example file to experiment with extracting the chat template from the model's metadata
"""
from __future__ import annotations

from gguf import GGUFReader, Keys


def get_chat_template(model_file: str) -> str:
    reader = GGUFReader(model_file)
    # Access the 'chat_template' field directly using its key
    chat_template_field = reader.fields[Keys.Tokenizer.CHAT_TEMPLATE]
    # Extract the chat template string from the field
    chat_template_memmap = chat_template_field.parts[-1]
    chat_template_string = chat_template_memmap.tobytes().decode("utf-8")
    return chat_template_string


def main() -> None:
    # this is just an exercise to determine how it might be done in practice
    model_file = "models/mistralai/Mixtral-8x7B-Instruct-v0.1/Mixtral-8x7B-Instruct-v0.1-q4_0.gguf"
    chat_template = get_chat_template(model_file)
    print(chat_template)


if __name__ == "__main__":
    main()
```

Which results in:

```
00:49:13 | ~/Local/llama_cpp_client
(.venv) git:(main | Δ) λ python main.py
{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
```

So, …
Hey @teleprint-me thanks for all the work on this, I've merged it in now. I still need to do a little bit to adapt this to the …
Refactor Chat Templating to Utilize Jinja2
Overview
This pull request introduces a significant refactor of the chat templating system within the `llama-cpp-python` project. The primary objective is to simplify template management, enhance flexibility, and minimize dependencies by leveraging Jinja2's templating engine.

Changes Introduced
Code Snippet
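(The PR description's original snippet is not reproduced here. The following is a hedged sketch only, assuming the `llama2_template` string and the formatter classes named in the commits above; the bodies and signatures are illustrative assumptions, not the PR's exact code.)

```python
from jinja2 import Environment

# Illustrative template (assumption); the real llama2_template lives in llama_jinja_format.py.
llama2_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]"
    "{% else %}{{ message['content'] }}"
    "{% endif %}"
    "{% endfor %}"
)


class Llama2Formatter:
    """Hypothetical formatter: renders chat messages with a Jinja2 template."""

    def __init__(self, template: str = llama2_template):
        self._environment = Environment().from_string(template)

    def __call__(self, messages, **kwargs) -> str:
        return self._environment.render(messages=messages, **kwargs)


formatter = Llama2Formatter()
print(formatter([{"role": "user", "content": "Hello, world!"}]))
```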
Benefits
Discussion Points
Conclusion
This update aims to strike an ideal balance between the sophistication of the templating features and the maintenance simplicity desired by the project's contributors and users. I look forward to the community's input on this proposal.