Remove Instruct/Chat versions of models & introduce a new ChatTemplate API, fix Anthropic API #820
Conversation
…methods. Add some basic tests on caching functionality.
…ort all local models now.
…ew guidance ChatTemplate API to Anthropic models.
-for (
-    byte
-) in (
-    node.keys()
-): # we update all the children since the parser knows the full mask
+# we update all the children since the parser knows the full mask
+for byte in node.keys():
some of these changes are just undoing particularly egregious dissections that black did... e.g. it's wild that this loop iterator was split into 5 lines!
realizing now that it makes this particular PR much harder to parse though, apologies for that!
I'm guessing that that was done by the long trailing comment. That seems to be something that black is not overly keen on.
Did you just shift the comment and re-run black?
Yeah I ended up manually moving trailing comments up a line, but it'd be good to see if there's an automatic way to make black handle it.
class Llama2ChatTemplate(ChatTemplate):
    # available_roles = ["system", "user", "assistant"]
    template_str = llama2_template

    def get_role_start(self, role_name):
        if role_name == "system":
            return "[INST] <<SYS>>\n"
        elif role_name == "user":
            return "<s>[INST]"
        elif role_name == "assistant":
            return " "
        else:
            raise UnsupportedRoleException(role_name, self)

    def get_role_end(self, role_name=None):
        if role_name == "system":
            return "\n<</SYS>"
        elif role_name == "user":
            return " [/INST]"
        elif role_name == "assistant":
            return "</s>"
        else:
            raise UnsupportedRoleException(role_name, self)
this template is likely an oversimplification of the hf template string, I'll need to debug more and extend this.
Don't forget to add the tests you develop during your debugging.
Hmm, test failures are related to needing to authenticate on huggingface to load some model tokenizers. I added these new tests in to check the ChatTemplateCache. We could disable these tests, or (better) figure out a way to pass a huggingface auth token here....
@riedgar-ms thoughts here?
        self._cache = {}

    def __getitem__(self, key):
        key_compact = key.replace(" ", "")
Minor point: will this collapse multiple consecutive spaces?
nah it's a good catch, I'm actually not sure we need to be doing this. I didn't want minor differences in jinja formats (which I believe are whitespace agnostic for parts of them) to cause different mappings in the cache, but maybe there are places where an extra space is actually a meaningful difference?
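To make the distinction concrete, a small illustration of the two behaviors under discussion (the jinja fragment below is invented, not an actual template key):

```python
import re

key = "{{ messages }}  {% if add_generation_prompt %}"
key.replace(" ", "")      # '{{messages}}{%ifadd_generation_prompt%}' -- every space removed, runs included
re.sub(r"\s+", " ", key)  # '{{ messages }} {% if add_generation_prompt %}' -- runs collapsed to one space
```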
That was a question I asked before I caught sight of the actual keys you were using....
Given the changes to Anthropic, does that mean we now have access to API keys which we can use in the builds?
I just made a personal account and paid for $20 of credits out of pocket to test it... not sure it's wise to depend on that for our CI/CD (especially given how frequently they run :| ).
I'm quite happy to spend your money.....
You can add the HF auth token to the repo as a secret (like we did for the AzureAI studio secrets), and then access it as an environment variable within the test. I believe I have an
Nice thanks, I changed it to use the env variable and your helper function.
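For reference, the pattern being described looks roughly like this (the env-var name and model id are illustrative; the repo's actual helper function may differ):

```python
import os
import pytest
import transformers

def test_gated_tokenizer_loads():
    # skip cleanly when no HuggingFace token is available in the environment
    hf_token = os.getenv("HF_TOKEN")
    if not hf_token:
        pytest.skip("No HuggingFace auth token available")
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf", token=hf_token
    )
    assert tokenizer.chat_template is not None
```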
Have suggested a way of getting some of those Phi3 tests running in the gates
lm += lm.get_role_start(role_name, **kwargs)

# TODO [HN]: Temporary change while I instrument chat_template in transformers only.
Is this still the case (and below)?
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, token=hf_token)
model_chat_template = tokenizer.chat_template
if should_pass:
Would separating things out and using pytest's XFAIL be better here?
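i.e. something along these lines, where the expected-to-fail cases are marked instead of branching on a should_pass flag (model ids below are placeholders):

```python
import pytest

@pytest.mark.parametrize(
    "model_id",
    [
        "model-with-known-template",
        pytest.param("model-with-unknown-template", marks=pytest.mark.xfail),
    ],
)
def test_chat_template_lookup(model_id):
    # body would load the tokenizer's chat_template and look it up in the cache
    ...
```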
api_key="blah", | ||
) | ||
assert isinstance(initialized_model, model_class) | ||
# This is all redundant with the class unification |
Better deleted than commented out
tests/models/test_transformers.py
# directly passing in newlines next to special tokens for a tokenizer that does rstrip on those tokens
# (like phi-3) will cause a tokenization mismatch issue.
# We're leaving this test in so that we can reliably reproduce and debug this in the future.
with pytest.raises(Exception):
Please can you assert something about the exception?
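For example, something like this, assuming the repro raises a specific error type with a recognizable message (both are placeholders here, not the actual exception raised):

```python
import pytest

def trigger_tokenization_mismatch():
    # stand-in for the phi-3 newline/special-token repro referenced above
    raise ValueError("token mismatch between prompt and re-tokenized text")

# assert on the type and a message fragment, rather than bare Exception
with pytest.raises(ValueError, match="mismatch"):
    trigger_tokenization_mismatch()
```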
tests/models/test_transformers.py
assert "5" in lm["five"] | ||
|
||
|
||
@pytest.mark.skip("Don't overload the build machines") |
I'm getting nervous about the number of tests added which are also skipped. Can you do something like:
guidance/tests/models/test_llama_cpp.py
Lines 10 to 15 in 13270bf
@pytest.fixture(scope="module")
def llamacpp_model(selected_model, selected_model_name):
    if selected_model_name in ["hfllama7b", "hfllama_7b_gpu"]:
        return selected_model
    else:
        pytest.skip("Requires Llama-Cpp model")
So that these tests will run whenever there's a Phi3 model active in selected_model?
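Presumably something like the following, mirroring the fixture above (the selected_model_name value for Phi3 is a guess, not the actual name used in conftest):

```python
import pytest

@pytest.fixture(scope="module")
def phi3_model(selected_model, selected_model_name):
    # hypothetical Phi3 analogue of llamacpp_model; adjust the name(s) to match conftest
    if selected_model_name in ["transformers_phi3_mini"]:
        return selected_model
    else:
        pytest.skip("Requires Phi3 model")
```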
Codecov Report
Attention: Patch coverage is …

@@            Coverage Diff            @@
##             main     #820     +/-  ##
=========================================
+ Coverage   57.02%   57.34%   +0.32%
=========================================
  Files          57       56       -1
  Lines        4193     4194       +1
=========================================
+ Hits         2391     2405      +14
+ Misses       1802     1789      -13

View full report in Codecov by Sentry.
@paulbkoch I think that everything is going to pass eventually on this. Per the discussion thread, I'll have to see about converting some more models over to the new paradigm.
…emplate on remote models
This PR should significantly reduce the number of user-facing classes we have in Guidance, and reduce subtle bugs introduced by using a mis-specified Chat Template (models currently silently default to the ChatML syntax, which many of the latest models don't adhere to). It should also make it easier for users to add new models to guidance, either via PR or in their own codebases.
Before:
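For example (a minimal sketch; the class name reflects the pre-PR chat-specific variants and the model id is just illustrative):

```python
from guidance import models

# old pattern: pick the chat-specific class variant yourself
lm = models.TransformersChat("microsoft/Phi-3-mini-4k-instruct")
```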
After:
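Something like this (again a sketch, with an illustrative model id):

```python
from guidance import models

# new pattern: one class; the chat template is detected from the model's own metadata
lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct")
```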
If you're using a rare model and the auto import doesn't automatically work...
After pt2:
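e.g. (sketch; the exact import path for the built-in templates is an assumption based on this PR):

```python
from guidance import models
from guidance._chat import Llama2ChatTemplate  # assumed location of the built-in templates

# explicitly pick one of guidance's predefined templates
lm = models.LlamaCpp("path/to/rare-llama2-finetune.gguf", chat_template=Llama2ChatTemplate)
```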
or, in the worst case for maximal robustness and customizability:
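A sketch of what that could look like (the role markers below are placeholders, not any particular model's, and the import path is an assumption):

```python
from guidance import models
from guidance._chat import ChatTemplate, UnsupportedRoleException  # assumed exports

class CustomChatTemplate(ChatTemplate):
    template_str = "..."  # the model's raw jinja2 chat template, if available

    def get_role_start(self, role_name):
        if role_name == "user":
            return "<|user|>\n"
        elif role_name == "assistant":
            return "<|assistant|>\n"
        raise UnsupportedRoleException(role_name, self)

    def get_role_end(self, role_name=None):
        return "<|end|>\n"

lm = models.Transformers("some/rare-model", chat_template=CustomChatTemplate)
```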
The first big change is the removal of the Chat and Instruct mixins, and the introduction of a new guidance._chat.ChatTemplate class, which handles the same responsibilities as those mixins used to. Users can construct a subclass of ChatTemplate and pass it to models with a new chat_template argument (defaulted to None).

The way this works for local models is to leverage the chat_template property in huggingface transformers and in llamacpp's GGUF files. When a user tries to load a model, guidance now follows this order of operations:

1. If the user passed in their own ChatTemplate subclass, use that directly.
2. Otherwise, check the model's template_str against a local cache in guidance which maintains template converters for the most popular models on huggingface/in llama.cpp. We index this cache based on the actual chat_template string, so any model that uses one of these chat templates -- even if it isn't explicitly listed in the documentation -- will automatically load the right guidance class.
3. If there is no match, fall back to the ChatML syntax, with a warning to the user.
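Roughly, that lookup reads like the following sketch (the logic only; names like CHAT_TEMPLATE_CACHE and the stand-in ChatMLTemplate are illustrative, not the actual implementation):

```python
import warnings

class ChatMLTemplate:  # stand-in for guidance's default ChatML template class
    pass

CHAT_TEMPLATE_CACHE = {}  # illustrative: maps raw chat_template strings -> template classes

def resolve_chat_template(chat_template, model_template_str):
    # 1. a user-supplied ChatTemplate subclass always wins
    if chat_template is not None:
        return chat_template
    # 2. otherwise, look the model's own template string up in the cache of known
    #    converters (keyed by the chat_template string from HF / GGUF metadata)
    if model_template_str in CHAT_TEMPLATE_CACHE:
        return CHAT_TEMPLATE_CACHE[model_template_str]
    # 3. no match: fall back to ChatML syntax and warn the user
    warnings.warn("Unrecognized chat template; defaulting to ChatML syntax.")
    return ChatMLTemplate
```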
Currently this PR updates the following user-facing guidance.models classes:

For now, Anthropic should be representative of how grammarless classes will work. I wanted to start with OpenAI, but many other guidance.models classes inherit from OpenAI, so I'm saving that for later. Also, while I was at it, I upgraded the Anthropic class to use their latest SDK, so guidance.models.Anthropic should now work with the latest Claude3 models.

TODO
A decent amount left to do here. In no particular order...
- Anthropic.
- Converting jinja2 templates to guidance ChatTemplate. A battery of unit tests here that compare against the original transformers.apply_chat_template method would make this more robust. Can be in a future PR as this is complex logic. A start to this was attempted in Add TransformersChat code to figure out correct role start and end tokens #791 by @ilmarinen, and we could eventually pull this in and expand its coverage.
- Simplifying guidance.models.llama_cpp and guidance.models.transformers, because we don't need to maintain a bunch of subclasses for them anymore.

Would appreciate any and all feedback, particularly on the logical flow and the new user-facing (simpler) API. @marcotcr @paulbkoch @slundberg @riedgar-ms @hudson-ai