Feat: Add support for tokenizer’s or custom jinja chat_template #1970

Conversation

NanoCode012
Collaborator

Continues #1732, since I had no write access to that PR (with the prior author's consent).

@chiragjn

chiragjn and others added 30 commits July 12, 2024 08:42
Summary of changes:

1. Adds `tokenizer_default` as an option for `chat_template` in the
   `chat_template` prompt strategy, which uses the chat template from the
   tokenizer's config (`tokenizer_config.json`)
2. Allows falling back to the chat templates bundled with axolotl when the
   tokenizer does not define one (see the sketch after this list)
3. Adds a Mistral chat template that supports a system message, taken
   from https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja
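
A minimal sketch of the selection behavior described above, assuming a hypothetical `resolve_chat_template` helper; the argument names and the chatml fallback are illustrative assumptions, not the exact code merged here:

```python
def resolve_chat_template(chat_template_setting, tokenizer, axolotl_templates):
    """Return the jinja chat template string used to format conversations."""
    if chat_template_setting == "tokenizer_default":
        # Prefer the template shipped in the tokenizer's own config.
        if getattr(tokenizer, "chat_template", None):
            return tokenizer.chat_template
        # Otherwise fall back to a template bundled with axolotl
        # (the concrete fallback choice here is an assumption).
        return axolotl_templates["chatml"]
    # A named axolotl template (e.g. "chatml", "mistral") resolves directly.
    return axolotl_templates[chat_template_setting]
```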

---

Why?

Many popular models are not trained on the chatml format. As a result, for
the model to learn chatml correctly we have to turn on `train_on_inputs`,
which requires more compute and time. If we can reuse the chat template the
model has already learned, we only need to learn the output tokens.
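
To make the compute point concrete, here is a hedged sketch of the usual label-masking trick: when training only on outputs, the prompt tokens get a label of -100 so the loss ignores them (the token ids below are made up for illustration):

```python
import torch

# Hypothetical token ids: a templated prompt followed by the assistant answer.
prompt_ids = [1, 345, 678]   # rendered with the model's already-learned template
answer_ids = [910, 1112, 2]

input_ids = torch.tensor(prompt_ids + answer_ids)
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100  # -100 is ignored by torch's cross-entropy loss

# With train_on_inputs enabled, prompt tokens would keep real labels and
# contribute to the loss, so the model must also learn the template format.
```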

---

Todo:

- Write tests

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
@@ -53,7 +53,7 @@ def transform_fn(sample, tokenizer=None):
            "role": role_map[sample[field_rejected][field_message_role]],
            "content": sample[field_rejected][field_message_content],
        }
-       dummy_user_message = {"role": "user", "content": "dummy"}
+       dummy_user_message = {"role": "user", "content": "[[dummy_message]]"}
Collaborator

@NanoCode012 why does this need to be a string with double "[[ ... ]]"?

Collaborator Author

I chose this arbitrary marker following a suggestion from levy in the earlier PR.

It's just to prevent an accidental match in the line below (`result["chosen"].find(chosen["content"])`) on the off chance someone's chat actually contains "dummy".

Collaborator

why do we need a dummy user message here? shouldn't this be from the DPO dataset?

Collaborator Author

There is an edge case: if the chat template asserts that the first message must come from the user, as in Phi-3's case, the call below would fail because the first role would be `assistant`. We add the dummy user message so that the chat template runs fine; afterwards, the dummy message is stripped out when `find` is used.

        result["chosen"] = tokenizer.apply_chat_template(


             [chosen], # without the dummy user message
             add_generation_prompt=False,
             chat_template=chat_template_string,
             tokenize=False,
         )

        chosen_strip_index = result["chosen"].find(chosen["content"])
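
Putting the pieces together, a minimal sketch of the workaround (assuming `tokenizer`, `chosen`, and `chat_template_string` as in the snippet above; the merged code may differ in detail):

```python
dummy_user_message = {"role": "user", "content": "[[dummy_message]]"}

# Prepend the dummy turn so user-first templates (e.g. Phi-3's) don't raise.
rendered = tokenizer.apply_chat_template(
    [dummy_user_message, chosen],
    add_generation_prompt=False,
    chat_template=chat_template_string,
    tokenize=False,
)

# Strip everything before the real assistant content, which drops the
# rendered dummy user turn from the final string.
strip_index = rendered.find(chosen["content"])
rendered = rendered[strip_index:]
```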

I added a test to also check for this here:

class TestAssistantDPOChatTemplatePhi3:
    """
    Test class for assistant style datasets with phi-3 prompts using the tokenizer's chat_template strategy.
    """

    def test_phi3_defaults(self, phi3_tokenizer, assistant_dataset):
        # pylint: disable=duplicate-code
        transform_fn = default(
            DictDefault(
                {
                    "chat_template": "tokenizer_default",
                    "datasets": [
                        {
                            "type": "chat_template",
                        }
                    ],
                }
            )
        )
        result = transform_fn(assistant_dataset[0], tokenizer=phi3_tokenizer)
        assert result["prompt"] == (
            "<|user|>\nhello<|end|>\n"
            + "<|assistant|>\nhello<|end|>\n"
            + "<|user|>\ngoodbye<|end|>\n"
            + "<|assistant|>\n"
        )
        assert result["chosen"] == "goodbye<|end|>"
        assert result["rejected"] == "party on<|end|>"

If you were to remove the dummy message, the test would fail.

Ref: #1732 (comment)

@NanoCode012
Collaborator Author

Will do another CI check once the base images are updated, to ensure it doesn't break anything.

@NanoCode012
Collaborator Author

New CI passed: https://github.com/axolotl-ai-cloud/axolotl/actions/runs/11433735390. Adding the ready-to-merge tag.

@chiragjn
Contributor

chiragjn commented Oct 28, 2024

Bumping this up @winglian 😅

@NanoCode012 NanoCode012 merged commit bfc77b0 into axolotl-ai-cloud:main Oct 29, 2024
13 checks passed
@NanoCode012 NanoCode012 deleted the cj_tokenizer_default_prompt_template branch October 29, 2024 03:14