Feat: Add support for tokenizer’s or custom jinja chat_template #1970

Conversation

NanoCode012
Collaborator

Continues #1732, since I had no write access to that PR (with the prior author's consent).

@chiragjn

chiragjn and others added 30 commits July 12, 2024 08:42
Summary of changes:

1. Adds `tokenizer_default` as an option for `chat_template` in the
   `chat_template` prompt strategy, which uses the chat template from the
   tokenizer's config (`tokenizer_config.json`)
2. Allows falling back to the chat templates bundled with axolotl when the
   tokenizer does not define one (see the sketch after this list)
3. Adds a Mistral chat template that supports a system message, taken
   from https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja
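
A minimal sketch of the selection behavior described above, assuming a hypothetical `resolve_chat_template` helper; the argument names and the chatml fallback are illustrative assumptions, not the exact code merged here:

```python
def resolve_chat_template(chat_template_setting, tokenizer, axolotl_templates):
    """Return the jinja chat template string used to format conversations."""
    if chat_template_setting == "tokenizer_default":
        # Prefer the template shipped in the tokenizer's own config.
        if getattr(tokenizer, "chat_template", None):
            return tokenizer.chat_template
        # Otherwise fall back to a template bundled with axolotl
        # (the concrete fallback choice here is an assumption).
        return axolotl_templates["chatml"]
    # A named axolotl template (e.g. "chatml", "mistral") resolves directly.
    return axolotl_templates[chat_template_setting]
```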

---

Why?

Many popular models are not trained on the chatml format. As a result, for
the model to learn chatml correctly we have to turn on `train_on_inputs`,
which requires more compute and time. If we can reuse the chat template the
model has already learned, we only need to learn the output tokens.
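
To make the compute point concrete, here is a hedged sketch of the usual label-masking trick: when training only on outputs, the prompt tokens get a label of -100 so the loss ignores them (the token ids below are made up for illustration):

```python
import torch

# Hypothetical token ids: a templated prompt followed by the assistant answer.
prompt_ids = [1, 345, 678]   # rendered with the model's already-learned template
answer_ids = [910, 1112, 2]

input_ids = torch.tensor(prompt_ids + answer_ids)
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100  # -100 is ignored by torch's cross-entropy loss

# With train_on_inputs enabled, prompt tokens would keep real labels and
# contribute to the loss, so the model must also learn the template format.
```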

---

Todo:

- Write tests

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
@@ -53,7 +53,7 @@ def transform_fn(sample, tokenizer=None):
            "role": role_map[sample[field_rejected][field_message_role]],
            "content": sample[field_rejected][field_message_content],
        }
-       dummy_user_message = {"role": "user", "content": "dummy"}
+       dummy_user_message = {"role": "user", "content": "[[dummy_message]]"}
Collaborator

@NanoCode012 why does this need to be a string with double "[[ ... ]]"?

Collaborator Author

I chose this arbitrary marker following a suggestion from levy in the earlier PR.

It's just to prevent an accidental match in the line below (`result["chosen"].find(chosen["content"])`) on the off chance someone's chat actually contains "dummy".

Collaborator

why do we need a dummy user message here? shouldn't this be from the DPO dataset?

Collaborator Author

There is an edge case: if the chat template asserts that the first message must come from the user, as in Phi-3's case, the call below would fail because the first role would be `assistant`. We add the dummy user message so that the chat template runs fine; afterwards, the dummy message is stripped out when `find` is used.

        result["chosen"] = tokenizer.apply_chat_template(


             [chosen], # without the dummy user message
             add_generation_prompt=False,
             chat_template=chat_template_string,
             tokenize=False,
         )

        chosen_strip_index = result["chosen"].find(chosen["content"])
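
Putting the pieces together, a minimal sketch of the workaround (assuming `tokenizer`, `chosen`, and `chat_template_string` as in the snippet above; the merged code may differ in detail):

```python
dummy_user_message = {"role": "user", "content": "[[dummy_message]]"}

# Prepend the dummy turn so user-first templates (e.g. Phi-3's) don't raise.
rendered = tokenizer.apply_chat_template(
    [dummy_user_message, chosen],
    add_generation_prompt=False,
    chat_template=chat_template_string,
    tokenize=False,
)

# Strip everything before the real assistant content, which drops the
# rendered dummy user turn from the final string.
strip_index = rendered.find(chosen["content"])
rendered = rendered[strip_index:]
```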

I added a test to also check for this here:

class TestAssistantDPOChatTemplatePhi3:
    """
    Test class for assistant style datasets with phi-3 prompts using the tokenizer's chat_template strategy.
    """

    def test_phi3_defaults(self, phi3_tokenizer, assistant_dataset):
        # pylint: disable=duplicate-code
        transform_fn = default(
            DictDefault(
                {
                    "chat_template": "tokenizer_default",
                    "datasets": [
                        {
                            "type": "chat_template",
                        }
                    ],
                }
            )
        )
        result = transform_fn(assistant_dataset[0], tokenizer=phi3_tokenizer)
        assert result["prompt"] == (
            "<|user|>\nhello<|end|>\n"
            + "<|assistant|>\nhello<|end|>\n"
            + "<|user|>\ngoodbye<|end|>\n"
            + "<|assistant|>\n"
        )
        assert result["chosen"] == "goodbye<|end|>"
        assert result["rejected"] == "party on<|end|>"

If you were to remove the dummy message, the test would fail.

Ref: #1732 (comment)

@NanoCode012
Collaborator Author

Will do another CI check once the base images are updated, to ensure it doesn't break anything.

@NanoCode012
Collaborator Author

New CI passed: https://github.com/axolotl-ai-cloud/axolotl/actions/runs/11433735390. Adding the ready-to-merge tag.

@chiragjn
Contributor

chiragjn commented Oct 28, 2024

Bumping this up @winglian 😅

@NanoCode012 NanoCode012 merged commit bfc77b0 into axolotl-ai-cloud:main Oct 29, 2024
13 checks passed
@NanoCode012 NanoCode012 deleted the cj_tokenizer_default_prompt_template branch October 29, 2024 03:14