
fix: need to pass skip_prepare_dataset for pretokenized dataset due to breaking change in HF SFTTrainer #326

Merged

Conversation

@HarikrishnanBalagopal (Contributor) commented Sep 2, 2024

Description of the change

Fixes the unit test failure caused by a breaking change in the HF `SFTTrainer`.

Related issue number

Fixes #324 (comment)

How to verify the PR

make test

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass.

@willmj (Collaborator) commented Sep 3, 2024

This seems to be the line in the SFTTrainer code that caused the error on our side.

@willmj (Collaborator) left a review comment

LGTM! I'll try to get another set of eyes on this before merging though.

@anhuong (Collaborator) commented Sep 3, 2024

The issue is that trl published a release 5 days ago that our PRs are now picking up. v0.10.1 includes this line change https://github.com/huggingface/trl/blob/d57e4b726561e5ae58fdc335f34029052944a4a3/trl/trainer/sft_trainer.py#L348: when packing=False (the default in our code), SFTTrainer now expects dataset preparation to be run. For pretokenized datasets, data preparation is not needed, so we need to set skip_prepare_dataset=True. For the other dataset formats we accept, dataset_text_field is set, and data processing is therefore run inside SFTTrainer as before.

Angel had the idea that we could add skip_prepare_dataset=True only in our pretokenized dataset tests. I don't think we can: this is not something the user can set; it is something we need to pass into SFTTrainer for all pretokenized datasets.
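The gate described above, and the fix of passing `dataset_kwargs={"skip_prepare_dataset": True}`, can be sketched as follows. This is a minimal Python stand-in, not the actual trl source — `prepare_dataset` and `_tokenize_stub` are hypothetical names; the real check lives at the sft_trainer.py line linked above.

```python
# Simplified stand-in for the trl v0.10.1 SFTTrainer gate described above.
# Assumption: names here are illustrative, not trl's real internals.

def _tokenize_stub(dataset, text_field):
    # Stand-in for trl's real tokenization of a text dataset.
    return [{"input_ids": list(range(len(row[text_field])))} for row in dataset]

def prepare_dataset(dataset, packing=False, dataset_text_field=None,
                    formatting_func=None, dataset_kwargs=None):
    dataset_kwargs = dataset_kwargs or {}
    if dataset_kwargs.get("skip_prepare_dataset", False):
        return dataset  # pretokenized: hand the data through untouched
    if not packing and dataset_text_field is None and formatting_func is None:
        # The new error path that broke the pretokenized dataset tests.
        raise ValueError(
            "packing=False requires dataset preparation; pass "
            "dataset_kwargs={'skip_prepare_dataset': True} to skip it."
        )
    return _tokenize_stub(dataset, dataset_text_field)  # normal text path

# The fix: for pretokenized datasets, always skip preparation.
pretokenized = [{"input_ids": [101, 7592, 102]}]
prepared = prepare_dataset(pretokenized,
                           dataset_kwargs={"skip_prepare_dataset": True})
print(prepared == pretokenized)  # True: data passes through unchanged
```

In the real fix, the flag is forwarded to the `SFTTrainer` constructor via its `dataset_kwargs` argument whenever the incoming dataset is already tokenized.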

@anhuong anhuong merged commit 05b4003 into foundation-model-stack:main Sep 3, 2024
7 checks passed
aluu317 pushed a commit to aluu317/fms-hf-tuning that referenced this pull request Sep 13, 2024
fix: need to pass skip_prepare_dataset for pretokenized dataset due to breaking change in HF SFTTrainer (foundation-model-stack#326)

* fix: need to pass skip_prepare_dataset for pretokenized dataset due to breaking change in HF SFTTrainer

Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>

* fix: wrong dataset paths, was using non-tokenized data in pre-tokenized dataset tests

Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>

---------

Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>