
fix: need to pass skip_prepare_dataset for pretokenized dataset due to breaking change in HF SFTTrainer #326

Merged

Conversation

@HarikrishnanBalagopal (Contributor) commented Sep 2, 2024

Description of the change

Fixes the unit test failure caused by a breaking change in the HF `SFTTrainer`.

Related issue number

Fixes #324 (comment)

How to verify the PR

make test

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass.

@willmj (Collaborator) commented Sep 3, 2024

This seems to be the line in the SFTTrainer code that caused the error on our side.

@willmj (Collaborator) left a review comment

LGTM! I'll try to get another set of eyes on this before merging though.

@anhuong (Collaborator) commented Sep 3, 2024

The issue is that trl published a release 5 days ago that our PRs are now picking up. v0.10.1 includes this line change https://github.com/huggingface/trl/blob/d57e4b726561e5ae58fdc335f34029052944a4a3/trl/trainer/sft_trainer.py#L348: when packing=False (the default in our code), SFTTrainer now expects dataset preparation to be run. For pretokenized datasets, data preparation is not needed, so we need to set skip_prepare_dataset=True. For the other dataset formats we accept, dataset_text_field is set, and data processing is therefore run inside SFTTrainer as before.

Angel had the idea that we could add skip_prepare_dataset=True only in our pretokenized dataset tests. I don't think we can: this is not something the user can set; it is something we need to pass into SFTTrainer for all pretokenized datasets.
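The gate described above, and the fix of passing `dataset_kwargs={"skip_prepare_dataset": True}`, can be sketched as follows. This is a minimal Python stand-in, not the actual trl source — `prepare_dataset` and `_tokenize_stub` are hypothetical names; the real check lives at the sft_trainer.py line linked above.

```python
# Simplified stand-in for the trl v0.10.1 SFTTrainer gate described above.
# Assumption: names here are illustrative, not trl's real internals.

def _tokenize_stub(dataset, text_field):
    # Stand-in for trl's real tokenization of a text dataset.
    return [{"input_ids": list(range(len(row[text_field])))} for row in dataset]

def prepare_dataset(dataset, packing=False, dataset_text_field=None,
                    formatting_func=None, dataset_kwargs=None):
    dataset_kwargs = dataset_kwargs or {}
    if dataset_kwargs.get("skip_prepare_dataset", False):
        return dataset  # pretokenized: hand the data through untouched
    if not packing and dataset_text_field is None and formatting_func is None:
        # The new error path that broke the pretokenized dataset tests.
        raise ValueError(
            "packing=False requires dataset preparation; pass "
            "dataset_kwargs={'skip_prepare_dataset': True} to skip it."
        )
    return _tokenize_stub(dataset, dataset_text_field)  # normal text path

# The fix: for pretokenized datasets, always skip preparation.
pretokenized = [{"input_ids": [101, 7592, 102]}]
prepared = prepare_dataset(pretokenized,
                           dataset_kwargs={"skip_prepare_dataset": True})
print(prepared == pretokenized)  # True: data passes through unchanged
```

In the real fix, the flag is forwarded to the `SFTTrainer` constructor via its `dataset_kwargs` argument whenever the incoming dataset is already tokenized.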

@anhuong anhuong merged commit 05b4003 into foundation-model-stack:main Sep 3, 2024
7 checks passed
aluu317 pushed a commit to aluu317/fms-hf-tuning that referenced this pull request Sep 13, 2024
fix: need to pass skip_prepare_dataset for pretokenized dataset due to breaking change in HF SFTTrainer (foundation-model-stack#326)

* fix: need to pass skip_prepare_dataset for pretokenized dataset due to breaking change in HF SFTTrainer

Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>

* fix: wrong dataset paths, was using non-tokenized data in pre-tokenized dataset tests

Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>

---------

Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>