📦 Support for packing tokenized datasets for SFT #2011

kmehant · 2024-09-03T14:04:56Z

What does this PR do?

Fixes #1848
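
For readers skimming the thread, here is a minimal sketch of what "packing" a pre-tokenized dataset means. It is not the code added in this PR, only an illustration loosely modelled on the buffer-and-chunk idea behind `ConstantLengthDataset`. The function name `pack_tokenized`, the toy token ids, and the choice to drop a trailing partial block are all assumptions made for the example.

```python
from typing import Iterable, Iterator


def pack_tokenized(
    examples: Iterable[dict],
    seq_length: int,
    eos_token_id: int,
) -> Iterator[dict]:
    """Concatenate pre-tokenized examples and yield fixed-length blocks.

    Hypothetical helper for illustration; not part of TRL.
    """
    buffer = []
    for example in examples:
        # Each example is assumed to already carry an "input_ids" list,
        # so no tokenizer call (and no text field) is needed.
        buffer.extend(example["input_ids"])
        buffer.append(eos_token_id)  # separator between packed examples
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= seq_length:
            block, buffer = buffer[:seq_length], buffer[seq_length:]
            yield {"input_ids": block, "labels": list(block)}
    # Any trailing partial block is dropped in this sketch.


# Toy stand-in for a tokenized dataset (made-up token ids).
dummy_tokenized_dataset = [
    {"input_ids": [5, 6, 7]},
    {"input_ids": [8, 9]},
    {"input_ids": [10, 11, 12, 13]},
]
packed = list(pack_tokenized(dummy_tokenized_dataset, seq_length=4, eos_token_id=0))
```

Every yielded block has exactly `seq_length` tokens, and an example may be split across two blocks, which is the usual trade-off of this style of packing.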

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. Support packing for pretokenized datasets #1848
Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Did you write any new necessary tests?

Who can review?

@qgallouedec

Anyone from the community!

kmehant · 2024-09-22T15:04:55Z

Requesting a review from anyone who has bandwidth: @qgallouedec @lewtun @kashif @lvwerra or others from the community. Thank you.

The discussion is in #1848.

lewtun

Thanks for the contribution @kmehant! Overall it looks good to me. Would you mind adding an integration test for this scenario?

trl/trainer/sft_trainer.py

trl/trainer/utils.py

HuggingFaceDocBuilderDev · 2024-09-30T07:49:44Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

kmehant · 2024-10-02T06:31:28Z

@lewtun Thank you for your review. I have addressed the review comments and also added test cases. Thank you.

kmehant · 2024-10-24T04:45:18Z

@lewtun @qgallouedec Apologies for the earlier failing tests. They now pass for me locally. Thank you.

kmehant · 2024-10-29T05:39:56Z

@lewtun @qgallouedec I have rebased the branch. Could this be escalated for review? Thanks.

kmehant · 2024-11-12T18:17:39Z

@lewtun @qgallouedec Pinging again for review and merge, thank you.

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant · 2024-11-25T06:14:22Z

@lewtun @qgallouedec It would be a great help to push this forward before it goes stale. It has been open for a long time since the positive discussion with @qgallouedec on issue #1848.

If anyone else from the community has bandwidth, it would be nice if you could tag them, based on our discussion on the issue (#1848).

qgallouedec · 2024-11-25T08:18:10Z

tests/test_sft_trainer.py

+        constant_len_dataset = ConstantLengthDataset(
+            self.tokenizer,
+            self.dummy_tokenized_dataset,
+            dataset_text_field=None,


Suggested change

-            dataset_text_field=None,

qgallouedec · 2024-11-25T08:24:19Z

tests/test_sft_trainer.py

+                                "content": [
+                                    {
+                                        "type": "text",
+                                        "text": "Oh ye, you are right, what is 1+1",
+                                    }
+                                ],


Suggested change

-                                "content": [
-                                    {
-                                        "type": "text",
-                                        "text": "Oh ye, you are right, what is 1+1",
-                                    }
-                                ],
+                                "content": [{"type": "text", "text": "Oh ye, you are right, what is 1+1"}],

qgallouedec · 2024-11-25T08:27:32Z

Sorry for the delay. I'm trying to keep track of all these issues, but it's not easy.

LGTM overall, thanks. Just making a few comments.

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant · 2024-11-25T08:40:51Z

@qgallouedec Thank you for circling back. I have addressed your comments.

Remove packing check as packing support for pretokenised data is merged to trl. See huggingface/trl#2011 Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* Add initial implementation of dataloader v1
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* tests: reformat mock.patch to inside unit tests
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
  fmt
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* Add data config argument to data preprocessor
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* fix: Changes to support current implementation
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* Ensure data handling is done within process dataargs
  Removes unused dead code after adding the new framework and refactors some test cases and files.
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* Remove accelerator in favor of torch distributed check for multi node data preprocessing
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* Refactor data util tests as data handler tests.
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* fix: add __init__.py to add tuning.data to python package
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: multi GPU prepare training dataset
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: lint
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: Add TODO
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* test: add test for process_dataset_configs in HFBasedDataPreProcessor
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* add: test cases for framework
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: update function name get_dataprocessor->get_datapreprocessor
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* Rename loader to processor
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* data folders should be together
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* Add code comments and make code path clearer. Remove packing check as packing support for pretokenised data is merged to trl. See huggingface/trl#2011
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

---------

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Co-authored-by: Will Johnson <mwjohnson728@gmail.com>
Co-authored-by: Abhishek <maurya.abhishek@ibm.com>

kmehant mentioned this pull request Sep 3, 2024

Support packing for pretokenized datasets #1848

Closed

kmehant force-pushed the pack-pretok branch from 90d5236 to 9ddff15 Compare September 3, 2024 14:12

kmehant changed the title ~~feat: add support for packing tokenized datasetS~~ feat: add support for packing tokenized datasets Sep 3, 2024

kmehant force-pushed the pack-pretok branch from 9ddff15 to d7a4ca9 Compare September 9, 2024 06:07

kmehant force-pushed the pack-pretok branch from d7a4ca9 to f6dedb5 Compare September 17, 2024 08:04

lewtun reviewed Sep 23, 2024

kmehant force-pushed the pack-pretok branch from f6dedb5 to 004f128 Compare September 28, 2024 14:37

kmehant force-pushed the pack-pretok branch from 004f128 to dffcda7 Compare October 2, 2024 06:16

kmehant requested a review from lewtun October 2, 2024 06:38

kmehant force-pushed the pack-pretok branch 4 times, most recently from 0d0f34c to c689f24 Compare October 9, 2024 04:26

kmehant force-pushed the pack-pretok branch from c689f24 to 817c933 Compare October 10, 2024 05:09

kmehant force-pushed the pack-pretok branch 2 times, most recently from 95113ee to 25b997d Compare October 24, 2024 04:23

kmehant force-pushed the pack-pretok branch from 25b997d to e3b6976 Compare October 29, 2024 05:39

kmehant force-pushed the pack-pretok branch 3 times, most recently from b13e324 to 3e9110b Compare November 8, 2024 05:33

kmehant force-pushed the pack-pretok branch from 3e9110b to 16466bd Compare November 12, 2024 18:16

kmehant force-pushed the pack-pretok branch from 16466bd to d5c3e28 Compare November 20, 2024 15:33

kmehant added 2 commits November 25, 2024 11:41

feat: add support for packing tokenized datasetS

c4508dd

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

fix: address review comments

5167c51

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant force-pushed the pack-pretok branch from d5c3e28 to 32e42de Compare November 25, 2024 06:11

qgallouedec reviewed Nov 25, 2024

feat: add tests for pretokenized dataset packing

685009f

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant force-pushed the pack-pretok branch from 32e42de to 685009f Compare November 25, 2024 08:39

kmehant requested a review from qgallouedec November 25, 2024 08:44

qgallouedec approved these changes Nov 25, 2024

qgallouedec changed the title ~~feat: add support for packing tokenized datasets~~ 📦 Support for packing tokenized datasets for SFT Nov 25, 2024

qgallouedec merged commit 17e8060 into huggingface:main Nov 25, 2024
13 checks passed

kmehant mentioned this pull request Nov 26, 2024

feat: DataProcessor v1 foundation-model-stack/fms-hf-tuning#381

Merged

2 tasks
