📦 Support for packing tokenized datasets for SFT #2011
Conversation
Force-pushed from d7a4ca9 to f6dedb5.
Requesting a review from anyone with bandwidth: @qgallouedec @lewtun @kashif @lvwerra, or others from the community. Thank you. The related discussion is in #1848.
Thanks for the contribution @kmehant! Overall it looks good to me - would you mind adding an integration test for this scenario?
Force-pushed from f6dedb5 to 004f128.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@lewtun Thank you for your review. I have addressed the review comments and added the test cases.
Force-pushed from 0d0f34c to c689f24.
Force-pushed from 95113ee to 25b997d.
@lewtun @qgallouedec apologies for the failing tests earlier. The tests are now passing for me locally! Thank you.
@lewtun @qgallouedec rebased; hoping this can be picked up for review, thanks.
Force-pushed from b13e324 to 3e9110b.
@lewtun @qgallouedec pinging again for review and merge, thank you.
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
@lewtun @qgallouedec it would be a great help to push this forward before it goes stale, since it has been open for a long time after the positive discussion with @qgallouedec on issue #1848. If you would like to tag anyone else from the community who has bandwidth, based on that discussion (#1848), that would be great!
tests/test_sft_trainer.py (outdated)

    constant_len_dataset = ConstantLengthDataset(
        self.tokenizer,
        self.dummy_tokenized_dataset,
        dataset_text_field=None,
Suggested change:

    dataset_text_field=None,
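For context, here is a hedged, minimal sketch of the scenario this test exercises: packing a dataset that is already tokenized (an `input_ids` column, no text field) with `ConstantLengthDataset` and `dataset_text_field=None`. The `gpt2` tokenizer and the toy dataset below are illustrative stand-ins rather than the fixtures from this PR, and the behavior shown assumes the post-merge handling proposed here.

```python
from datasets import Dataset
from transformers import AutoTokenizer
from trl.trainer import ConstantLengthDataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A pre-tokenized dataset: every row already carries input_ids, no text column.
dummy_tokenized_dataset = Dataset.from_dict(
    {"input_ids": [tokenizer("Hello, how are you?")["input_ids"]] * 16}
)

# With dataset_text_field=None and no formatting_func, the dataset is treated as
# already tokenized (assumption based on this PR) and packed into fixed-length chunks.
packed = ConstantLengthDataset(
    tokenizer,
    dummy_tokenized_dataset,
    dataset_text_field=None,
    seq_length=16,
    infinite=False,
)

for sample in packed:
    # Each yielded sample is a dict of tensors of length seq_length.
    assert sample["input_ids"].shape[0] == 16
    break
```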
tests/test_sft_trainer.py (outdated)

    "content": [
        {
            "type": "text",
            "text": "Oh ye, you are right, what is 1+1",
        }
    ],
Suggested change:

    "content": [{"type": "text", "text": "Oh ye, you are right, what is 1+1"}],
Sorry for the delay. I'm trying to keep track of all these issues, but it's not easy. LGTM overall, thanks. Just making a few comments.
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
@qgallouedec Thank you for circling back; I have addressed your comments.
Remove packing check as packing support for pretokenised data is merged to trl. See huggingface/trl#2011
Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* Add initial implementation of dataloader v1
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* tests: reformat mock.patch to inside unit tests
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
  fmt
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* Add data config argument to data preprocessor
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* fix: Changes to support current implementation
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* Ensure data handling is done within process dataargs
  Removes unused dead code after adding the new framework and refactors some test cases and files.
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* Remove accelerator in favor of torch distributed check for multi node data preprocessing
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* Refactor data util tests as data handler tests.
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* fix: add __init__.py to add tuning.data to python package
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: multi GPU prepare training dataset
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: lint
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: Add TODO
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* test: add test for process_dataset_configs in HFBasedDataPreProcessor
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* add: test cases for framework
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: update function name get_dataprocessor->get_datapreprocessor
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* Rename loader to processor
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* data folders should be together
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
* Add code comments and make code path clearer. Remove packing check as packing support for pretokenised data is merged to trl. See huggingface/trl#2011
  Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
---------
Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Co-authored-by: Will Johnson <mwjohnson728@gmail.com>
Co-authored-by: Abhishek <maurya.abhishek@ibm.com>
What does this PR do?

Fixes #1848

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case: Support packing for pretokenized datasets #1848
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines.

Who can review?

@qgallouedec
Anyone from the community!
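As a rough illustration of the end-to-end workflow this PR targets (not code taken from the PR itself), the sketch below runs SFT with `packing=True` on a dataset that only has an `input_ids` column. The model choice, toy data, and exact config fields are assumptions based on the `SFTTrainer`/`SFTConfig` API around the time of this PR and may differ in later releases.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "gpt2"  # illustrative small model, not necessarily what the PR tests use
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Pre-tokenized training data: no raw text column at all.
train_dataset = Dataset.from_dict(
    {"input_ids": [tokenizer("The quick brown fox jumps over the lazy dog.")["input_ids"]] * 32}
)

args = SFTConfig(
    output_dir="./sft-packed-tokenized",
    packing=True,        # pack the already-tokenized examples
    max_seq_length=32,
    max_steps=5,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized dataset; no dataset_text_field needed
    tokenizer=tokenizer,          # argument name as of the trl release contemporary with this PR
)
trainer.train()
```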