feat: DataProcessor v1 #381
Conversation
Thanks for making a pull request! 😃
@dushyantbehl Thanks for the PR letting us know about the WIP. Responses to a few comments below would be appreciated.
tuning/data/data_handlers.py
Outdated
def apply_dataset_formatting(
    element: Dict[str, str], tokenizer: AutoTokenizer, dataset_text_field: str, **kwargs
):
    return {
        f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
    }
When raw_datasets = raw_datasets.map(handler, **kwargs) is called with kwargs["batched"] = True, element[f"{dataset_text_field}"] here would be a list, right? So would the condition below make sense?
if isinstance(element[dataset_text_field], list): # batched = True
return {
f"{dataset_text_field}": [
text + tokenizer.eos_token for text in element[f"{dataset_text_field}"]
]
}
return {
f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
}
A similar case would also need to be handled in the tokenize_and_apply_instruction_masking logic when kwargs["batched"] = True.
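To illustrate the suggestion above, here is a minimal, self-contained sketch of the batched-aware handler. It assumes only that the tokenizer exposes an eos_token attribute; FakeTokenizer is a stand-in for a real AutoTokenizer, used so the sketch runs without transformers installed.

```python
from types import SimpleNamespace
from typing import Dict, List, Union


def apply_dataset_formatting(
    element: Dict[str, Union[str, List[str]]],
    tokenizer,
    dataset_text_field: str,
    **kwargs,
):
    # With datasets.map(..., batched=True) each column arrives as a list,
    # so append the EOS token to every example in the batch.
    if isinstance(element[dataset_text_field], list):  # batched = True
        return {
            dataset_text_field: [
                text + tokenizer.eos_token
                for text in element[dataset_text_field]
            ]
        }
    # batched = False: the column is a single string.
    return {dataset_text_field: element[dataset_text_field] + tokenizer.eos_token}


# FakeTokenizer: only the eos_token attribute is needed for this sketch.
tok = SimpleNamespace(eos_token="</s>")
single = apply_dataset_formatting({"text": "hello"}, tok, "text")
batched = apply_dataset_formatting({"text": ["a", "b"]}, tok, "text")
```

The same function then works whether map is called with batched=True or batched=False, since the shape of the column value tells it which case it is in.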
Thanks for the catch @Abhishek-TAMU yes I need to update the code below.
@Abhishek-TAMU I consciously reverted this change and made our default handlers run in batched=False mode for now; we can note in the documentation that these handlers should not be used in batched mode. I did this because these and the other handlers we have defined are complex operations that require deconstructing each example from a batch before processing it. That said, I had already implemented a patch to make all handlers accept both batched and non-batched input; it lives in a different branch, so we can cherry-pick it into the current branch if needed.
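The deconstruct-then-recollate approach described above could look something like the following. This is a hypothetical sketch, not the actual patch from the other branch: batch_compatible and add_suffix are illustrative names, and the wrapper assumes every column in the batch has the same length.

```python
from typing import Any, Callable, Dict


def batch_compatible(handler: Callable[..., Dict[str, Any]]):
    """Wrap a per-example handler so it also accepts batched (columnar) input."""

    def wrapped(element: Dict[str, Any], *args, **kwargs):
        first = next(iter(element.values()))
        if not isinstance(first, list):
            # batched=False: a single example, pass straight through.
            return handler(element, *args, **kwargs)
        # batched=True: split the column-oriented batch into per-example dicts.
        keys = list(element.keys())
        rows = [{k: element[k][i] for k in keys} for i in range(len(first))]
        outs = [handler(row, *args, **kwargs) for row in rows]
        # Re-collate the per-example outputs back into columns.
        return {k: [o[k] for o in outs] for k in outs[0]}

    return wrapped


@batch_compatible
def add_suffix(element, suffix="!"):
    # Toy per-example handler used only to exercise the wrapper.
    return {"text": element["text"] + suffix}
```

With such a wrapper the handlers themselves stay simple per-example functions, and batched support becomes a mechanical transformation applied uniformly.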
so we can even cherry pick this to the current branch if needed.
This sounds fine to me.
Thanks for the WIP PR @dushyantbehl, it's looking really nice so far!
Would definitely like to see some unit tests for these new data types and functions using the example configs you provided, as unit tests will only get harder to add as we continue to build on this. Additionally, we need to make sure our current unit tests pass to retain existing behavior. Overall, though, these changes are a great starting point - thanks for your hard work on this!
Thanks Dushyant for the cleanup of the format_dataset function and related test cases.
Thanks Will for the multi-GPU fix. The cleanup and other changes look good to me, as they don't affect the current implementation.
I feel the PR lacks documentation. We should write something up for the common use cases.
We definitely need it, but we will add it in the next PR, where we will expose these features to external users. Currently this is an internal implementation used by the existing exposed interface, so I would recommend going ahead with this PR and following it up with the documentation in the next one.
I am OK, as my comments are addressed. cc: @ashokponkumar @Ssukriti
@kmehant this change has been reverted for now as we need to get
Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
fmt
Removes unused dead code after adding the new framework and refactors some test cases and files.
data preprocessing
Remove packing check as packing support for pretokenised data is merged to trl. See huggingface/trl#2011
LGTM
Description of the change
This PR is to be merged after #398, as it follows those changes. Once #398 is merged, this PR will be rebased on top of it.
This PR provides a data preprocessor framework that will enable the flexibility to easily add more data preprocessing features in the future. This PR covers the following:
format_dataset to process_dataargs
This PR does not explicitly enable any new features in fms-hf-tuning; its purpose is to make it easier to add new features in the future by changing the backend.
Co-Authored by @willmj @Abhishek-TAMU
How to verify the PR
testing/data
train() function
Was the PR tested