feat: DataProcessor v1 #381
Conversation
Thanks for making a pull request! 😃
@dushyantbehl Thanks for the PR letting us know about the WIP. Responses to a few comments below would be appreciated.
tuning/data/data_handlers.py
Outdated
def apply_dataset_formatting(
    element: Dict[str, str], tokenizer: AutoTokenizer, dataset_text_field: str, **kwargs
):
    return {
        f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
    }
When raw_datasets = raw_datasets.map(handler, **kwargs) is called with kwargs["batched"] = True, element[f"{dataset_text_field}"] here would be a list, right? So would the condition below make sense?
if isinstance(element[dataset_text_field], list): # batched = True
return {
f"{dataset_text_field}": [
text + tokenizer.eos_token for text in element[f"{dataset_text_field}"]
]
}
return {
f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
}
A similar case would also need to be handled in the tokenize_and_apply_instruction_masking logic when kwargs["batched"] = True.
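To illustrate the suggestion above, here is a minimal, self-contained sketch of the batched-aware handler. It assumes only that the tokenizer exposes an eos_token attribute; FakeTokenizer is a stand-in for a real AutoTokenizer, used so the sketch runs without transformers installed.

```python
from types import SimpleNamespace
from typing import Dict, List, Union


def apply_dataset_formatting(
    element: Dict[str, Union[str, List[str]]],
    tokenizer,
    dataset_text_field: str,
    **kwargs,
):
    # With datasets.map(..., batched=True) each column arrives as a list,
    # so append the EOS token to every example in the batch.
    if isinstance(element[dataset_text_field], list):  # batched = True
        return {
            dataset_text_field: [
                text + tokenizer.eos_token
                for text in element[dataset_text_field]
            ]
        }
    # batched = False: the column is a single string.
    return {dataset_text_field: element[dataset_text_field] + tokenizer.eos_token}


# FakeTokenizer: only the eos_token attribute is needed for this sketch.
tok = SimpleNamespace(eos_token="</s>")
single = apply_dataset_formatting({"text": "hello"}, tok, "text")
batched = apply_dataset_formatting({"text": ["a", "b"]}, tok, "text")
```

The same function then works whether map is called with batched=True or batched=False, since the shape of the column value tells it which case it is in.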
Thanks for the catch @Abhishek-TAMU yes I need to update the code below.
@Abhishek-TAMU I consciously reverted this change and made our default handlers run in batched=False mode for now; we can note in the documentation that these handlers should not be used in batched mode. I did this because these and the other handlers we have defined are complex operations that require deconstructing each example from a batch before processing it. That said, I had already implemented a patch to make all handlers accept both batched and non-batched input; it lives in a different branch, so we can cherry-pick it into the current branch if needed.
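The deconstruct-then-recollate approach described above could look something like the following. This is a hypothetical sketch, not the actual patch from the other branch: batch_compatible and add_suffix are illustrative names, and the wrapper assumes every column in the batch has the same length.

```python
from typing import Any, Callable, Dict


def batch_compatible(handler: Callable[..., Dict[str, Any]]):
    """Wrap a per-example handler so it also accepts batched (columnar) input."""

    def wrapped(element: Dict[str, Any], *args, **kwargs):
        first = next(iter(element.values()))
        if not isinstance(first, list):
            # batched=False: a single example, pass straight through.
            return handler(element, *args, **kwargs)
        # batched=True: split the column-oriented batch into per-example dicts.
        keys = list(element.keys())
        rows = [{k: element[k][i] for k in keys} for i in range(len(first))]
        outs = [handler(row, *args, **kwargs) for row in rows]
        # Re-collate the per-example outputs back into columns.
        return {k: [o[k] for o in outs] for k in outs[0]}

    return wrapped


@batch_compatible
def add_suffix(element, suffix="!"):
    # Toy per-example handler used only to exercise the wrapper.
    return {"text": element["text"] + suffix}
```

With such a wrapper the handlers themselves stay simple per-example functions, and batched support becomes a mechanical transformation applied uniformly.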
so we can even cherry pick this to the current branch if needed.
This sounds fine to me.
Thanks for the WIP PR @dushyantbehl, it's looking really nice so far!
Would definitely like to see some unit tests for these new data types and functions using the example configs you provided, as unit tests will only get harder to add as we continue to build on this. Additionally, we need to make sure our current unit tests pass to retain existing behavior. Overall, though, these changes are a great starting point - thanks for your hard work on this!
Thanks Dushyant for the cleanup of the format_dataset function and related test cases.
Thanks Will for the multi-GPU fix. The cleanup and other changes look good to me, as they don't affect the current implementation.
I feel the PR lacks documentation. We should write something up for the common use cases.
We definitely need it, but we will add it in the next PR, where we will expose these features to external users. Currently this is an internal implementation used by the existing exposed interface, so I would recommend going ahead with this PR and following it up with the documentation in the next one.
I am OK, as my comments are addressed. cc: @ashokponkumar @Ssukriti
@kmehant this change has been reverted for now as we need to get
Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
fmt
Removes unused dead code after adding the new framework and refactors some test cases and files.
data preprocessing
Remove packing check as packing support for pretokenised data is merged to trl. See huggingface/trl#2011
LGTM
Description of the change
This PR is to be merged after #398, as it follows those changes. Once #398 is merged, this PR will be rebased on top of it.
This PR provides a data preprocessor framework that will enable the flexibility to easily add more data preprocessing features in the future. This PR covers the following:
format_dataset to process_dataargs
This PR does not explicitly enable any new features in fms-hf-tuning; its purpose is to make it easier to add new features in the future by changing the backend.
Co-Authored by @willmj @Abhishek-TAMU
How to verify the PR
testing/data
train() function
Was the PR tested