Allow PaddingFree to work with DataCollatorForCompletionOnlyLM #78

fabianlim · 2024-08-30T02:25:07Z

Description

Currently, the padding-free plugin only works with DataCollatorWithFlattening. This PR makes it also work with DataCollatorForCompletionOnlyLM, which is used for the non-pretokenized use case.

In scenarios.yaml, we should be able to use the chat_templates, but set tokenize=False and remove the null settings on the fields required for tokenization.
Verified that the full-FT + padding-free improvement is consistent at 22-23% for the orca benches.
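To illustrate what padding-free collation means here, the following is a minimal sketch in plain Python, not the plugin's or TRL's actual implementation. It assumes each example already carries `input_ids` and completion-only `labels`; the function name `flatten_batch` is hypothetical. Instead of padding every example to the longest length, the examples are concatenated and `position_ids` restart at each boundary.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default


def flatten_batch(examples: List[Dict[str, List[int]]]) -> Dict[str, List[int]]:
    """Concatenate examples into one sequence instead of padding them.

    Hypothetical helper for illustration only. Each example is assumed to
    hold `input_ids` and `labels`, where prompt tokens are already masked.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    position_ids: List[int] = []
    for ex in examples:
        input_ids += ex["input_ids"]
        labels += ex["labels"]
        # Positions restart at 0 so sequence boundaries can be recovered.
        position_ids += list(range(len(ex["input_ids"])))
    return {"input_ids": input_ids, "labels": labels, "position_ids": position_ids}


batch = [
    {"input_ids": [1, 2, 3, 4], "labels": [IGNORE_INDEX, IGNORE_INDEX, 3, 4]},
    {"input_ids": [5, 6], "labels": [IGNORE_INDEX, 6]},
]
flat = flatten_batch(batch)
padded_tokens = len(batch) * max(len(ex["input_ids"]) for ex in batch)
print(len(flat["input_ids"]), "tokens flattened vs", padded_tokens, "tokens padded")
```

Because the prompt of each example is masked, the first token of every concatenated sequence carries an ignored label, so no loss is computed across sequence boundaries in this sketch.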

Note:

There are some inconsistent values in the benchmarks for PF with FOAK, tracked in "Inconsistency in Padding-Free Benchmarks with Different Transformers Versions" (#70).

Tests on Flan Subset (6000 samples)

Verified that the dataset in data_cache.json is only formatted, not tokenized. To ensure that the loss is masked, added the keyword 'RESPONSE:' to the chat template and used it as the response template that DataCollatorForCompletionOnlyLM uses to mask the loss.

Example extracted from dataset['train']['output'][0]

'Write the response. A 2 person conversation: -- Who was selected with the 5th pick in the 1974 NBA draft?. --  RESPONSE: Five other players from this draft, 6th pick Scott Wedman, 8th pick Campy Russell , 12th pick Brian Winters, 21st pick Billy Knight and 25th pick John Drew, were also selected to at least one All-Star Game.'
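The masking step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the collator's real code: `mask_prompt` and `encode` are hypothetical names, whitespace-split words stand in for real tokenizer ids, and the sketch matches the first occurrence of the response template. Everything up to and including the template is set to the ignore index so that only the response contributes to the loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default


def mask_prompt(input_ids: List[int], response_ids: List[int]) -> List[int]:
    """Return labels where the prompt and the response template are ignored."""
    n, m = len(input_ids), len(response_ids)
    for start in range(n - m + 1):
        if input_ids[start:start + m] == response_ids:
            end = start + m
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: ignore the whole example in the loss.
    return [IGNORE_INDEX] * n


# Toy "tokenizer" (assumption): one id per whitespace-separated word.
vocab: Dict[str, int] = {}


def encode(text: str) -> List[int]:
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


# Abridged toy text in the same shape as the dataset example.
ids = encode("Write the response. -- RESPONSE: Five other players")
labels = mask_prompt(ids, encode("RESPONSE:"))
print(labels)
```

With a real tokenizer the response template may span several token ids, which is why the sketch compares a slice rather than a single id.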

Verified that passing an untokenized dataset to SFTTrainer matches the previous padding-free performance with a pretokenized dataset.

Untokenized FLAN Dataset

| Framework Config | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup |
| --- | --- | --- | --- | --- |
| full-FT | 2 | 4 | 1516 | baseline |
| padding-free | 2 | 4 | 848 | 1.78x |
| padding-free + multipack | 2 | 4 | 747 | 2.02x |

Tokenized FLAN Dataset

| Framework Config | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup |
| --- | --- | --- | --- | --- |
| full-FT | 2 | 4 | 1537 | baseline |
| padding-free | 2 | 4 | 859 | 1.79x |
| padding-free + multipack | 2 | 4 | 751 | 2.05x |
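For reference, the speedup column appears to be the full-FT runtime divided by each configuration's runtime. The short sketch below recomputes it from the runtimes reported in the two tables; the helper name `speedup` is hypothetical, and the recomputed values may differ from the reported ones in the last digit, presumably because the reported runtimes or speedups were rounded or truncated.

```python
def speedup(baseline_secs: float, runtime_secs: float) -> float:
    """Speedup relative to the baseline: values above 1 mean faster training."""
    return baseline_secs / runtime_secs


# Train runtimes (secs) copied from the tables above.
runs = {
    "untokenized": {"full-FT": 1516, "padding-free": 848, "padding-free + multipack": 747},
    "tokenized": {"full-FT": 1537, "padding-free": 859, "padding-free + multipack": 751},
}

for dataset, runtimes in runs.items():
    baseline = runtimes["full-FT"]
    for config, secs in runtimes.items():
        print(f"{dataset:12s} {config:26s} {speedup(baseline, secs):.2f}x")
```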

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

scripts/benchmarks/benchmark.py

scripts/benchmarks/data_processing.py

fabianlim · 2024-08-31T08:59:36Z

@achew010 can you do a sanity check and open data_cache.json to confirm that it was not tokenized when tokenize=False?
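One way such a sanity check could be scripted is sketched below. The layout of data_cache.json is not shown in this PR, so the sketch assumes it is a JSON array of records; the column names in `TOKEN_COLUMNS` and the helpers `looks_tokenized` and `load_records` are hypothetical and chosen only for illustration.

```python
import json
from typing import Any, Dict, List

# Columns a tokenizer typically adds (assumption, not taken from the PR).
TOKEN_COLUMNS = {"input_ids", "attention_mask", "labels"}


def looks_tokenized(record: Dict[str, Any]) -> bool:
    """Heuristic: a record is tokenized if it has token columns or integer lists."""
    if TOKEN_COLUMNS & set(record):
        return True
    return any(
        isinstance(value, list) and value and all(isinstance(x, int) for x in value)
        for value in record.values()
    )


def load_records(path: str) -> List[Dict[str, Any]]:
    """Load records, assuming the cache file is a JSON array of objects."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return data if isinstance(data, list) else [data]


def tokenized_indices(records: List[Dict[str, Any]]) -> List[int]:
    """Indices of records that look tokenized; empty when the cache is text only."""
    return [i for i, rec in enumerate(records) if looks_tokenized(rec)]
```

Calling `tokenized_indices(load_records("data_cache.json"))` should return an empty list if the cache holds only formatted text, under the layout assumption above.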

Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

scripts/benchmarks/refs_orca/requirements.txt

Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

* allow for padding_free logic in LM data collator
  Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* minor fixes to support non-pretok benchmarks
  Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>
* addressed code review
  Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>
* added trl dependency
  Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>
* fixes to installation of aadp
  Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>
* updated orca pf benchmarks
  Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

---------

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>
Co-authored-by: 1000850000 user <aaron.chew1@ibm.com>

fabianlim marked this pull request as draft August 30, 2024 02:25

allow for padding_free logic in LM data collator

5fe8dfd

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

fabianlim force-pushed the non-pretok-pf branch from efa046d to 5fe8dfd Compare August 30, 2024 02:28

fabianlim requested a review from achew010 August 30, 2024 03:40

minor fixes to support non-pretok benchmarks

5b02774

Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

achew010 force-pushed the non-pretok-pf branch from 0dc0562 to 5b02774 Compare August 30, 2024 08:33

fabianlim commented Aug 30, 2024

scripts/benchmarks/benchmark.py (outdated)

fabianlim commented Aug 30, 2024

scripts/benchmarks/data_processing.py (outdated)

achew010 added 2 commits September 3, 2024 02:24

addressed code review

bc786c8

Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

added trl dependency

4831b35

Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

achew010 force-pushed the non-pretok-pf branch from 7262036 to 4831b35 Compare September 3, 2024 03:55

achew010 marked this pull request as ready for review September 3, 2024 03:55

achew010 mentioned this pull request Sep 3, 2024

Inconsistency in Padding-Free Benchmarks with Different Transformers Versions #70

Open

fabianlim commented Sep 4, 2024

scripts/benchmarks/refs_orca/requirements.txt (outdated)

achew010 force-pushed the non-pretok-pf branch from 35ccae4 to afa1390 Compare September 4, 2024 06:26

fixes to installation of aadp

fe9c3d1

Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

achew010 force-pushed the non-pretok-pf branch 3 times, most recently from f54550d to 895900b Compare September 5, 2024 04:33

updated orca pf benchmarks

c82945d

Signed-off-by: 1000850000 user <aaron.chew1@ibm.com>

achew010 force-pushed the non-pretok-pf branch from 895900b to c82945d Compare September 5, 2024 04:37

fabianlim merged commit 6028250 into main Sep 5, 2024
6 checks passed

fabianlim deleted the non-pretok-pf branch September 5, 2024 05:53
