Allow Kernels for Full FT and Non-Quantized PEFT #79

Merged
19 commits merged into main on Sep 16, 2024
Conversation

@fabianlim (Contributor) commented on Aug 30, 2024

Description

This PR:

  1. upgrades the framework to perform OR logic when activating plugins
  2. creates a FastKernelsAccelerationPlugin, an improved version of FastQuantizedPeftAccelerationPlugin
    • it can add kernels individually
    • it can be activated under a training stanza or a peft.quantized stanza (a configuration sketch follows this list)
  3. adds FOAK support to the Full-Finetuning and Standard PEFT benchmarks
  4. adds FOAK support for 1 additional model
    • GPTBigCode
      • Note that due to GPTBigCode architecture limitations, only FastCrossEntropyLoss is supported in this PR. Additional support will be tracked in [placeholder issue]
  5. fixes a ModelPatcher bug that caused multiple reloads of the same target path
    • This affected the proper patching of FastCrossEntropyLoss
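
For illustration only, here is a minimal sketch of the two activation paths for FastKernelsAccelerationPlugin and of the OR activation logic, written as Python dicts mirroring the YAML stanzas. The stanza and flag names (`fused_ops_and_kernels`, `fast_loss`, `fast_rms_layernorm`, `fast_rope`) and the `is_plugin_active` helper are assumptions made for this sketch, not the authoritative configuration schema of this PR.

```python
# Illustrative sketch only -- the stanza and flag names below are assumptions,
# not the authoritative configuration schema of this PR.

# Activation under a `training` stanza (full FT / standard PEFT):
training_stanza = {
    "training": {
        "fused_ops_and_kernels": {
            "fast_loss": True,            # FastCrossEntropyLoss
            "fast_rms_layernorm": True,   # FastRMSNorm
            "fast_rope": True,            # FastRoPE
        }
    }
}

# Activation under a `peft.quantized` stanza (quantized PEFT path):
quantized_peft_stanza = {
    "peft": {
        "quantized": {
            "fused_ops_and_kernels": {
                "fast_loss": True,
                "fast_rms_layernorm": True,
                "fast_rope": True,
            }
        }
    }
}

def is_plugin_active(config: dict, paths: list[str]) -> bool:
    """OR logic: the plugin activates if *any* of its registered
    configuration paths is present in the config tree."""
    def present(cfg: dict, dotted: str) -> bool:
        node = cfg
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                return False
            node = node[key]
        return True
    return any(present(config, p) for p in paths)

# The plugin registers both paths, so either stanza activates it.
PATHS = ["training.fused_ops_and_kernels", "peft.quantized.fused_ops_and_kernels"]
assert is_plugin_active(training_stanza, PATHS)
assert is_plugin_active(quantized_peft_stanza, PATHS)
```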

Improvements to Full Finetuning

7–10.5% improvement in full-finetuning throughput from the following kernels (FastCrossEntropyLoss, FastRMSNorm, FastRoPE):

| Framework | Model | num gpus | batch size | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|
| fullFT | Mistral7B | 1 | 4 | 2910 | base |
| foak-fullFT | Mistral7B | 1 | 4 | 3218 | 10.5 |
| PEFT | Mistral7B | 1 | 4 | 3345 | base |
| foak-PEFT | Mistral7B | 1 | 4 | 3797 | 13.5 |

| Framework | Model | num gpus | batch size | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|
| fullFT | Mistral7B | 2 | 4 | 2886 | base |
| foak-fullFT | Mistral7B | 2 | 4 | 3093 | 7 |
| PEFT | Mistral7B | 2 | 4 | 3227 | base |
| foak-PEFT | Mistral7B | 2 | 4 | 3620 | 12 |

Compatibility Matrix with Mixed Precision

| torch_dtype | Mixed Precision | Full-FT-FOAK | PEFT-FOAK | QPEFT-FOAK |
|---|---|---|---|---|
| FLOAT16 | - | ✗ Not Allowed | | |
| FLOAT16 | FP16 | ValueError: Attempting to unscale FP16 gradients (see here) | Compatible | Compatible |
| BFLOAT16 | - | | | |
| BFLOAT16 | BF16 | Compatible | Compatible | Less Performant |
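
As a concrete illustration of the FLOAT16 + FP16 row of the matrix, here is a minimal sketch (assuming a standard Hugging Face Trainer setup; the model name is only a placeholder and no training is launched) of the combination that raises the ValueError during full finetuning:

```python
# Minimal sketch of the problematic combination in the matrix above:
# model weights loaded in float16 *and* FP16 mixed precision enabled.
# The model name and trainer wiring are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,   # fp16 master weights
)

args = TrainingArguments(
    output_dir="out",
    fp16=True,                   # FP16 mixed precision -> GradScaler is used
    per_device_train_batch_size=4,
)
# Full finetuning with this pairing fails inside the optimizer step with
#   ValueError: Attempting to unscale FP16 gradients.
# because the AMP GradScaler refuses to unscale gradients that are themselves
# fp16. Loading the model in bfloat16 with bf16=True (last row of the matrix)
# avoids this.
```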

Regression Test for Loss, Memory, Throughput

We ran our alpaca benchmarks with most experiments in bfloat16 (except GPTQ-LoRA, which runs in float16; see issue). We see no significant regression in performance.

Note that an outlier in the comparison plots shows an anomalous memory increase in a standard full-FT experiment on Mistral7B with no accelerations installed. Since it does not point to any issue with the code in this PR, it is likely due to slight instability in that benchmarking run.

Bug Fix to Model Patcher

There is no significant change in FOAK performance from the fix for the improper patching of FastCrossEntropyLoss; however, a slight decrease in improvement is observed (consistent with issue 70) compared to the previous paddingfree+foak numbers.
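
To illustrate the nature of the fix (the class and method names below are hypothetical, not the actual ModelPatcher implementation), the bug amounts to reloading and re-patching the same target path more than once; a seen-set guard is one way to keep a later reload from clobbering an earlier patch:

```python
# Hypothetical sketch of guarding against repeated reloads of the same
# target path; names are illustrative, not the actual ModelPatcher API.
import importlib

class PatchTargetRegistry:
    def __init__(self):
        self._patched_paths: set[str] = set()

    def patch_once(self, module_path: str, attr: str, replacement) -> bool:
        """Patch `module_path.attr` with `replacement`, but only the first
        time this exact target path is seen. Returns True if patched."""
        target = f"{module_path}.{attr}"
        if target in self._patched_paths:
            # A second reload/patch of the same path would clobber an
            # earlier patch (e.g. FastCrossEntropyLoss); skip it instead.
            return False
        module = importlib.import_module(module_path)
        setattr(module, attr, replacement)
        self._patched_paths.add(target)
        return True
```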

FLAN (6000 samples) with PaddingFree

Before BugFix

| Framework | Model | num gpus | batch size | train_runtime (s) | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|---|
| BNB + foak | Mistral7B | 2 | 4 | 1068 | 1328 | base |
| BNB + foak + paddingfree | Mistral7B | 2 | 4 | 605 | 2400 | +43 |
| GPTQ-LoRA + foak | Mistral7B | 2 | 4 | 1034 | 1372 | base |
| GPTQ-LoRA + foak + paddingfree | Mistral7B | 2 | 4 | 587 | 2472 | +43 |

With BugFix

| Framework | Model | num gpus | batch size | train_runtime (s) | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|---|
| BNB + foak | Mistral7B | 2 | 4 | 1038 | 1368 | base |
| BNB + foak + paddingfree | Mistral7B | 2 | 4 | 674 | 2106 | +35 |
| GPTQ-LoRA + foak | Mistral7B | 2 | 4 | 1035 | 1372 | base |
| GPTQ-LoRA + foak + paddingfree | Mistral7B | 2 | 4 | 660 | 2160 | +36 |

Note:
Due to issues with FSDP-QLoRA in the latest transformers version (4.45.0dev) mentioned here, Granite with Fast Kernels will be addressed in a later PR.

TODO

  • add the activation (e.g. SwiGLU) kernels to FastKernelsAccelerationPlugin, following the pattern of building the fused-lora rule for a base_type (see the sketch after this list)
  • add chunked loss (optional); if not done, create an issue to track it
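
As a rough sketch of the first TODO item (purely illustrative; the registry, base_type values, and kernel callables below are hypothetical placeholders, not this repository's API), dispatching an activation-kernel rule on a base_type would mirror how the fused-lora rule is selected:

```python
# Hypothetical sketch of selecting an activation (e.g. SwiGLU) kernel rule
# per base_type, mirroring the fused-lora selection pattern; the registry
# and rule names below are placeholders, not the repository's real API.
from typing import Callable, Dict

ACTIVATION_KERNEL_RULES: Dict[str, Callable] = {}

def register_activation_rule(base_type: str):
    """Register a rule builder for a given base_type."""
    def decorator(fn: Callable) -> Callable:
        ACTIVATION_KERNEL_RULES[base_type] = fn
        return fn
    return decorator

@register_activation_rule("torch")
def swiglu_rule_torch(mlp_module):
    # Placeholder: would return an MLP module whose forward uses a fused
    # SwiGLU kernel; here it simply passes the module through unchanged.
    return mlp_module

def build_activation_rule(base_type: str, mlp_module):
    """Dispatch on base_type, as the fused-lora rule does."""
    try:
        return ACTIVATION_KERNEL_RULES[base_type](mlp_module)
    except KeyError:
        raise NotImplementedError(f"no activation kernel rule for {base_type}")
```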

@fabianlim fabianlim requested a review from achew010 August 30, 2024 08:30
@fabianlim fabianlim marked this pull request as draft August 30, 2024 08:30
@achew010 achew010 force-pushed the foak-full branch 3 times, most recently from 72012d5 to e47d48a on September 6, 2024 04:40
@achew010 achew010 marked this pull request as ready for review September 6, 2024 04:43
@fabianlim (Contributor, Author) left a comment:

When you get the bfloat16 numbers, let's compare them with the float16 numbers to see if there are substantial changes.

Also, let's document in the FOAK readme the future items for kernels that are still missing for certain models:

| Model | norm | pos emb | cross-ent | fused_lora |
|---|---|---|---|---|
| LlamaForCausalLM | | | | |

@fabianlim (Contributor, Author) commented on Sep 6, 2024

@achew010 please also note that we currently do not support position ids with the RoPE kernels. We need to document the impact of this.

#33

I think there is no impact if it is padding-free, but we need to confirm.

fabianlim and others added 5 commits September 16, 2024 01:58
achew010 and others added 13 commits September 16, 2024 01:58
@fabianlim fabianlim force-pushed the foak-full branch 2 times, most recently from be39ac9 to 369f738 on September 16, 2024 04:59
@fabianlim (Contributor, Author) commented:
@achew010 there are some benchmarks that are still not updated, but I will merge this first and then we can address them in a later PR.

@fabianlim fabianlim merged commit 4e81c64 into main Sep 16, 2024
6 checks passed
@fabianlim fabianlim mentioned this pull request Sep 16, 2024
@fabianlim fabianlim deleted the foak-full branch October 11, 2024 00:21