Allow Kernels for Full FT and Non-Quantized PEFT #79

Merged
19 commits merged into main on Sep 16, 2024
Conversation

@fabianlim (Contributor) commented on Aug 30, 2024

Description

This PR:

  1. upgrades the framework to perform OR logic when activating plugins
  2. creates a FastKernelsAccelerationPlugin, an improved version of FastQuantizedPeftAccelerationPlugin
    • it can add kernels individually
    • it can be activated under a training stanza or a peft.quantized stanza (a configuration sketch follows this list)
  3. adds FOAK support to the Full-Finetuning and Standard PEFT benchmarks
  4. adds FOAK support for 1 additional model
    • GPTBigCode
      • Note that due to GPTBigCode architecture limitations, only FastCrossEntropyLoss is supported in this PR. Additional support will be tracked in [placeholder issue]
  5. fixes a ModelPatcher bug that caused multiple reloads of the same target path
    • This affected the proper patching of FastCrossEntropyLoss
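
For illustration only, here is a minimal sketch of the two activation paths for FastKernelsAccelerationPlugin and of the OR activation logic, written as Python dicts mirroring the YAML stanzas. The stanza and flag names (`fused_ops_and_kernels`, `fast_loss`, `fast_rms_layernorm`, `fast_rope`) and the `is_plugin_active` helper are assumptions made for this sketch, not the authoritative configuration schema of this PR.

```python
# Illustrative sketch only -- the stanza and flag names below are assumptions,
# not the authoritative configuration schema of this PR.

# Activation under a `training` stanza (full FT / standard PEFT):
training_stanza = {
    "training": {
        "fused_ops_and_kernels": {
            "fast_loss": True,            # FastCrossEntropyLoss
            "fast_rms_layernorm": True,   # FastRMSNorm
            "fast_rope": True,            # FastRoPE
        }
    }
}

# Activation under a `peft.quantized` stanza (quantized PEFT path):
quantized_peft_stanza = {
    "peft": {
        "quantized": {
            "fused_ops_and_kernels": {
                "fast_loss": True,
                "fast_rms_layernorm": True,
                "fast_rope": True,
            }
        }
    }
}

def is_plugin_active(config: dict, paths: list[str]) -> bool:
    """OR logic: the plugin activates if *any* of its registered
    configuration paths is present in the config tree."""
    def present(cfg: dict, dotted: str) -> bool:
        node = cfg
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                return False
            node = node[key]
        return True
    return any(present(config, p) for p in paths)

# The plugin registers both paths, so either stanza activates it.
PATHS = ["training.fused_ops_and_kernels", "peft.quantized.fused_ops_and_kernels"]
assert is_plugin_active(training_stanza, PATHS)
assert is_plugin_active(quantized_peft_stanza, PATHS)
```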

Improvements to Full Finetuning

7–10.5% improvement in full-finetuning throughput from the following kernels (FastCrossEntropyLoss, FastRMSNorm, FastRoPE):

| Framework | Model | num gpus | batch size | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|
| fullFT | Mistral7B | 1 | 4 | 2910 | base |
| foak-fullFT | Mistral7B | 1 | 4 | 3218 | 10.5 |
| PEFT | Mistral7B | 1 | 4 | 3345 | base |
| foak-PEFT | Mistral7B | 1 | 4 | 3797 | 13.5 |

| Framework | Model | num gpus | batch size | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|
| fullFT | Mistral7B | 2 | 4 | 2886 | base |
| foak-fullFT | Mistral7B | 2 | 4 | 3093 | 7 |
| PEFT | Mistral7B | 2 | 4 | 3227 | base |
| foak-PEFT | Mistral7B | 2 | 4 | 3620 | 12 |

Compatibility Matrix with Mixed Precision

| torch_dtype | Mixed Precision | Full-FT-FOAK | PEFT-FOAK | QPEFT-FOAK |
|---|---|---|---|---|
| FLOAT16 | - | ✗ Not Allowed | | |
| FLOAT16 | FP16 | ValueError: Attempting to unscale FP16 gradients (see here) | Compatible | Compatible |
| BFLOAT16 | - | | | |
| BFLOAT16 | BF16 | Compatible | Compatible | Less Performant |
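
As a concrete illustration of the FLOAT16 + FP16 row of the matrix, here is a minimal sketch (assuming a standard Hugging Face Trainer setup; the model name is only a placeholder and no training is launched) of the combination that raises the ValueError during full finetuning:

```python
# Minimal sketch of the problematic combination in the matrix above:
# model weights loaded in float16 *and* FP16 mixed precision enabled.
# The model name and trainer wiring are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,   # fp16 master weights
)

args = TrainingArguments(
    output_dir="out",
    fp16=True,                   # FP16 mixed precision -> GradScaler is used
    per_device_train_batch_size=4,
)
# Full finetuning with this pairing fails inside the optimizer step with
#   ValueError: Attempting to unscale FP16 gradients.
# because the AMP GradScaler refuses to unscale gradients that are themselves
# fp16. Loading the model in bfloat16 with bf16=True (last row of the matrix)
# avoids this.
```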

Regression Test for Loss, Memory, Throughput

We ran our alpaca benchmarks with most experiments in bfloat16 (except GPTQ-LoRA, which runs in float16; see issue). We see no significant regression in performance.

Note that an outlier in the comparison plots shows an anomalous memory increase in a standard full-FT experiment on Mistral7B with no accelerations installed. Since it does not point to any issue with the code in this PR, it is likely due to slight instability in that benchmarking run.

Bug Fix to Model Patcher

There is no significant change in FOAK performance from the fix for the improper patching of FastCrossEntropyLoss; however, a slight decrease in improvement is observed (consistent with issue 70) compared to the previous paddingfree+foak numbers.
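
To illustrate the nature of the fix (the class and method names below are hypothetical, not the actual ModelPatcher implementation), the bug amounts to reloading and re-patching the same target path more than once; a seen-set guard is one way to keep a later reload from clobbering an earlier patch:

```python
# Hypothetical sketch of guarding against repeated reloads of the same
# target path; names are illustrative, not the actual ModelPatcher API.
import importlib

class PatchTargetRegistry:
    def __init__(self):
        self._patched_paths: set[str] = set()

    def patch_once(self, module_path: str, attr: str, replacement) -> bool:
        """Patch `module_path.attr` with `replacement`, but only the first
        time this exact target path is seen. Returns True if patched."""
        target = f"{module_path}.{attr}"
        if target in self._patched_paths:
            # A second reload/patch of the same path would clobber an
            # earlier patch (e.g. FastCrossEntropyLoss); skip it instead.
            return False
        module = importlib.import_module(module_path)
        setattr(module, attr, replacement)
        self._patched_paths.add(target)
        return True
```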

FLAN (6000 samples) with PaddingFree

Before BugFix

| Framework | Model | num gpus | batch size | train_runtime (s) | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|---|
| BNB + foak | Mistral7B | 2 | 4 | 1068 | 1328 | base |
| BNB + foak + paddingfree | Mistral7B | 2 | 4 | 605 | 2400 | +43 |
| GPTQ-LoRA + foak | Mistral7B | 2 | 4 | 1034 | 1372 | base |
| GPTQ-LoRA + foak + paddingfree | Mistral7B | 2 | 4 | 587 | 2472 | +43 |

With BugFix

| Framework | Model | num gpus | batch size | train_runtime (s) | throughput (toks/s) | Improvement % |
|---|---|---|---|---|---|---|
| BNB + foak | Mistral7B | 2 | 4 | 1038 | 1368 | base |
| BNB + foak + paddingfree | Mistral7B | 2 | 4 | 674 | 2106 | +35 |
| GPTQ-LoRA + foak | Mistral7B | 2 | 4 | 1035 | 1372 | base |
| GPTQ-LoRA + foak + paddingfree | Mistral7B | 2 | 4 | 660 | 2160 | +36 |

Note:
Due to issues with FSDP-QLoRA in the latest transformers version (4.45.0dev) mentioned here, Granite with Fast Kernels will be addressed in a later PR.

TODO

  • add the activation (e.g. SwiGLU) kernels to FastKernelsAccelerationPlugin, following the pattern of building the fused-lora rule for a base_type (see the sketch after this list)
  • add chunked loss (optional); if not done, create an issue to track it
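
As a rough sketch of the first TODO item (purely illustrative; the registry, base_type values, and kernel callables below are hypothetical placeholders, not this repository's API), dispatching an activation-kernel rule on a base_type would mirror how the fused-lora rule is selected:

```python
# Hypothetical sketch of selecting an activation (e.g. SwiGLU) kernel rule
# per base_type, mirroring the fused-lora selection pattern; the registry
# and rule names below are placeholders, not the repository's real API.
from typing import Callable, Dict

ACTIVATION_KERNEL_RULES: Dict[str, Callable] = {}

def register_activation_rule(base_type: str):
    """Register a rule builder for a given base_type."""
    def decorator(fn: Callable) -> Callable:
        ACTIVATION_KERNEL_RULES[base_type] = fn
        return fn
    return decorator

@register_activation_rule("torch")
def swiglu_rule_torch(mlp_module):
    # Placeholder: would return an MLP module whose forward uses a fused
    # SwiGLU kernel; here it simply passes the module through unchanged.
    return mlp_module

def build_activation_rule(base_type: str, mlp_module):
    """Dispatch on base_type, as the fused-lora rule does."""
    try:
        return ACTIVATION_KERNEL_RULES[base_type](mlp_module)
    except KeyError:
        raise NotImplementedError(f"no activation kernel rule for {base_type}")
```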

@fabianlim fabianlim requested a review from achew010 August 30, 2024 08:30
@fabianlim fabianlim marked this pull request as draft August 30, 2024 08:30
@achew010 achew010 force-pushed the foak-full branch 3 times, most recently from 72012d5 to e47d48a on September 6, 2024 04:40
@achew010 achew010 marked this pull request as ready for review September 6, 2024 04:43
@fabianlim (Contributor, Author) left a comment:

When you get the bfloat16 numbers, let's compare them with the float16 numbers to see if there are substantial changes.

Also, let's document in the FOAK readme the future items for kernels that are still missing for certain models:

| Model | norm | pos emb | cross-ent | fused_lora |
|---|---|---|---|---|
| LlamaForCausalLM | | | | |

@fabianlim (Contributor, Author) commented on Sep 6, 2024

@achew010 please also note that we currently do not support position ids with the RoPE kernels. We need to document the impact of this.

#33

I think there is no impact if it is padding-free, but we need to confirm.

fabianlim and others added 5 commits September 16, 2024 01:58
achew010 and others added 13 commits September 16, 2024 01:58
@fabianlim fabianlim force-pushed the foak-full branch 2 times, most recently from be39ac9 to 369f738 on September 16, 2024 04:59
@fabianlim (Contributor, Author) commented:
@achew010 there are some benchmarks that are still not updated, but I will merge this first and then we can address them in a later PR.

@fabianlim fabianlim merged commit 4e81c64 into main Sep 16, 2024
6 checks passed
@fabianlim fabianlim mentioned this pull request Sep 16, 2024
@fabianlim fabianlim deleted the foak-full branch October 11, 2024 00:21