
feat: Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin and FastKernels #280

Merged

Conversation


@achew010 achew010 commented Aug 2, 2024

Description of the change

This PR adds two dataclass arguments to enable padding free and multipack for sft_trainer.py, via the new fms-acceleration attention-and-distributed-packing plugin, and extends the existing --fast_kernels dataclass to support optimized full finetuning:

  • --padding_free: a technique that processes multiple examples in a single batch without adding padding tokens that waste compute.
  • --multipack: a technique for multi-GPU training that balances the number of tokens processed on each device, minimizing waiting time.
  • --fast_kernels: previously limited to QPEFT (it raised an error if not activated with --fast_lora); now also supports optimized full finetuning and standard LoRA.

These are extremely effective methods for improving training throughput:

  • See the benchmarks section below. Currently, padding free is used either alone or together with multipack; we do not currently support using multipack alone.
  • Padding free and multipack were used in the InstructLab (ILAB) work; see the section below on the early version of this plugin. For general use, this plugin greatly simplifies the user interface.

NOTE: adhering to the design of fms-acceleration, the new plugin is optional and installed separately.
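
For illustration, installing the plugin and launching a run with the new flags might look like the sketch below. The install commands and the `huggingface` method value are assumptions based on the usual fms-acceleration plugin workflow (they are not confirmed in this PR), so check the repository README for the exact syntax.

```sh
# Assumed install flow for the separately installed plugin (verify against the README)
pip install "fms-hf-tuning[fms-accel]"
python -m fms_acceleration.cli install fms_acceleration_aadp

# Illustrative launch enabling padding free, multipack and fast kernels
python tuning/sft_trainer.py \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --training_data_path <path-to-data> \
  --use_flash_attn True \
  --padding_free huggingface \
  --multipack 16 \
  --fast_kernels True True True
```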

Notes on Padding Free

Notes on Multipack

  • Works only for multi-GPU training.
  • Currently includes only the version of multipack optimized for attention implementations like flash-attn.

Notes on FastKernels

  • Currently supports FastCrossEntropyLoss, FastRoPE and FastRMSLayerNorm; SwiGLU and Liger kernels (e.g. FusedCrossEntropyLoss) will be added in the future.
  • Works for full finetuning, LoRA and QPEFT:
    • pass --fast_kernels True True True for full finetuning/LoRA runs
    • pass --fast_kernels True True True --auto_gptq triton_v2 --fused_lora auto_gptq True for GPTQ-LoRA
    • pass --fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True for QLoRA
  • FastRoPE currently does not accept positional_ids; this issue will be addressed in the future.

Benchmarks

PaddingFree and Multipack Benchmarks for Mistral 7B

Notes:

  • Shown below are the runtimes for training on a subset of 6000 FLAN samples.
  • Tested per-device batch sizes of 4 and 8, with the number of GPUs varying from 2 to 8.
  • Verified that an untokenized dataset produces the same improvements for padding free and multipack.

Per Device Batch Size 4

| Framework Config | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup |
| --- | --- | --- | --- | --- |
| full-FT | 2 | 4 | 1537 | baseline |
| padding-free | 2 | 4 | 859 | 1.79x |
| padding-free + multipack | 2 | 4 | 751 | 2.05x |
| full-FT | 4 | 4 | 932 | baseline |
| padding-free | 4 | 4 | 483 | 1.93x |
| padding-free + multipack | 4 | 4 | 342 | 2.75x |
| full-FT | 8 | 4 | 551 | baseline |
| padding-free | 8 | 4 | 275 | 2.00x |
| padding-free + multipack | 8 | 4 | 163 | 3.38x |

Per Device Batch Size 8

| Framework Config | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup |
| --- | --- | --- | --- | --- |
| full-FT | 2 | 8 | 1722 | baseline |
| padding-free | 2 | 8 | 678 | 2.54x |
| padding-free + multipack | 2 | 8 | 603 | 2.86x |
| full-FT | 4 | 8 | 1025 | baseline |
| padding-free | 4 | 8 | 380 | 2.70x |
| padding-free + multipack | 4 | 8 | 289 | 3.55x |
| full-FT | 8 | 8 | 611 | baseline |
| padding-free | 8 | 8 | 215 | 2.84x |
| padding-free + multipack | 8 | 8 | 140 | 4.36x |

Verified Similar Improvements for Untokenized Dataset

| Framework Config | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup |
| --- | --- | --- | --- | --- |
| full-FT | 2 | 4 | 1516 | baseline |
| padding-free | 2 | 4 | 848 | 1.78x |
| padding-free + multipack | 2 | 4 | 747 | 2.02x |

Full Finetuning Benchmarks for Mistral 7B

Early Version Of This Plugin

We have an unofficial version with more features than the present release, which @kmehant is currently using for the ILAB work. In addition to padding free and multipack, it also has the two additional plugins below:

To use the early version, a quick hack of sft_trainer with a pretokenized dataset + custom tokenizer is available at https://github.com/fabianlim/fms-hf-tuning/tree/attn-plugin . This will be superseded by this PR in the near future.

Use it with these command line arguments:

    --padding_free huggingface-injected \
    --loss_across_gpus mean token \

How to verify the PR

Additional checks/tests were added to:

  1. Ensure parsing of --padding_free and --multipack is correct in test_dataclass_parse_successfully
  2. Ensure wrong arguments to --padding_free are caught in test_dataclass_will_fail_to_accept_illegal_args
  3. Ensure the plugin is successfully instantiated from the dataclass in test_framework_initialize_and_trains_with_aadp
  4. Ensure --padding_free must be used with flash-attn, otherwise an error is raised
  5. Ensure --multipack must be used with --padding_free, otherwise an error is raised
  6. Ensure --packing True with --padding_free raises an error (a rough sketch of this check's shape follows after this list)
  7. Ensure --fast_kernels works with full finetuning
  8. Ensure that --fast_lora called without either --auto_gptq or --bitsandbytes raises an error
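
For illustration only, a check like item 6 typically asserts that the conflicting combination raises an error; the helper below is a hypothetical sketch, not the repository's actual test code under tests/acceleration/.

```python
# Hypothetical sketch of the shape of check 6 (names and wiring are illustrative).
import pytest


def validate_args_sketch(packing: bool, padding_free: bool) -> None:
    # Packing and padding free both change how examples are collated, so the
    # combination is rejected up front rather than failing later in training.
    if packing and padding_free:
        raise ValueError("--packing True cannot be used with --padding_free")


def test_packing_with_padding_free_raises():
    with pytest.raises(ValueError):
        validate_args_sketch(packing=True, padding_free=True)
```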

Ran the full suite of acceleration checks to verify that all fms-acceleration unit tests pass:

pytest tests/acceleration/


Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@achew010 achew010 force-pushed the args-for-padding-free-plugin branch 2 times, most recently from e350ae7 to 193ab9d Compare August 2, 2024 06:20
@achew010 achew010 marked this pull request as ready for review August 2, 2024 06:25
@achew010 achew010 force-pushed the args-for-padding-free-plugin branch 2 times, most recently from 2310f32 to f9d046f Compare August 6, 2024 05:38

@kmehant kmehant left a comment


@achew010 let's gracefully handle the case when use_flash_attn is set to False and padding free is being used.

use_flash_attn: bool = field(
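
For illustration, such a guard could look like the sketch below (the variable names are assumptions; the error string matches the one quoted later in this review):

```python
# Illustrative guard only; not necessarily the exact code merged in this PR.
if padding_free_config is not None and not model_args.use_flash_attn:
    raise ValueError(
        "ensure `use_flash_attn = True` to use padding-free flash attention"
    )
```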

@achew010 achew010 marked this pull request as draft August 6, 2024 13:55
@fabianlim fabianlim changed the title Add DataClass Arguments to Activate Padding-Free Plugin Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin Aug 28, 2024
@achew010 achew010 force-pushed the args-for-padding-free-plugin branch 2 times, most recently from 29362a4 to 00d17e7 Compare August 29, 2024 09:41
@fabianlim fabianlim force-pushed the args-for-padding-free-plugin branch 6 times, most recently from 3b20f22 to 53d1a8c Compare August 29, 2024 10:59
@fabianlim fabianlim marked this pull request as ready for review August 29, 2024 10:59
@achew010 achew010 force-pushed the args-for-padding-free-plugin branch from 8f1c9ea to b15a9c7 Compare September 4, 2024 09:04
@achew010 achew010 force-pushed the args-for-padding-free-plugin branch from 3cccc41 to 46d587f Compare September 11, 2024 11:33
@fabianlim fabianlim changed the title Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin and FastKernels Sep 12, 2024
@kmehant kmehant requested review from kmehant and removed request for kmehant September 16, 2024 06:48

@anhuong anhuong left a comment


Thank you for the excellent change and description! Had a few questions...also am wondering, should the plugin be installed by default so users can utilize these new parameters? Looks like a very useful addition.

Also please add some of the great description from this PR into the readme.

@dataclass
class MultiPack:
    num_processes: int = 16

Is there any guidance on what this number should be set to?


This number is a reasonable one for most datasets of moderate size (e.g. under a million examples). The packing algorithm is relatively fast, but in the event the dataset is too large, our plugin will raise a warning
https://github.com/foundation-model-stack/fms-acceleration/blob/4e81c64453ec5d2b06a8d14a2a72374cc736098a/plugins/attention-and-distributed-packing/src/fms_acceleration_aadp/framework_plugin_multipack.py#L117-L123
that advises the user to increase this number if the process times out.

Comment on lines 199 to +206
      framework = AccelerationFrameworkConfig.from_dataclasses(
-         quantized_lora_config, fusedops_kernels_config
+         quantized_lora_config,
+         fusedops_kernels_config,
+         attention_and_distributed_packing_config,

Just for my understanding, so these are all model loader augmentors that change how the model is loaded based on the acceleration framework configurations? Although padding free and multipack are both dataset augmentors? How does setting the acceleration framework here affect the dataset loading?


You are right that padding free and multipack affect the data loading, but more specifically:

  • padding free only requires modifications to data collation;
  • multipack requires modifications to the dataloader.

Both are handled by our AccelerationPatcher, a component we wrote to allow controlled replacement of the data collator and data loader.
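
To build intuition for why padding free only touches collation: a collator can concatenate the batch's examples into one flat sequence and use restarting position ids to mark example boundaries, so no padding tokens are needed. The sketch below is illustrative only and is not the plugin's (or Hugging Face's) actual collator:

```python
# Conceptual sketch of padding-free collation (illustrative, not the plugin code).
import torch


def padding_free_collate(examples):
    # examples: list of dicts with variable-length "input_ids" and "labels"
    input_ids = torch.cat([torch.tensor(ex["input_ids"]) for ex in examples])
    labels = torch.cat([torch.tensor(ex["labels"]) for ex in examples])
    # Position ids restart at 0 for each example, which lets a flash-attn style
    # kernel treat each example independently inside the flattened sequence.
    position_ids = torch.cat(
        [torch.arange(len(ex["input_ids"])) for ex in examples]
    )
    # Real implementations also mask boundary labels (e.g. set the first label of
    # each example to -100) so loss does not leak across examples.
    return {
        "input_ids": input_ids.unsqueeze(0),  # one flattened "batch" of size 1
        "labels": labels.unsqueeze(0),
        "position_ids": position_ids.unsqueeze(0),
    }
```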


thank you for the explanation

"ensure `use_flash_attn = True` to use padding-free flash attention"
)

if train_args.packing is True:

nit: can simplify to if train_args.packing

@fabianlim

@anhuong thanks for the review. For making this default, I drafted out various possibilities in this issue here #334. We can discuss offline.

@kmehant kmehant self-requested a review September 18, 2024 10:48

@anhuong anhuong left a comment


Small additional comments

Comment on lines +467 to +636
* `fused_ops_and_kernels` works for full-finetuning, LoRA, QLoRA and GPTQ-LORA,
- pass `--fast_kernels True True True` for full finetuning/LoRA
- pass `--fast_kernels True True True --auto_gptq triton_v2 --fused_lora auto_gptq True` for GPTQ-LoRA
- pass `--fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True` for QLoRA

I'm wondering, for fast kernels, if there is a better way to understand what is being set to true.
--fast_kernels True True True feels unclear about what is being set to True. Could the user instead pass in --fast_kernels <types of kernel to use>, like --fast_kernels FastCrossEntropyLoss FastRoPE FastRMSLayerNorm? If they only want one, would they currently have to set --fast_kernels False True False, whereas instead setting --fast_kernels FastRoPE would be easier?


@fabianlim fabianlim Sep 19, 2024


Yes, that is correct, but unfortunately it would be more complicated than the current implementation.

  • Consider the plugin dataclass (e.g., FusedOpsAndKernelsConfig), see here.
  • The plugin dataclass is a nested dataclass; it has dataclasses as members.
  • Each member dataclass (e.g., FastKernelsConfig) needs to be parsable by HfArgumentParser, which does not actually support parsing a dataclass type.
  • Hence, we made it possible with our parsable_dataclass decorator, which
    • masquerades the member dataclass as a List, since HfArgumentParser does support lists of a uniform type;
    • allows our member dataclass to contain mixed types via the casting logic implemented in parsable_dataclass through EnsureTypes.

All this logic is needed just to parse --fast_kernels False True False into the dataclass FastKernelsConfig(fast_loss=False, fast_rms_layernorm=True, fast_rope_embeddings=False) (a rough sketch of this flattening follows after this comment).

To support parsing of the kind --fast_kernels FastRoPE, we would need to handle:

  • different types: "FastRoPE" clearly maps to a boolean, but we also need to handle str inputs, float inputs, etc., where it would need to be a key=value pair;
  • different orders: we need to be able to parse --dataclass_key key_a=a key_b and --dataclass_key key_b key_a=a equivalently.
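
As a rough illustration of the list-based parsing described above (simplified, illustrative names; the real implementation lives in fms-acceleration), the member dataclass's switches can be exposed to HfArgumentParser as a uniform list and cast back afterwards:

```python
# Simplified sketch of parsing "--fast_kernels True True True"; not the actual
# parsable_dataclass/EnsureTypes implementation.
from dataclasses import dataclass, field
from typing import List

from transformers import HfArgumentParser


@dataclass
class FastKernelsArgsSketch:
    # HfArgumentParser cannot parse a nested dataclass field, but it can parse a
    # list of a uniform type, so the three switches are taken as a list of strings.
    fast_kernels: List[str] = field(default_factory=lambda: ["False", "False", "False"])


@dataclass
class FastKernelsConfigSketch:
    fast_loss: bool = False
    fast_rms_layernorm: bool = False
    fast_rope_embeddings: bool = False


def to_config(args: FastKernelsArgsSketch) -> FastKernelsConfigSketch:
    # Cast the uniform string list back into the mixed-type member dataclass.
    return FastKernelsConfigSketch(*(v.lower() == "true" for v in args.fast_kernels))


parser = HfArgumentParser(FastKernelsArgsSketch)
(args,) = parser.parse_args_into_dataclasses(["--fast_kernels", "True", "True", "True"])
print(to_config(args))  # FastKernelsConfigSketch(fast_loss=True, ...)
```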


I suggest we merge this PR first; we can come back to this when we do the fused cross entropy. BTW I left a comment for @achew010 to help upgrade FusedOpsAndKernelsConfig to non-experimental status in this PR by deleting these experimental=True entries, see here.


Thank you for the details, I agree that getting this merged and thinking about improvements for fast_kernels later makes sense. Is just the FusedOpsAndKernelsConfig ready to move out of experimental, or can this also be done for PaddingFree and Multipack?


@anhuong

anhuong commented Sep 18, 2024

Also, we added new automation that ensures PRs follow conventional commits, which you can see is failing: https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/10920573842/job/30310716778?pr=280. Please address the change.

@kmehant kmehant changed the title Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin and FastKernels feat: Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin and FastKernels Sep 18, 2024
@github-actions github-actions bot added the feat label Sep 18, 2024
@anhuong

anhuong commented Sep 19, 2024

Please update the branch with the new changes from main, and once the experimental fields are updated this is good to merge from my side 👍

@anhuong

anhuong commented Sep 19, 2024

Note @kmehant I think since you requested changes, an approval is needed from your side as well before this can merge

achew010 and others added 14 commits September 20, 2024 02:30
@achew010 achew010 force-pushed the args-for-padding-free-plugin branch from f38c827 to b78936e Compare September 20, 2024 02:31

@kmehant kmehant left a comment


@anhuong thanks for letting me know. It's really annoying that I am not able to dismiss my review in some way so that I do not stand as a blocker :( forcing me to push an approval.

Nonetheless, I have used most of these features as part of iLab and can undoubtedly vouch for the changes. Thanks.


@anhuong anhuong left a comment


We can also mark PaddingFree and MultiPack as not experimental, but LGTM.

@anhuong anhuong merged commit 926fb9b into foundation-model-stack:main Sep 20, 2024
8 checks passed