
feat: dataclass args for accelerated MoE tuning #390

Open
wants to merge 19 commits into base: main

Conversation


@willmj willmj commented Nov 15, 2024

Description of the change

This PR adds one dataclass argument to enable accelerated MoE for sft_trainer.py via the new fms-acceleration accelerated-moe plugin, and allows accelerated MoE full fine-tuning with the --fast_moe flag. --fast_moe enables a technique that trains the experts of Mixture of Experts (MoE) models in parallel instead of sequentially (an example invocation follows the benchmark table below).
With this flag, we expect a major speedup in train time and a decrease in memory usage on Mixture of Experts models.

| Framework Config | EP Degree (parameter) | Model | GPUs | Train Runtime (s) | Speedup | Memory Usage | Memory Savings |
|---|---|---|---|---|---|---|---|
| none | N/A | granite 3b a800 | 1 | 2371.93 | base | 71199 | base |
| Scatter MoE | 1 | granite 3b a800 | 1 | 742.739 | 3.19 | 71187 | 1.0 |
| Scatter MoE + Padding Free | 1 | granite 3b a800 | 1 | 631.976 | 3.75 | 48401 | 0.68 |
| Scatter MoE + Padding Free + foak | 1 | granite 3b a800 | 1 | 615.453 | 3.85 | 42651 | 0.6 |
| none | N/A | mixtral 8x7b | 8 | 4180.95 | base | 65607 | base |
| Scatter MoE | 8 | mixtral 8x7b | 8 | 1071.2 | 3.9 | 52004.8 | 0.79 |
| Scatter MoE + Padding Free + foak | 8 | mixtral 8x7b | 8 | 1043.67 | 4.01 | 51961.2 | 0.79 |
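
For reference, a minimal sketch of enabling the flag from the command line, assuming sft_trainer.py is launched directly with CLI flags matching the JSON keys used later in this PR; the paths are placeholders and only --fast_moe is new:

    # Sketch only: launcher and paths are illustrative, other arguments mirror the configs below.
    python sft_trainer.py \
      --model_name_or_path <moe-model-path> \
      --training_data_path <train-data.json> \
      --output_dir <output-dir> \
      --fast_moe 1

The value passed to --fast_moe corresponds to the "EP Degree (parameter)" column in the table above.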

Related issue number

How to verify the PR

This PR is a work in progress; it requires more testing and the official release of fms-acceleration-moe.

  • To verify, run a tuning job with fast_moe.
  • Run a tuning job with other plugins added on top of fast_moe.
  • Ensure that incorrect parameters result in failures.
  • Ensure that non-MoE models cannot be trained with this plugin set (a rough sketch of these checks follows this list).
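
A rough sketch of those verification runs; the launcher, model paths, and the other plugins' flags are placeholders and assumptions, not part of this PR:

    # 1. Tuning job with fast_moe alone.
    python sft_trainer.py --model_name_or_path <moe-model> \
        --training_data_path <train.json> --output_dir <out-fast> --fast_moe 1

    # 2. fast_moe stacked with another fms-acceleration plugin (e.g. the
    #    padding-free plugin from the benchmark table; its flag is not shown here).
    python sft_trainer.py --model_name_or_path <moe-model> \
        --training_data_path <train.json> --output_dir <out-stacked> \
        --fast_moe 1 <other-plugin-flags>

    # 3. An invalid value should fail fast rather than silently fall back.
    python sft_trainer.py --model_name_or_path <moe-model> \
        --training_data_path <train.json> --output_dir <out-bad> --fast_moe not_an_int

    # 4. A dense (non-MoE) model with --fast_moe set should be rejected.
    python sft_trainer.py --model_name_or_path <dense-model> \
        --training_data_path <train.json> --output_dir <out-dense> --fast_moe 1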

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Nov 15, 2024

willmj commented Nov 21, 2024

Tested the new flag on Granite 3.0 3B MoE; inference testing is up next.

Regular MoE tuning

Tested this branch without fast_moe

      {
          "model_name_or_path": "/ibm_dmf_lakehouse/models/base_training/shared/granite-3.0-3b-a800m-base/r240924a",
          "training_data_path": "/testing/tuning/input/cc_tone_sft_format_1000_train.json",
          "output_dir": "/testing/tuning/output/granite-3b-moe/ft/20241120_1014-tone",
          "save_model_dir": "/testing/tuning/output/granite-3b-moe/ft/20241120_1014-tone/save_model",
          "num_train_epochs": 10.0,
          "per_device_train_batch_size": 2,
          "gradient_accumulation_steps": 1,
          "learning_rate": 1e-5,
          "response_template": "\n### Response:",
          "dataset_text_field": "output"
      }

Training logs:

{'loss': 0.8331, 'grad_norm': 364.0, 'learning_rate': 9e-06, 'epoch': 1.0}
{'loss': 0.4259, 'grad_norm': 0.10986328125, 'learning_rate': 8.000000000000001e-06, 'epoch': 2.0}
{'loss': 0.1667, 'grad_norm': 25.25, 'learning_rate': 7e-06, 'epoch': 3.0}
{'loss': 0.0304, 'grad_norm': 21.625, 'learning_rate': 6e-06, 'epoch': 4.0}
{'loss': 0.0023, 'grad_norm': 0.005828857421875, 'learning_rate': 5e-06, 'epoch': 5.0}
{'loss': 0.0004, 'grad_norm': 0.005157470703125, 'learning_rate': 4.000000000000001e-06, 'epoch': 6.0}
{'loss': 0.0001, 'grad_norm': 0.0038604736328125, 'learning_rate': 3e-06, 'epoch': 7.0}
{'loss': 0.0001, 'grad_norm': 0.000469207763671875, 'learning_rate': 2.0000000000000003e-06, 'epoch': 8.0}
{'loss': 0.0001, 'grad_norm': 0.004547119140625, 'learning_rate': 1.0000000000000002e-06, 'epoch': 9.0}
{'loss': 0.0001, 'grad_norm': 0.01324462890625, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 5311.528, 'train_samples_per_second': 1.883, 'train_steps_per_second': 0.941, 'train_loss': 0.1459229184500873, 'epoch': 10.0}

Location: /testing/tuning/output/granite-3b-moe/ft/20241121_1314-tone/save_model

Fast MoE

And with fast_moe:

      {
          "model_name_or_path": "/ibm_dmf_lakehouse/models/base_training/shared/granite-3.0-3b-a800m-base/r240924a",
          "training_data_path": "/testing/tuning/input/cc_tone_sft_format_1000_train.json",
          "output_dir": "/testing/tuning/output/granite-3b-moe/ft/20241121_1014-tone-FAST",
          "save_model_dir": "/testing/tuning/output/granite-3b-moe/ft/20241121_1014-tone-FAST/save_model",
          "num_train_epochs": 10.0,
          "per_device_train_batch_size": 2,
          "gradient_accumulation_steps": 1,
          "learning_rate": 1e-5,
          "response_template": "\n### Response:",
          "dataset_text_field": "output",
          "fast_moe": 1
      }

Training logs:

{'loss': 0.4279, 'grad_norm': 0.076171875, 'learning_rate': 8.000000000000001e-06, 'epoch': 2.0}
{'loss': 0.1377, 'grad_norm': 3.78125, 'learning_rate': 7e-06, 'epoch': 3.0}
{'loss': 0.0384, 'grad_norm': 0.81640625, 'learning_rate': 6e-06, 'epoch': 4.0}
{'loss': 0.0031, 'grad_norm': 0.003997802734375, 'learning_rate': 5e-06, 'epoch': 5.0}
{'loss': 0.0006, 'grad_norm': 0.002044677734375, 'learning_rate': 4.000000000000001e-06, 'epoch': 6.0}
{'loss': 0.0002, 'grad_norm': 0.0032196044921875, 'learning_rate': 3e-06, 'epoch': 7.0}
{'loss': 0.0001, 'grad_norm': 0.002288818359375, 'learning_rate': 2.0000000000000003e-06, 'epoch': 8.0}
{'loss': 0.0001, 'grad_norm': 0.0087890625, 'learning_rate': 1.0000000000000002e-06, 'epoch': 9.0}
{'loss': 0.0001, 'grad_norm': 0.0115966796875, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 2140.2943, 'train_samples_per_second': 4.672, 'train_steps_per_second': 2.336, 'train_loss': 0.14420232288464904, 'epoch': 10.0}

Location: /testing/tuning/output/granite-3b-moe/ft/20241121_1315-tone-FAST/save_model

Results

We see a 2.48x speedup in train runtime with fast_moe enabled.
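
The figure follows directly from the train_runtime values in the two logs above:

    python -c "print(round(5311.528 / 2140.2943, 2))"   # -> 2.48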


fabianlim commented Nov 22, 2024

@willmj In the original PR we reported benchmarks where the batch sizes were different, but the numbers you report here are in that ballpark.

c.f. the numbers in the benchmark table from the original PR.


willmj commented Dec 9, 2024

After running checkpoint utils on the branch Fabian created for safetensors, vLLM inference ran as expected:

% grpcurl -plaintext -proto ./proto/generation.proto -d "{\"params\":{\"method\":\"GREEDY\", \"stopping\": {\"max_new_tokens\": 128}}, \"requests\": [{\"text\":\"### Text: @sho_help @showtime your arrive is terrible streaming is stop and start every couple mins. Get it together it's xmas\n\n### Label:\"}]}" localhost:8033 fmaas.GenerationService/Generate
{
  "responses": [
    {
      "generatedTokenCount": 128,
      "text": " sad, frustrated, anxious, anxious, frustrated, sad, anxious, anxious, frustrated, sad, frustrated, anxious, frustrated, sad, frustrated, anxious, sad, frustrated, anxious, sad, frustrated, anxious, sad, frustrated, anxious, sad, frustrated, anxious, sad, frustrated, anxious, sad, frustrated, anxious, sad, frustrated, anxious, sad,",
      "inputTokenCount": 38,
      "stopReason": "MAX_TOKENS"
    }
  ]
}

Post-processing completed with this script (thanks again Fabian!):

import os, shutil, json

from fms_acceleration_moe.utils.checkpoint_utils import (
    get_state_dict_from_safe_checkpoint,
    recover_original_state_dict_from_checkpoint,
    save_single_safetensor,
)
from transformers import AutoModelForCausalLM
from transformers.utils import CONFIG_NAME

checkpoint_dir = "<scattermoe-checkpoint-dir>"
output_dir = "<output-dir>"
pretrained_model_name_or_path = "<original-model-dir>"

# Copy the model config next to the converted weights; if no base model path
# was given, recover it from the config's _name_or_path field.
config_file = os.path.join(checkpoint_dir, CONFIG_NAME)
target_config_file = os.path.join(output_dir, CONFIG_NAME)
if os.path.exists(config_file):
    shutil.copyfile(config_file, target_config_file)

    if not pretrained_model_name_or_path:
        with open(target_config_file) as f:
            pretrained_model_name_or_path = json.load(f).get("_name_or_path")

# Load the ScatterMoE-format checkpoint and map it back to the original
# (standard Hugging Face) state-dict layout.
sd = get_state_dict_from_safe_checkpoint(checkpoint_dir)
sd = recover_original_state_dict_from_checkpoint(sd, pretrained_model_name_or_path)

# Write the recovered state dict out as a single safetensors file.
save_single_safetensor(
    {k: v.contiguous() for k, v in sd.items()},
    output_dir,
    metadata={"format": "pt"},
)

# Test that the converted state dict loads with vanilla transformers.
model = AutoModelForCausalLM.from_pretrained(output_dir)

FastMOE model saved in: /testing/tuning/output/granite-3b-moe/ft/20241121_1315-tone-FAST/save_model
Reconstructed SD model saved in: /testing/tuning/output/granite-3b-moe/ft/20241121_1315-tone-FAST/standard-sd

@willmj willmj marked this pull request as ready for review December 9, 2024 18:49
@willmj willmj requested a review from kmehant as a code owner December 9, 2024 18:49
@fabianlim fabianlim changed the title feat: [WIP] dataclass args for accelerated MoE tuning feat: dataclass args for accelerated MoE tuning Dec 10, 2024