
fix: remove fire ported from Hari's PR #303 (#324)

Conversation

@HarikrishnanBalagopal (Contributor) commented Aug 30, 2024

Description of the change

Remove the second CLI parser (fire): it is unnecessary and throws errors when training with LoRA (for example, when passing target_modules on the command line).
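
For context, the parsing this change converges on looks roughly like the sketch below (an illustration, not the PR's actual diff; the LoraArguments dataclass and its defaults are assumptions patterned on the CLI flags used later in this thread):

# A minimal sketch: parse list-valued LoRA flags like --target_modules with a
# single transformers.HfArgumentParser instead of a second fire-based parser.
from dataclasses import dataclass, field
from typing import List, Optional

from transformers import HfArgumentParser

@dataclass
class LoraArguments:
    # Illustrative LoRA config; field names mirror the CLI flags used below.
    r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    target_modules: Optional[List[str]] = field(default=None)

parser = HfArgumentParser(LoraArguments)
(lora_args,) = parser.parse_args_into_dataclasses(
    args=["--target_modules", "v_proj", "c_proj", "--lora_dropout", "0.05"]
)
print(lora_args.target_modules)  # ['v_proj', 'c_proj']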

Related issue number

Fixes #47

How to verify the PR

make test

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass.

@HarikrishnanBalagopal (Contributor, Author) commented Aug 31, 2024

The errors are completely unrelated to the changes in the PR.

https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/10637170607/job/29490507719?pr=324#step:4:1329

E               ValueError: You passed `packing=False` to the SFTTrainer/SFTConfig, but you didn't pass a `dataset_text_field` or `formatting_func` argument.

It seems to be coming from this test code:

# below args not needed for pretokenized data
data_formatting_args.data_formatter_template = None
data_formatting_args.dataset_text_field = None
data_formatting_args.response_template = None
# update the training data path to tokenized data
data_formatting_args.training_data_path = dataset_path
train_args = copy.deepcopy(TRAIN_ARGS)
train_args.output_dir = tempdir
sft_trainer.train(MODEL_ARGS, data_formatting_args, train_args)

https://github.com/huggingface/trl/blob/d57e4b726561e5ae58fdc335f34029052944a4a3/trl/trainer/sft_trainer.py#L351
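
For reference, the trl check at that line boils down to roughly this (a paraphrase for illustration, not trl's exact code):

# Rough paraphrase of the SFTTrainer validation that raises the error above:
# with packing disabled, SFTTrainer needs either a dataset_text_field or a
# formatting_func to turn examples into text, and the pretokenized-data test
# sets both to None, so the check fires.
def check_sft_text_args(packing, dataset_text_field, formatting_func):
    if not packing and dataset_text_field is None and formatting_func is None:
        raise ValueError(
            "You passed `packing=False` to the SFTTrainer/SFTConfig, but you "
            "didn't pass a `dataset_text_field` or `formatting_func` argument."
        )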

The same error appears in other PR builds as well: https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/10637170607/job/29490507719?pr=324#step:4:1329

@willmj (Collaborator) left a comment

LGTM!

@ashokponkumar (Collaborator) commented

@anhuong can we merge this when you are comfortable with it?

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>
@anhuong (Collaborator) commented Sep 5, 2024

@HarikrishnanBalagopal what testing did you do for this change? We want to ensure that users can still run training from the CLI, with a JSON config, and by importing the python module. The unit tests cover the python-module and JSON-config use cases, so testing the CLI piece should be sufficient. An example of running from the CLI with target_modules and not getting the error would be great.
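
(For reference, the python-module path the unit tests exercise looks roughly like this; the tuning.config.configs import path and the config class names are assumptions patterned on the test snippet quoted above, not verified against the repo.)

# Rough sketch of the importable-module use case covered by the unit tests;
# module paths and config class names here are assumptions.
from tuning import sft_trainer
from tuning.config import configs

model_args = configs.ModelArguments(model_name_or_path="Maykeye/TinyLLama-v0")
data_args = configs.DataArguments(
    training_data_path="tests/data/twitter_complaints_small.json",
    dataset_text_field="output",
    response_template="\n### Label:",
)
train_args = configs.TrainingArguments(
    output_dir="outputs/lora-tuning", num_train_epochs=5
)
sft_trainer.train(model_args, data_args, train_args)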

@willmj (Collaborator) commented Sep 9, 2024

I implemented these changes on my machine and ran the following.

Running LoRA:

% python tuning/sft_trainer.py \
--model_name_or_path Maykeye/TinyLLama-v0 \
--training_data_path tests/data/twitter_complaints_small.json \
--output_dir outputs/lora-tuning \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--response_template "\n### Label:" \
--dataset_text_field "output" \
--use_flash_attn false \
--torch_dtype "float32" \
--peft_method "lora" \
--r 8 \
--lora_dropout 0.05 \
--lora_alpha 16  \
--target_modules "v_proj" "c_proj"

Log:

{'loss': 7.9407, 'grad_norm': 4.002565860748291, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.0}
 20%|█████████                                    | 1/5 [00:00<00:02,  1.74it/s]
{'loss': 3.9542, 'grad_norm': 2.8579704761505127, 'learning_rate': 4.000000000000001e-06, 'epoch': 2.0}
 60%|███████████████████████████                  | 3/5 [00:01<00:00,  3.46it/s]
 80%|████████████████████████████████████         | 4/5 [00:01<00:00,  2.96it/s]
{'loss': 3.8709, 'grad_norm': 1.6323113441467285, 'learning_rate': 0.0, 'epoch': 3.0}
100%|█████████████████████████████████████████████| 5/5 [00:01<00:00,  2.96it/s]
{'train_runtime': 1.7348, 'train_samples_per_second': 28.822, 'train_steps_per_second': 2.882, 'train_loss': 4.7181743621826175, 'epoch': 3.0}
100%|█████████████████████████████████████████████| 5/5 [00:01<00:00,  2.88it/s]

Running fine tuning:

% python tuning/sft_trainer.py \
--model_name_or_path Maykeye/TinyLLama-v0 \
--training_data_path tests/data/twitter_complaints_small.jsonl \
--output_dir outputs/full-tuning \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--response_template "\n### Label:" \
--dataset_text_field "output" \
--use_flash_attn false \
--torch_dtype "float32"

Log:

{'loss': 7.9407, 'grad_norm': 23.814661026000977, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.0}
{'loss': 3.8985, 'grad_norm': 17.079011917114258, 'learning_rate': 4.000000000000001e-06, 'epoch': 2.0}
{'loss': 3.7799, 'grad_norm': 9.279784202575684, 'learning_rate': 0.0, 'epoch': 3.0}
{'train_runtime': 1.6872, 'train_samples_per_second': 29.635, 'train_steps_per_second': 2.963, 'train_loss': 4.659521102905273, 'epoch': 3.0}
100%|█████████████████████████████████████████████| 5/5 [00:01<00:00,  2.96it/s]

So it seems to run with no problems from the CLI.

@anhuong (Collaborator) commented Sep 10, 2024

Thanks for running the test, Will!

@anhuong merged commit 32b751c into foundation-model-stack:main on Sep 10, 2024 (7 checks passed).
aluu317 pushed a commit to aluu317/fms-hf-tuning that referenced this pull request on Sep 13, 2024:
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Successfully merging this pull request may close these issues:

  • Fix CLI error for passing in target_modules (#47)