
fix: remove fire ported from Hari's PR #303 (#324)

Conversation

@HarikrishnanBalagopal (Contributor) commented Aug 30, 2024

Description of the change

Remove the second CLI parser (fire): it is unnecessary and throws errors when training with LoRA (for example, when passing target_modules on the command line).
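
For context, the parsing this change converges on looks roughly like the sketch below (an illustration, not the PR's actual diff; the LoraArguments dataclass and its defaults are assumptions patterned on the CLI flags used later in this thread):

# A minimal sketch: parse list-valued LoRA flags like --target_modules with a
# single transformers.HfArgumentParser instead of a second fire-based parser.
from dataclasses import dataclass, field
from typing import List, Optional

from transformers import HfArgumentParser

@dataclass
class LoraArguments:
    # Illustrative LoRA config; field names mirror the CLI flags used below.
    r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    target_modules: Optional[List[str]] = field(default=None)

parser = HfArgumentParser(LoraArguments)
(lora_args,) = parser.parse_args_into_dataclasses(
    args=["--target_modules", "v_proj", "c_proj", "--lora_dropout", "0.05"]
)
print(lora_args.target_modules)  # ['v_proj', 'c_proj']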

Related issue number

Fixes #47

How to verify the PR

make test

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass.

@HarikrishnanBalagopal (Contributor, Author) commented Aug 31, 2024

The errors are completely unrelated to the changes in the PR.

https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/10637170607/job/29490507719?pr=324#step:4:1329

E               ValueError: You passed `packing=False` to the SFTTrainer/SFTConfig, but you didn't pass a `dataset_text_field` or `formatting_func` argument.

It seems to be coming from this test code:

# below args not needed for pretokenized data
data_formatting_args.data_formatter_template = None
data_formatting_args.dataset_text_field = None
data_formatting_args.response_template = None
# update the training data path to tokenized data
data_formatting_args.training_data_path = dataset_path
train_args = copy.deepcopy(TRAIN_ARGS)
train_args.output_dir = tempdir
sft_trainer.train(MODEL_ARGS, data_formatting_args, train_args)

https://github.com/huggingface/trl/blob/d57e4b726561e5ae58fdc335f34029052944a4a3/trl/trainer/sft_trainer.py#L351
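
For reference, the trl check at that line boils down to roughly this (a paraphrase for illustration, not trl's exact code):

# Rough paraphrase of the SFTTrainer validation that raises the error above:
# with packing disabled, SFTTrainer needs either a dataset_text_field or a
# formatting_func to turn examples into text, and the pretokenized-data test
# sets both to None, so the check fires.
def check_sft_text_args(packing, dataset_text_field, formatting_func):
    if not packing and dataset_text_field is None and formatting_func is None:
        raise ValueError(
            "You passed `packing=False` to the SFTTrainer/SFTConfig, but you "
            "didn't pass a `dataset_text_field` or `formatting_func` argument."
        )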

The same error appears in other PR builds as well: https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/10637170607/job/29490507719?pr=324#step:4:1329

@willmj (Collaborator) left a comment

LGTM!

@ashokponkumar (Collaborator) commented

@anhuong can we merge this when you are comfortable with it?

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>
@anhuong (Collaborator) commented Sep 5, 2024

@HarikrishnanBalagopal what testing did you do for this change? We want to ensure that users can still run training from the CLI, with a JSON config, and by importing the python module. The unit tests cover the python-module and JSON-config use cases, so testing the CLI piece should be sufficient. An example of running from the CLI with target_modules and not getting the error would be great.
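
(For reference, the python-module path the unit tests exercise looks roughly like this; the tuning.config.configs import path and the config class names are assumptions patterned on the test snippet quoted above, not verified against the repo.)

# Rough sketch of the importable-module use case covered by the unit tests;
# module paths and config class names here are assumptions.
from tuning import sft_trainer
from tuning.config import configs

model_args = configs.ModelArguments(model_name_or_path="Maykeye/TinyLLama-v0")
data_args = configs.DataArguments(
    training_data_path="tests/data/twitter_complaints_small.json",
    dataset_text_field="output",
    response_template="\n### Label:",
)
train_args = configs.TrainingArguments(
    output_dir="outputs/lora-tuning", num_train_epochs=5
)
sft_trainer.train(model_args, data_args, train_args)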

@willmj (Collaborator) commented Sep 9, 2024

I implemented these changes on my machine and ran the following.

Running LoRA:

% python tuning/sft_trainer.py \
--model_name_or_path Maykeye/TinyLLama-v0 \
--training_data_path tests/data/twitter_complaints_small.json \
--output_dir outputs/lora-tuning \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--response_template "\n### Label:" \
--dataset_text_field "output" \
--use_flash_attn false \
--torch_dtype "float32" \
--peft_method "lora" \
--r 8 \
--lora_dropout 0.05 \
--lora_alpha 16  \
--target_modules "v_proj" "c_proj"

Log:

{'loss': 7.9407, 'grad_norm': 4.002565860748291, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.0}
 20%|█████████                                    | 1/5 [00:00<00:02,  1.74it/s]
{'loss': 3.9542, 'grad_norm': 2.8579704761505127, 'learning_rate': 4.000000000000001e-06, 'epoch': 2.0}
 60%|███████████████████████████                  | 3/5 [00:01<00:00,  3.46it/s]
 80%|████████████████████████████████████         | 4/5 [00:01<00:00,  2.96it/s]
{'loss': 3.8709, 'grad_norm': 1.6323113441467285, 'learning_rate': 0.0, 'epoch': 3.0}
100%|█████████████████████████████████████████████| 5/5 [00:01<00:00,  2.96it/s]
{'train_runtime': 1.7348, 'train_samples_per_second': 28.822, 'train_steps_per_second': 2.882, 'train_loss': 4.7181743621826175, 'epoch': 3.0}
100%|█████████████████████████████████████████████| 5/5 [00:01<00:00,  2.88it/s]

Running fine tuning:

% python tuning/sft_trainer.py \
--model_name_or_path Maykeye/TinyLLama-v0 \
--training_data_path tests/data/twitter_complaints_small.jsonl \
--output_dir outputs/full-tuning \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--response_template "\n### Label:" \
--dataset_text_field "output" \
--use_flash_attn false \
--torch_dtype "float32"

Log:

{'loss': 7.9407, 'grad_norm': 23.814661026000977, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.0}
{'loss': 3.8985, 'grad_norm': 17.079011917114258, 'learning_rate': 4.000000000000001e-06, 'epoch': 2.0}
{'loss': 3.7799, 'grad_norm': 9.279784202575684, 'learning_rate': 0.0, 'epoch': 3.0}
{'train_runtime': 1.6872, 'train_samples_per_second': 29.635, 'train_steps_per_second': 2.963, 'train_loss': 4.659521102905273, 'epoch': 3.0}
100%|█████████████████████████████████████████████| 5/5 [00:01<00:00,  2.96it/s]

So it seems to run with no problems from the CLI.

@anhuong (Collaborator) commented Sep 10, 2024

Thanks for running the test, Will!

@anhuong merged commit 32b751c into foundation-model-stack:main on Sep 10, 2024 (7 checks passed).
aluu317 pushed a commit to aluu317/fms-hf-tuning that referenced this pull request on Sep 13, 2024:
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Successfully merging this pull request may close these issues:

  • Fix CLI error for passing in target_modules (#47)