Experiment XPU QLora Finetuning #8937
Conversation
Force-pushed from 806c87c to c8a48df
Force-pushed from a0dfb57 to 00c4e8c
```diff
-if x_2d.shape[0] > 1 and x_2d.dtype == torch.float32:
+# sometimes fp16 cause nan and training instability
+# disable the conversion when training
+if self.conver_to_half and x_2d.shape[0] > 1 and x_2d.dtype == torch.float32:
     x_2d = x_2d.half()
```
Is `half` fp16 only? We can try bf16 for training later.
Potentially the user can still use fp16 or bf16 training via `autocast` even if `self.convert_to_half` is false. What we do here is to respect the user's intent if they decide to use fp32.
`autocast` often fails.
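To make the autocast option discussed above concrete, here is a minimal user-side sketch of bf16 autocast training on XPU. This is an illustrative assumption rather than code from this PR: the `training_step` helper is hypothetical, and `torch.xpu.amp.autocast` is assumed to be provided by the installed IPEX build.

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the XPU backend

def training_step(model, batch, optimizer):
    """One step that keeps the weights in fp32 and lets autocast pick the
    compute dtype, instead of force-converting activations with .half()."""
    optimizer.zero_grad()
    # bf16 autocast on XPU; torch.xpu.amp.autocast is assumed to be exposed
    # by the installed IPEX build.
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

When this kind of autocast is not used, the `conver_to_half` flag above lets the layer stay in fp32 if the user asked for it.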
```python
    return model


def prepare_model_for_kbit_training(model, use_gradient_checkpointing=True):
```
Do we want to set `model.is_loaded_in_4bit` to true, so as to directly reuse the original `prepare_model_for_kbit_training`?
I guess we can, but I think it might trigger some other hard-coded `bitsandbytes`-related behavior in `transformers` (e.g. we cannot call `model.to(dtype)`), so this will require more testing. I can explore this option in follow-up PRs.
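For reference, the kind of preparation that peft's original `prepare_model_for_kbit_training` performs is sketched below. This is a simplified assumption of its behavior (the real helper also does dtype handling for norm layers and embeddings), not the implementation added in this PR.

```python
def prepare_model_for_kbit_training(model, use_gradient_checkpointing=True):
    # Freeze all base-model parameters; only the LoRA adapter weights will train.
    for param in model.parameters():
        param.requires_grad = False

    if use_gradient_checkpointing:
        # Make sure inputs to the frozen base model require grads, otherwise
        # gradient checkpointing would cut the adapter's backward path.
        if hasattr(model, "enable_input_require_grads"):
            model.enable_input_require_grads()
        model.gradient_checkpointing_enable()

    return model
```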
```bash
# you can install specific ipex/torch version for your need
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install git+https://github.com/huggingface/transformers.git@95fe0f5
pip install peft
```
We need to specify a specific version of `peft` (and shall we add it to the dependencies?)
I added the `peft` version. I think we can add `peft` to the dependencies once we can get a stable `transformers` release, and then we can add/update them together.
LGTM
```diff
@@ -0,0 +1,50 @@
+# Q-Lora (experimental support)
+
+This example demonstrate how to finetune a llama2-7b model use Big-LLM 4bit optimizations using [Intel GPUs](../README.md).
```
demonstrate -> demonstrates
```md
This example demonstrate how to finetune a llama2-7b model use Big-LLM 4bit optimizations using [Intel GPUs](../README.md).

## 0. Requirements
To run these examples with BigDL-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
```
these examples -> this example
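As a side note for readers of this README diff, here is a rough sketch of what attaching a LoRA adapter for QLoRA-style finetuning can look like. The `peft` calls are standard, but the low-bit model loading is only described in a comment because the exact BigDL-LLM loading API is an assumption here, and this PR ships its own QLoRA helpers that may wrap these calls.

```python
from peft import LoraConfig, get_peft_model

def attach_lora_adapter(model):
    """`model` is assumed to be a 4-bit low-bit model already loaded on an
    Intel GPU (the exact BigDL-LLM loading call is not shown in this diff)."""
    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj"],  # typical llama2 attention projections
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # only the adapter weights should be trainable
    return model
```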
```diff
@@ -21,4 +21,4 @@
     AutoModelForSequenceClassification, AutoModelForMaskedLM, \
     AutoModelForNextSentencePrediction, AutoModelForMultipleChoice, \
     AutoModelForTokenClassification
-from .modelling_bigdl import *
+from .modelling_bigdl import *
```
revert this?
```python
        return result


    @staticmethod
```
is this decorator needed?
Weirdly, yes. I did not add it at first, and it did not work.
Is there an example of using the finetuned model (save/load/inference)?
Working on that. Currently, the user can use the same code as in alpaca-lora to merge the trained adapter and export a merged checkpoint. Using the adapter directly without merging needs our own PeftModel.
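For anyone looking for the save/load/inference flow in the meantime, below is a hedged sketch of merging a trained adapter into the base model with standard `peft` APIs. The model name and paths are placeholders, not values from this PR, and as noted above, using the unmerged adapter directly would need this PR's own PeftModel.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder identifiers -- illustrative only, not taken from this PR.
BASE_MODEL = "meta-llama/Llama-2-7b-hf"
ADAPTER_PATH = "./qlora-output"

# Load the full-precision base model, attach the trained LoRA adapter,
# then fold the adapter weights into the base weights and save the result.
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
merged = model.merge_and_unload()
merged.save_pretrained("./llama2-7b-qlora-merged")
```

The merged checkpoint can then be loaded for inference like any regular `transformers` model.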
* Support xpu finetuning
* support xpu finetuning
* fix style
* fix style
* fix style
* refine example
* add readme
* refine readme
* refine api
* fix fp16
* fix example
* refactor
* fix style
* fix compute type
* add qlora
* refine training args
* fix example
* fix style
* fast path forinference
* address comments
* refine readme
* revert lint
Description
Example referenced: https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing
Current output on llama2-13b (bs=4, gradient_acc_steps=1, warmup_steps=20, max_steps=200)
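For reproducibility, the reported run settings (bs=4, gradient_acc_steps=1, warmup_steps=20, max_steps=200) map onto `transformers.TrainingArguments` roughly as below. Only those four values come from this PR; the remaining fields are illustrative assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=4,   # bs=4 (reported above)
    gradient_accumulation_steps=1,   # reported above
    warmup_steps=20,                 # reported above
    max_steps=200,                   # reported above
    learning_rate=2e-4,              # assumed, not reported in this PR
    logging_steps=20,                # assumed, not reported in this PR
    output_dir="./qlora-output",     # assumed placeholder
)
```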