I got a Trainer error: Attempting to unscale FP16 gradients #23165
Comments
You can't train a model loaded in FP16: `torch_dtype=torch.float16` is the culprit here. I don't know how PEFT initializes the layers to train afterwards, but some of them must be in the same dtype. cc @younesbelkada
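Below is a minimal sketch of the distinction (my own illustration, reusing the checkpoint from the reproduction script further down): keep the trainable weights in fp32 by dropping `torch_dtype=torch.float16`, and let the Trainer handle fp16 mixed precision via `fp16=True`.

```python
import torch
from transformers import LlamaForCausalLM, TrainingArguments

# Problematic: the weights themselves are fp16, so the GradScaler later
# refuses to unscale fp16 gradients ("ValueError: Attempting to unscale FP16 gradients.")
# model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf",
#                                          torch_dtype=torch.float16)

# Works: load the trainable weights in fp32 and let the Trainer do fp16
# *mixed precision* (fp32 master weights, fp16 forward/backward passes).
model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")

args = TrainingArguments(
    output_dir="out",
    fp16=True,  # mixed precision, not a pure-fp16 model
)
```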
I second what @sgugger said, and make sure to `pip install --upgrade peft`. In my opinion, to use PEFT at its best, you should load your model in 8bit as follows:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import LlamaTokenizer, LlamaForCausalLM, Trainer

path_to_llama = xxx
model = LlamaForCausalLM.from_pretrained(
    path_to_llama,
    device_map="auto",
    load_in_8bit=True,
)
tokenizer = LlamaTokenizer.from_pretrained(path_to_llama)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_int8_training(model)
model = get_peft_model(model, config)

...  # get your dataset etc. here

trainer = Trainer(
    model=model,
    ...
)
```

Also make sure to use an up-to-date transformers: `pip install --upgrade transformers`.
For reference, I would have a look at how the PEFT slow tests are designed; check here: https://github.com/huggingface/peft/blob/b1059b73aab9043b118ff19b0cf96263ea86248a/tests/test_gpu_examples.py#L114
Thank you for your reply. After updating to the latest PEFT and transformers, all problems are resolved.
Thanks for the answer, it saved me the time of testing whether it is possible to fine-tune a model loaded in FP16.
Hi Younes, thank you for your work on PEFT. I am very grateful that we have these open-source fine-tuning techniques, but I am curious about your opinion on the performance trade-off between LoRA and full fine-tuning, for example in these papers: https://arxiv.org/abs/2304.14454 and https://arxiv.org/abs/2304.08109. Thanks!
Hi @IwanVan

1- Sadly it is not possible to do pure int8 training (i.e. pass the full 8bit model to the optimizer), as I believe this would result in very unstable training: the weight matrix can only be represented in 8bit precision (256 possible values), so the model would probably not learn anything. Although it's not possible to train in pure fp16 either (from my understanding), you can train your model in mixed precision.

2- This seems to be a new paper, so it is the first time I have gone through it; from my understanding it tries to fine-tune Llama on the medical-paper domain. I agree the differences reported there sound quite large. Thinking out loud, maybe the domain gap was too high for that model, but I am not sure. Empirically it has been shown (in the original paper and in what I have seen so far) that you can get very comparable results (sometimes better) with PEFT methods than with full fine-tuning, and on all modalities: vision, text, RLHF, etc. So I would say it really depends on your use case, dataset, etc.

Thanks!
This is because, in the case of tuning the LoRA layers, the base model stays untouched in 8bit, while the LoRA layers that we are going to train are kept in full precision (float32).
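To make that split concrete, here is a small sketch (my own, reusing the `model` from the 8bit LoRA snippet above) that lists the trainable parameters and their dtypes: the LoRA weights come out as float32, while the frozen base weights stay quantized.

```python
# Inspect which parameters will actually be trained and in which dtype.
for name, param in model.named_parameters():
    if param.requires_grad:
        # e.g. "...q_proj.lora_A.weight" -> torch.float32
        print(f"trainable: {name} dtype={param.dtype}")

# Rough count of trainable vs. total parameters
# (PEFT also offers model.print_trainable_parameters() for this).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```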
Hi @younesbelkada, thanks again for your quick response.
I wonder if I could open a new issue in the peft repository to follow up on current research on peft/lora, documenting and analysing similar papers over time, to see whether a reasonable explanation for the performance differences between fine-tuning techniques emerges, and to get more developers involved in the discussion?
Regards,
@younesbelkada Hello, I loaded a 7B Llama for PEFT LoRA fine-tuning on a single V100 but got OOM, is that normal? I am using the default float32. Does it have to be loaded in int8 for LoRA fine-tuning?
@younesbelkada After loading in int8, I got an error like this:
I swear I have not set float16 anywhere in my code.....
hi @lucasjinreal |
@younesbelkada I noticed that LION was merged into master; when will it be released on pip, btw?
Yes, I have used it after setting that. But do you know why 32GB is unable to train with float32? I have to use deepspeed offload now, and int8 training seems slower than offload.
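For context, a back-of-the-envelope estimate (my own, assuming plain AdamW with fp32 optimizer states and ignoring activations) of why a 32GB V100 cannot hold a full-precision 7B model for training:

```python
# Rough memory estimate for full fp32 fine-tuning of a 7B-parameter model
# (assumes AdamW keeping two fp32 states per parameter; activations excluded).
params = 7e9
bytes_per_fp32 = 4

weights = params * bytes_per_fp32               # ~28 GB
gradients = params * bytes_per_fp32             # ~28 GB
optimizer_states = 2 * params * bytes_per_fp32  # ~56 GB (Adam m and v)

total_gb = (weights + gradients + optimizer_states) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB, far above a 32 GB V100
```

With LoRA on an 8bit base model, only the small adapter weights need gradients and optimizer states, which is why that setup fits on a single card.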
hi @lucasjinreal
Awesome!
Yes, int8 can be slower in some cases; you might be interested in FP4 quantization, which should be much faster. It will be part of the announcement today as well, I will keep you posted. Relevant links: https://github.com/artidoro/qlora & #23479
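For readers arriving later, here is a minimal sketch of what that 4-bit (NF4, QLoRA-style) loading looks like; it assumes a transformers and bitsandbytes version newer than the ones discussed in this thread, so treat the exact options as an illustration rather than the announcement itself.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# From here, the LoRA setup is the same as in the 8bit example above.
```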
@younesbelkada Looking forward to it. Do you mean FP4 training? It looks like only recent GPUs such as the H100 support it. Will the new transformers release include this as well?
Could you explain what you mean by not being able to train an fp16 model? Is it because you would need an fp32 copy of the weights for fp16 mixed-precision training?
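For what it's worth, that is indeed the mechanism behind the error: torch's GradScaler refuses to unscale gradients that are themselves fp16, because fp16 mixed precision assumes fp32 master weights (and hence fp32 gradients). A minimal standalone sketch (my own, not from this thread) that reproduces the underlying check:

```python
import torch

# torch.cuda.amp.GradScaler raises on fp16 gradients, which is exactly
# the error reported in this issue.
model = torch.nn.Linear(8, 8).cuda().half()   # fp16 weights, like torch_dtype=torch.float16
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).sum()
scaler.scale(loss).backward()
scaler.unscale_(opt)  # raises ValueError: Attempting to unscale FP16 gradients.
```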
However, my code uses FlashAttention, and FlashAttention does not support float32, only bf16 and fp16. What should I do? This seems to create a contradiction.
System Info
transformers version: 4.28.1

Who can help?
@sgugger

Now, when I add fp16=True, I get `ValueError: Attempting to unscale FP16 gradients.` when running `trainer.train()`.

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
```python
from transformers import LlamaTokenizer, LlamaForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM, LlamaConfig
from peft import prepare_model_for_int8_training, LoraConfig, get_peft_model, get_peft_model_state_dict

merge_tokenizer = LlamaTokenizer.from_pretrained('/home/han/new_store/Llama/merged_tokenizer_hf', padding=True, truncation=True)
print(len(merge_tokenizer))
n = merge_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
len(merge_tokenizer)

from datasets import load_dataset

dataset = load_dataset("json", data_files="./data/alpaca_data_zh_51k.json")
dataset = dataset.filter(lambda x: x["output"] != None)
dataset = dataset.filter(lambda x: x["input"] != None)

def preprocess_function(sample):
    ...  # tokenization logic not included in the report

input_data = dataset['train'].map(preprocess_function, batched=True, remove_columns=['instruction', 'input', 'output'])

import torch

# Base model loaded directly in fp16 -- this is what later clashes with fp16=True
model = LlamaForCausalLM.from_pretrained('decapoda-research/llama-7b-hf', device_map='auto', cache_dir='./cache/', torch_dtype=torch.float16)
model.resize_token_embeddings(len(merge_tokenizer))

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

trainArgs = TrainingArguments(
    output_dir='../ckps_emb',
    do_train=True,
    # per_device_train_batch_size=4,
    auto_find_batch_size=True,
    fp16=True,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=20,
    warmup_steps=100,
    num_train_epochs=2,
    learning_rate=5e-4,
    load_best_model_at_end=True,
)

# Freeze everything except the embedding layer
for name, param in model.named_parameters():
    param.requires_grad_(False)
    if name == 'model.embed_tokens.weight':
        param.requires_grad_(True)
    print(name, "requires_grad:", param.requires_grad)

trainer = Trainer(
    model=model,
    args=trainArgs,
    train_dataset=input_data,
    eval_dataset=input_data,
    data_collator=DataCollatorForLanguageModeling(merge_tokenizer, mlm=False),
)

model.config.use_cache = True
trainer.train()
model.save_pretrained('../ckps/demo_llama71_full')
```
Expected behavior
I expect it not to raise the error ValueError: Attempting to unscale FP16 gradients.