I got a Trainer error: Attempting to unscale FP16 gradients #23165
Comments
You can't train a model loaded in FP16: `torch_dtype=torch.float16` is the culprit here. I don't know how PEFT initializes the layers to train afterwards, but some of them must be in the same dtype. cc @younesbelkada
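Below is a minimal sketch of the distinction (my own illustration, reusing the checkpoint from the reproduction script further down): keep the trainable weights in fp32 by dropping `torch_dtype=torch.float16`, and let the Trainer handle fp16 mixed precision via `fp16=True`.

```python
import torch
from transformers import LlamaForCausalLM, TrainingArguments

# Problematic: the weights themselves are fp16, so the GradScaler later
# refuses to unscale fp16 gradients ("ValueError: Attempting to unscale FP16 gradients.")
# model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf",
#                                          torch_dtype=torch.float16)

# Works: load the trainable weights in fp32 and let the Trainer do fp16
# *mixed precision* (fp32 master weights, fp16 forward/backward passes).
model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")

args = TrainingArguments(
    output_dir="out",
    fp16=True,  # mixed precision, not a pure-fp16 model
)
```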
I second what @sgugger said, and make sure to `pip install --upgrade peft`. In my opinion, to use PEFT at its best, you should load your model in 8bit as follows:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import LlamaTokenizer, LlamaForCausalLM, Trainer

path_to_llama = xxx
model = LlamaForCausalLM.from_pretrained(
    path_to_llama,
    device_map="auto",
    load_in_8bit=True,
)
tokenizer = LlamaTokenizer.from_pretrained(path_to_llama)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_int8_training(model)
model = get_peft_model(model, config)

...  # get your dataset etc. here

trainer = Trainer(
    model=model,
    ...
)
```

Also make sure to use an up-to-date transformers: `pip install --upgrade transformers`.
For reference, I would have a look at how the PEFT slow tests are designed; check here: https://github.com/huggingface/peft/blob/b1059b73aab9043b118ff19b0cf96263ea86248a/tests/test_gpu_examples.py#L114
Thank you for your reply. After updating to the latest PEFT and transformers, all problems are resolved.
Thanks for the answer, it saved me the time of testing whether it is possible to fine-tune a model loaded in FP16.
Hi Younes, thank you for your work on PEFT. I am very grateful that we have these open-source fine-tuning techniques, but I am curious about your opinion on the performance trade-off between LoRA and full fine-tuning, for example in these papers: https://arxiv.org/abs/2304.14454 and https://arxiv.org/abs/2304.08109. Thanks!
Hi @IwanVan

1- Sadly it is not possible to do pure int8 training (i.e. pass the full 8bit model to the optimizer), as I believe this would result in very unstable training: the weight matrix can only be represented in 8bit precision (256 possible values), so the model would probably not learn anything. Although it's not possible to train in pure fp16 either (from my understanding), you can train your model in mixed precision.

2- This seems to be a new paper, so it is the first time I have gone through it; from my understanding it tries to fine-tune Llama on the medical-paper domain. I agree the differences reported there sound quite large. Thinking out loud, maybe the domain gap was too high for that model, but I am not sure. Empirically it has been shown (in the original paper and in what I have seen so far) that you can get very comparable results (sometimes better) with PEFT methods than with full fine-tuning, and on all modalities: vision, text, RLHF, etc. So I would say it really depends on your use case, dataset, etc.

Thanks!
This is because, in the case of tuning the LoRA layers, the base model stays untouched in 8bit, while the LoRA layers that we are going to train are kept in full precision (float32).
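To make that split concrete, here is a small sketch (my own, reusing the `model` from the 8bit LoRA snippet above) that lists the trainable parameters and their dtypes: the LoRA weights come out as float32, while the frozen base weights stay quantized.

```python
# Inspect which parameters will actually be trained and in which dtype.
for name, param in model.named_parameters():
    if param.requires_grad:
        # e.g. "...q_proj.lora_A.weight" -> torch.float32
        print(f"trainable: {name} dtype={param.dtype}")

# Rough count of trainable vs. total parameters
# (PEFT also offers model.print_trainable_parameters() for this).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```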
Hi @younesbelkada, thanks again for your quick response.
I wonder if I could open a new issue in the peft repository to follow up on current research on peft/lora, documenting and analysing similar papers over time, to see whether a reasonable explanation for the performance differences between fine-tuning techniques emerges, and to get more developers involved in the discussion?
Regards,
@younesbelkada Hello, I loaded a 7B Llama for PEFT LoRA fine-tuning on a single V100 but got OOM, is that normal? I am using the default float32. Does it have to be loaded in int8 for LoRA fine-tuning?
@younesbelkada After loading in int8, I got an error like this:
I swear I have not set float16 anywhere in my code.....
hi @lucasjinreal |
@younesbelkada I noticed that LION was merged into master; when will it be released on pip, btw?
Yes, I have used it after setting that. But do you know why 32GB is unable to train with float32? I have to use deepspeed offload now, and int8 training seems slower than offload.
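For context, a back-of-the-envelope estimate (my own, assuming plain AdamW with fp32 optimizer states and ignoring activations) of why a 32GB V100 cannot hold a full-precision 7B model for training:

```python
# Rough memory estimate for full fp32 fine-tuning of a 7B-parameter model
# (assumes AdamW keeping two fp32 states per parameter; activations excluded).
params = 7e9
bytes_per_fp32 = 4

weights = params * bytes_per_fp32               # ~28 GB
gradients = params * bytes_per_fp32             # ~28 GB
optimizer_states = 2 * params * bytes_per_fp32  # ~56 GB (Adam m and v)

total_gb = (weights + gradients + optimizer_states) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB, far above a 32 GB V100
```

With LoRA on an 8bit base model, only the small adapter weights need gradients and optimizer states, which is why that setup fits on a single card.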
hi @lucasjinreal
Awesome!
Yes, int8 can be slower in some cases; you might be interested in FP4 quantization, which should be much faster. It will be part of the announcement today as well, I will keep you posted. Relevant links: https://github.com/artidoro/qlora & #23479
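For readers arriving later, here is a minimal sketch of what that 4-bit (NF4, QLoRA-style) loading looks like; it assumes a transformers and bitsandbytes version newer than the ones discussed in this thread, so treat the exact options as an illustration rather than the announcement itself.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# From here, the LoRA setup is the same as in the 8bit example above.
```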
@younesbelkada Looking forward to it. Do you mean FP4 training? It looks like only recent GPUs such as the H100 support it. Will the new transformers release include this as well?
Could you explain what you mean by not being able to train an fp16 model? Is it because you would need an fp32 copy of the weights for fp16 mixed-precision training?
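For what it's worth, that is indeed the mechanism behind the error: torch's GradScaler refuses to unscale gradients that are themselves fp16, because fp16 mixed precision assumes fp32 master weights (and hence fp32 gradients). A minimal standalone sketch (my own, not from this thread) that reproduces the underlying check:

```python
import torch

# torch.cuda.amp.GradScaler raises on fp16 gradients, which is exactly
# the error reported in this issue.
model = torch.nn.Linear(8, 8).cuda().half()   # fp16 weights, like torch_dtype=torch.float16
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).sum()
scaler.scale(loss).backward()
scaler.unscale_(opt)  # raises ValueError: Attempting to unscale FP16 gradients.
```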
However, my code uses FlashAttention, and FlashAttention does not support float32, only bf16 and fp16. What should I do? This seems to create a contradiction.
System Info
transformers version: 4.28.1

Who can help?
@sgugger

Now, when I add fp16=True, I get `ValueError: Attempting to unscale FP16 gradients.` when running `trainer.train()`.

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
```python
from transformers import LlamaTokenizer, LlamaForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM, LlamaConfig
from peft import prepare_model_for_int8_training, LoraConfig, get_peft_model, get_peft_model_state_dict

merge_tokenizer = LlamaTokenizer.from_pretrained('/home/han/new_store/Llama/merged_tokenizer_hf', padding=True, truncation=True)
print(len(merge_tokenizer))
n = merge_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
len(merge_tokenizer)

from datasets import load_dataset

dataset = load_dataset("json", data_files="./data/alpaca_data_zh_51k.json")
dataset = dataset.filter(lambda x: x["output"] != None)
dataset = dataset.filter(lambda x: x["input"] != None)

def preprocess_function(sample):
    ...  # tokenization logic not included in the report

input_data = dataset['train'].map(preprocess_function, batched=True, remove_columns=['instruction', 'input', 'output'])

import torch

# Base model loaded directly in fp16 -- this is what later clashes with fp16=True
model = LlamaForCausalLM.from_pretrained('decapoda-research/llama-7b-hf', device_map='auto', cache_dir='./cache/', torch_dtype=torch.float16)
model.resize_token_embeddings(len(merge_tokenizer))

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

trainArgs = TrainingArguments(
    output_dir='../ckps_emb',
    do_train=True,
    # per_device_train_batch_size=4,
    auto_find_batch_size=True,
    fp16=True,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=20,
    warmup_steps=100,
    num_train_epochs=2,
    learning_rate=5e-4,
    load_best_model_at_end=True,
)

# Freeze everything except the embedding layer
for name, param in model.named_parameters():
    param.requires_grad_(False)
    if name == 'model.embed_tokens.weight':
        param.requires_grad_(True)
    print(name, "requires_grad:", param.requires_grad)

trainer = Trainer(
    model=model,
    args=trainArgs,
    train_dataset=input_data,
    eval_dataset=input_data,
    data_collator=DataCollatorForLanguageModeling(merge_tokenizer, mlm=False),
)

model.config.use_cache = True
trainer.train()
model.save_pretrained('../ckps/demo_llama71_full')
```
Expected behavior
I expect it not to raise the error ValueError: Attempting to unscale FP16 gradients.