The following warning appears: tried to get lr value before scheduler/optimizer started stepping, returning lr=0 #134
Comments
What if you set logging_steps to 10 or above?
With 10 or above it definitely shows up, but the problem is that the bloom config sets "gradient_accumulation_steps": 32, which means every logged step already covers 32 batches. If the first few logged steps still show no learning rate, something seems a bit off.
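(Back-of-the-envelope arithmetic for how much data sits behind one logged step; the GPU count below is an assumption, not from the issue.)

per_device_train_batch_size = 1      # from the bloom config in this issue
gradient_accumulation_steps = 32     # from the bloom config in this issue
num_gpus = 8                         # assumed; adjust to the actual setup

samples_per_logged_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(samples_per_logged_step)       # 256 samples already consumed by a single logged step (logging_steps=1)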
I have seen something similar in the transformers issues. One explanation is that setting the lr and optimizer in the deepspeed config causes this; another is that the model was pretrained in bf16 but is now being run with fp16. The issue is as follows:
If you run with the official bloom config and deepspeed config, does the lr = 0 problem appear?
Is your machine an A100? We did not run into the lr=0 problem in our experiments.
It is an 80GB A100. If you set logging_steps to 1, does it happen on your side?
We will find time to try it and see whether we can reproduce the problem.
Great, looking forward to your feedback.
Hi, I ran the following experiments with the bloom config, deepspeed config #1, and deepspeed config #2 (the latter close to the officially provided config):
"overwrite": true, "scheduler": {
When fine-tuning the 1b1 model alone, without deepspeed, the learning rate changes as follows:
With deepspeed config #1, the learning rate changes as follows:
With deepspeed config #2, the learning rate changes as follows:
Looking at the Trainer, its default optimizer appears to be AdamW, but the optimizer type in the officially provided deepspeed config is adam. Also, if the deepspeed config includes fp16 and an lr scheduler, the learning rate is 0 for the first few steps. @xianghuisun
One more question: when doing instruction fine-tuning with bloom, does the vocabulary of the bloom-7b1 model need to be expanded? @xianghuisun
Why does the learning rate keep going up as training proceeds?
That is the warmup_lr.
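To illustrate why a rising learning rate is expected during warmup, here is a minimal sketch with a standard linear warmup schedule (the warmup and total step counts are made-up example values, not from this issue):

import torch
from transformers import get_linear_schedule_with_warmup

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

for step in range(5):
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr()[0])  # climbs linearly toward 1e-5 during warmup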
For me the lr stays at 0 the whole time. What could be causing that? The warning never goes away.
Try removing the fp16 and lr scheduler blocks from the deepspeed config, switching the optimizer to AdamW, and running with my configuration.
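For reference, a minimal sketch of the kind of DeepSpeed config that comment describes: no "fp16" block, no "scheduler" block, optimizer set to AdamW. The exact fields and "auto" values below are illustrative assumptions rather than the commenter's file; the dict can be passed via TrainingArguments(deepspeed=...) or saved as JSON.

ds_config = {
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # No "fp16" and no "scheduler" section: the HF Trainer then supplies its own
    # lr scheduler instead of DeepSpeed's WarmupLR.
}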
I have tried those settings and still get the same problem. I did not even enable warmup, I am using bf16, multi-node multi-GPU. The issue now is that it is unpredictable how many steps it takes for the lr to move off 0: sometimes it happens quickly, sometimes it takes a few hundred steps, and sometimes it never does. I never hit this when fine-tuning other models... could it be a hardware or environment problem?
I don't think it is a hardware or environment problem. I pasted a transformers issue earlier in this thread; this may be caused by the parameters bloom was pretrained with. I am not certain, so I hope the maintainers can find time to verify it and track down the cause.
I am using llama and hit the same problem.
Is there a solution, bro? @HalcyonLiang
I did not dig into the root cause; I just compared different configs and switched to one that avoids the problem.
You really do have a lot of GPUs, impressive.
I recently ran into this problem as well while fine-tuning llama-7b-hf with peft LoRA. It turned out to be a library version issue: downgrading transformers to 4.28.0 and deepspeed to 0.8.3 fixed it.
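A quick check to confirm you actually ended up on the version combination reported as working here (4.28.0 / 0.8.3 are the commenter's values, not a general requirement):

import importlib.metadata as md

print("transformers", md.version("transformers"))  # expected: 4.28.0
print("deepspeed", md.version("deepspeed"))        # expected: 0.8.3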
Thanks! Your method worked for me!
Perhaps the batch size is set so large that it leads to "CUDA out of memory", but the program does not report an error. Try making the "train_micro_batch_size_per_gpu" parameter smaller. Here is what I tried: train_micro_batch_size_per_gpu = 4 and train_micro_batch_size_per_gpu = 1.
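For reference, that parameter lives in the DeepSpeed config; a minimal fragment with the values the commenter reports trying:

ds_config_fragment = {
    "train_micro_batch_size_per_gpu": 1,  # the commenter also tried 4
}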
Thanks. Downgrading transformers to 4.28.0 while keeping deepspeed at 0.12.6 also solved the problem for me.
I hit this problem (lr stuck at 0) while fine-tuning flan-t5 models with deepspeed. Of the methods above, only downgrading transformers worked, and deepspeed did not need to be downgraded: transformers==4.40 -> 4.28.1, deepspeed==0.9.3.
Not sure whether this is a feature or a bug [/doge]: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#L912

def _get_learning_rate(self):
    if self.is_deepspeed_enabled:
        # with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may
        # not run for the first few dozen steps while loss scale is too large, and thus during
        # that time `get_last_lr` will fail if called during that warm up stage, so work around it:
        try:
            last_lr = self.lr_scheduler.get_last_lr()[0]
        except AssertionError as e:
            if "need to call step" in str(e):
                logger.warning("tried to get lr value before scheduler/optimizer started stepping, returning lr=0")
                last_lr = 0
            else:
                raise
    else:
        if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            last_lr = self.optimizer.param_groups[0]["lr"]
        else:
            last_lr = self.lr_scheduler.get_last_lr()[0]
        if torch.is_tensor(last_lr):
            last_lr = last_lr.item()
    return last_lr
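For what it's worth, here is a small sketch, not DeepSpeed's actual code, of the mechanism described in the comment above: under fp16 with dynamic loss scaling, a step whose gradients overflow is skipped entirely, so the lr scheduler never advances and get_last_lr() has nothing to return yet, which is exactly when the Trainer falls back to logging lr=0. All names here are illustrative assumptions.

def fp16_training_step(optimizer, lr_scheduler, grads_overflowed, loss_scale):
    # Illustration only: the structure is an assumption, not DeepSpeed internals.
    if grads_overflowed:
        # The whole step is skipped: neither optimizer.step() nor lr_scheduler.step()
        # runs, and the loss scale is reduced before the next attempt.
        return loss_scale / 2
    optimizer.step()
    lr_scheduler.step()  # only after this does get_last_lr() report a real value
    return loss_scale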
Hello, when fine-tuning the bloom-7b model on an instruction-tuning dataset with the finetune script, the first few steps produce:
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
What causes this warning?
The bloom config is:
{
"model_type": "bloom",
"model_name_or_path": "bigscience/bloomz-7b1-mt",
"data_path": "data/res/merge_data.json",
"output_dir": "trained_models/bloom",
"per_device_train_batch_size": 1,
"num_epochs": 2,
"learning_rate": 1e-5,
"cutoff_len": 1024,
"val_set_size": 1000,
"val_set_rate": 0.1,
"save_steps": 1000,
"eval_steps": 1000,
"logging_steps": 1,
"gradient_accumulation_steps": 32
}
The deepspeed config is:
{
"train_batch_size": "auto",
"overwrite":true,
"gradient_accumulation_steps": "auto",
"fp16": {
"enabled": true,
"min_loss_scale": 1,
"opt_level": "O2"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}