
The following warning appears: tried to get lr value before scheduler/optimizer started stepping, returning lr=0 #134

Open
ZeyuTeng96 opened this issue Apr 9, 2023 · 30 comments

Comments

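@ZeyuTeng96

Hello. While fine-tuning the bloom-7b model on an instruction-tuning dataset with the finetune script, the following appears during the first few steps:

tried to get lr value before scheduler/optimizer started stepping, returning lr=0

What causes this warning?

The bloom config is: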
{
"model_type": "bloom",
"model_name_or_path": "bigscience/bloomz-7b1-mt",
"data_path": "data/res/merge_data.json",
"output_dir": "trained_models/bloom",
"per_device_train_batch_size": 1,
"num_epochs": 2,
"learning_rate": 1e-5,
"cutoff_len": 1024,
"val_set_size": 1000,
"val_set_rate": 0.1,
"save_steps": 1000,
"eval_steps": 1000,
"logging_steps": 1,
"gradient_accumulation_steps": 32
}

The deepspeed config is:
{
"train_batch_size": "auto",

"optimizer": {
  "type": "Adam",
  "params": {
    "lr": "auto",
    "betas": [
      0.9,
      0.999
    ],
    "eps": "auto",
    "weight_decay": "auto"
  }
},

"overwrite":true,
"gradient_accumulation_steps": "auto",
"fp16": {
"enabled": true,
"min_loss_scale": 1,
"opt_level": "O2"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},

"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}

@xianghuisun
Collaborator

xianghuisun commented Apr 9, 2023

What happens if you set logging_steps to 10 or higher?

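@ZeyuTeng96
Author

What happens if you set logging_steps to 10 or higher?

With 10 or higher there will certainly be a learning rate by then. The issue is that the bloom config sets "gradient_accumulation_steps": 32, so every logged step has already gone through 32 batches. If the first few steps still have no learning rate, something seems off.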

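@ZeyuTeng96
Author

What happens if you set logging_steps to 10 or higher?

I have seen something similar in the transformers issues. One explanation given there is that the lr and optimizer settings in the deepspeed config cause it; another is that the model was originally bf16 but is now being run with fp16?

The issue is:
huggingface/transformers#14531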

@ZeyuTeng96
Author

What happens if you set logging_steps to 10 or higher?

May I ask: if I run with the official bloom config and deepspeed config, does the lr = 0 problem occur?

@xianghuisun
Collaborator

igscience/bloomz-7b1-mt", "data_path": "data/res/merge_data.json", "output_dir": "trained_models/bloom", "per_device_train_batch_size": 1, "num_epochs": 2, "learning_rate": 1e-5, "cutoff_len": 1024, "val_set_size": 1000, "val_set_rate": 0.1, "save_steps": 1000, "eval_steps": 1000, "logging_steps": 1, "gradient_accumulation_steps": 32 }

The deepspeed config is: { "train_batch_size": "auto

Is the machine you are running on an A100? We did not run into the lr=0 problem in our experiments.

@ZeyuTeng96
Author

igscience/bloomz-7b1-mt", "data_path": "data/res/merge_data.json", "output_dir": "trained_models/bloom", "per_device_train_batch_size": 1, "num_epochs": 2, "learning_rate": 1e-5, "cutoff_len": 1024, "val_set_size": 1000, "val_set_rate": 0.1, "save_steps": 1000, "eval_steps": 1000, "logging_steps": 1, "gradient_accumulation_steps": 32 }
The deepspeed config is: { "train_batch_size": "auto

Is the machine you are running on an A100? We did not run into the lr=0 problem in our experiments.

It is an 80G A100. Would this happen with logging step set to 1?

@xianghuisun
Collaborator

xianghuisun commented Apr 9, 2023 via email

@ZeyuTeng96
Author

We will find time to try this and see whether we can reproduce the problem.

------------------ Original message ------------------
From: "LianjiaTech/BELLE"; Sent: Sunday, April 9, 2023, 7:08 PM; Subject: Re: [LianjiaTech/BELLE] The following warning appears: tried to get lr value before scheduler/optimizer started stepping, returning lr=0 (Issue #134)

Great, looking forward to your feedback.

@ZeyuTeng96
Author

Hello, I ran the following experiments. The bloom config is:
{
"model_type": "bloom",
"model_name_or_path": "bigscience/bloom-1b1",
"data_path": "data/trans_1.json",
"output_dir": "trained_models/bloom",
"per_device_train_batch_size": 1,
"num_epochs": 2,
"learning_rate": 1e-5,
"cutoff_len": 1024,
"val_set_size": 1000,
"val_set_rate": 0.1,
"save_steps": 1000,
"eval_steps": 1000,
"logging_steps": 1,
"gradient_accumulation_steps": 32
}

deepspeed config #1 is:
{
"train_batch_size": "auto",
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"overwrite":true,
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}

deepspeed config #2 is (close to the officially provided config):
{
"train_batch_size": "auto",

"optimizer": {
  "type": "Adam",
  "params": {
    "lr": "auto",
    "betas": [
      0.9,
      0.999
    ],
    "eps": "auto",
    "weight_decay": "auto"
  }
},

"overwrite":true,
"gradient_accumulation_steps": "auto",
"fp16": {
"enabled": true,
"min_loss_scale": 1,
"opt_level": "O2"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},

"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}

@ZeyuTeng96
Author

When fine-tuning the 1b1 model on its own, without deepspeed, the learning rate changes as follows:
{'loss': 2.6999, 'learning_rate': 5.263157894736843e-07, 'epoch': 0.01}
{'loss': 2.7946, 'learning_rate': 1.0526315789473685e-06, 'epoch': 0.02}
{'loss': 3.1472, 'learning_rate': 1.5789473684210526e-06, 'epoch': 0.03}
{'loss': 2.7722, 'learning_rate': 2.105263157894737e-06, 'epoch': 0.04}
{'loss': 2.9574, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.05}
{'loss': 2.7037, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.07}
{'loss': 2.9451, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.08}
{'loss': 2.8337, 'learning_rate': 3.157894736842105e-06, 'epoch': 0.09}
{'loss': 2.9723, 'learning_rate': 3.6842105263157896e-06, 'epoch': 0.1}
{'loss': 3.008, 'learning_rate': 4.210526315789474e-06, 'epoch': 0.11}
{'loss': 3.0198, 'learning_rate': 4.736842105263158e-06, 'epoch': 0.12}
{'loss': 2.9892, 'learning_rate': 5.263157894736842e-06, 'epoch': 0.13}
{'loss': 2.4021, 'learning_rate': 5.789473684210527e-06, 'epoch': 0.14}
{'loss': 2.344, 'learning_rate': 5.789473684210527e-06, 'epoch': 0.15}
{'loss': 2.4769, 'learning_rate': 6.31578947368421e-06, 'epoch': 0.16}
{'loss': 2.2217, 'learning_rate': 6.842105263157896e-06, 'epoch': 0.18}
{'loss': 2.4098, 'learning_rate': 6.842105263157896e-06, 'epoch': 0.19}
{'loss': 1.9803, 'learning_rate': 7.368421052631579e-06, 'epoch': 0.2}
{'loss': 2.1771, 'learning_rate': 7.894736842105265e-06, 'epoch': 0.21}
{'loss': 2.4345, 'learning_rate': 8.421052631578948e-06, 'epoch': 0.22}
{'loss': 2.4525, 'learning_rate': 8.947368421052632e-06, 'epoch': 0.23}
{'loss': 2.585, 'learning_rate': 9.473684210526315e-06, 'epoch': 0.24}
{'loss': 2.7307, 'learning_rate': 1e-05, 'epoch': 0.25}
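
The logged values are consistent with a linear warmup: every distinct value is a multiple of 1e-05 / 19. The sketch below reproduces that pattern under the assumption of 19 warmup steps, a number inferred from the log rather than stated anywhere in the thread. `linear_warmup_lr` is a hypothetical helper, and it ignores whatever decay follows the warmup.

```python
def linear_warmup_lr(step: int, base_lr: float = 1e-5, warmup_steps: int = 19) -> float:
    """Learning rate after `step` scheduler steps under a plain linear warmup.

    warmup_steps=19 is an assumption inferred from the logged values.
    """
    if step >= warmup_steps:
        return base_lr  # warmup finished; any later decay is not modeled here
    return base_lr * step / warmup_steps


# Print the first few warmup values and the final one for comparison with the log.
for step in (1, 2, 3, 19):
    print(step, linear_warmup_lr(step))
```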

@ZeyuTeng96
Author

With deepspeed config 1, the learning rate changes as follows:
{'loss': 2.8091, 'learning_rate': 5.263157894736843e-07, 'epoch': 0.01}
{'loss': 2.8488, 'learning_rate': 1.0526315789473685e-06, 'epoch': 0.02}
{'loss': 2.9292, 'learning_rate': 1.5789473684210526e-06, 'epoch': 0.03}
{'loss': 2.8395, 'learning_rate': 2.105263157894737e-06, 'epoch': 0.04}
{'loss': 3.1188, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.05}
{'loss': 2.9179, 'learning_rate': 3.157894736842105e-06, 'epoch': 0.07}
{'loss': 2.8102, 'learning_rate': 3.6842105263157896e-06, 'epoch': 0.08}
{'loss': 2.8484, 'learning_rate': 4.210526315789474e-06, 'epoch': 0.09}
{'loss': 2.9805, 'learning_rate': 4.736842105263158e-06, 'epoch': 0.1}
{'loss': 2.7548, 'learning_rate': 5.263157894736842e-06, 'epoch': 0.11}
{'loss': 2.6809, 'learning_rate': 5.789473684210527e-06, 'epoch': 0.12}
{'loss': 2.5852, 'learning_rate': 6.31578947368421e-06, 'epoch': 0.13}
{'loss': 2.6456, 'learning_rate': 6.842105263157896e-06, 'epoch': 0.14}
{'loss': 2.6222, 'learning_rate': 7.368421052631579e-06, 'epoch': 0.15}
{'loss': 2.2331, 'learning_rate': 7.894736842105265e-06, 'epoch': 0.16}
{'loss': 2.2346, 'learning_rate': 8.421052631578948e-06, 'epoch': 0.18}
{'loss': 1.9481, 'learning_rate': 8.947368421052632e-06, 'epoch': 0.19}
{'loss': 1.98, 'learning_rate': 9.473684210526315e-06, 'epoch': 0.2}
{'loss': 2.2987, 'learning_rate': 1e-05, 'epoch': 0.21}

@ZeyuTeng96
Author

With deepspeed config 2, the learning rate changes as follows:
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 2.7112, 'learning_rate': 0, 'epoch': 0.01}
1%|▉ | 2/182 [01:05<1:37:41, 32.56s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 2.9341, 'learning_rate': 0, 'epoch': 0.02}
2%|█▎ | 3/182 [01:37<1:36:59, 32.51s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 3.093, 'learning_rate': 0, 'epoch': 0.03}
{'loss': 2.9688, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 2.9455, 'learning_rate': 2.3540891336663827e-06, 'epoch': 0.05}
{'loss': 3.0102, 'learning_rate': 2.3540891336663827e-06, 'epoch': 0.07}
{'loss': 3.1245, 'learning_rate': 3.73114300021637e-06, 'epoch': 0.08}
{'loss': 2.8258, 'learning_rate': 4.7081782673327655e-06, 'epoch': 0.09}
{'loss': 2.9814, 'learning_rate': 5.466025697329025e-06, 'epoch': 0.1}
{'loss': 2.5915, 'learning_rate': 5.466025697329025e-06, 'epoch': 0.11}
{'loss': 2.8165, 'learning_rate': 6.0852321338827525e-06, 'epoch': 0.12}
{'loss': 2.6727, 'learning_rate': 6.60876371636064e-06, 'epoch': 0.13}
{'loss': 2.7603, 'learning_rate': 7.062267400999148e-06, 'epoch': 0.14}
{'loss': 2.0928, 'learning_rate': 7.46228600043274e-06, 'epoch': 0.15}
{'loss': 2.4763, 'learning_rate': 7.820114830995408e-06, 'epoch': 0.16}
{'loss': 2.2755, 'learning_rate': 8.143810382095967e-06, 'epoch': 0.18}
{'loss': 2.07, 'learning_rate': 8.439321267549136e-06, 'epoch': 0.19}
{'loss': 1.9242, 'learning_rate': 8.711164930263437e-06, 'epoch': 0.2}
{'loss': 2.0989, 'learning_rate': 8.962852850027021e-06, 'epoch': 0.21}
{'loss': 1.9225, 'learning_rate': 9.197168697545394e-06, 'epoch': 0.22}
{'loss': 1.766, 'learning_rate': 9.416356534665531e-06, 'epoch': 0.23}
{'loss': 2.6338, 'learning_rate': 9.416356534665531e-06, 'epoch': 0.24}
{'loss': 2.7871, 'learning_rate': 9.622251858852542e-06, 'epoch': 0.25}
{'loss': 3.1649, 'learning_rate': 9.816375134099122e-06, 'epoch': 0.26}
{'loss': 2.9512, 'learning_rate': 1e-05, 'epoch': 0.27}
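
The nonzero values in this log do not lie on a straight line. They match a logarithmic warmup of the form lr = max_lr * ln(step + 1) / ln(warmup_num_steps), which appears to be what DeepSpeed's WarmupLR does with its default log warmup type. The sketch below is a re-derivation from the logged numbers, not DeepSpeed code: `log_warmup_lr` is a hypothetical helper, and warmup_min_lr = 0, warmup_max_lr = 1e-5 and warmup_num_steps = 19 are assumptions inferred from the log.

```python
import math


def log_warmup_lr(last_step: int, max_lr: float = 1e-5, min_lr: float = 0.0,
                  warmup_num_steps: int = 19) -> float:
    """Learning rate under a logarithmic warmup (values assumed, see lead-in)."""
    if last_step < warmup_num_steps:
        # gamma grows from 0 (last_step = 0) to 1 (last_step = warmup_num_steps - 1)
        gamma = math.log(last_step + 1) / math.log(warmup_num_steps)
    else:
        gamma = 1.0
    return min_lr + (max_lr - min_lr) * gamma


# last_step = 0 gives 0.0, which fits the single 'learning_rate': 0.0 entry
# logged right after the warnings stop.
for last_step in (0, 1, 2, 3, 18):
    print(last_step, log_warmup_lr(last_step))
```

Under this reading, the warnings cover the steps where the scheduler had not stepped at all, and the following 0.0 is simply the first value of the log warmup.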

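@ZeyuTeng96
Author

I checked, and the trainer's default optimizer appears to be adamw, but the optimizer type in the officially provided deepspeed config file is adam. Also, once fp16 and the lr scheduler are added to the deepspeed config file, the learning rate is 0 for the first few steps. @xianghuisun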

@ZeyuTeng96
Author

One more question: when using bloom for instruction fine-tuning, does the vocabulary of the bloom-7b1 model need to be expanded? @xianghuisun

@wind91725

The learning rate keeps getting larger as training goes on?

@ZeyuTeng96
Author

The learning rate keeps getting larger as training goes on?

That is warmup_lr.

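@hao-xyz

hao-xyz commented Apr 12, 2023

The learning rate keeps getting larger as training goes on?

That is warmup_lr.

In my case the lr stays at 0 the whole time. What could cause that? The warning keeps appearing.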

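@ZeyuTeng96
Author

The learning rate keeps getting larger as training goes on?

That is warmup_lr.

In my case the lr stays at 0 the whole time. What could cause that? The warning keeps appearing.

Try removing the fp16 and lr scheduler settings from the deepspeed config and changing the optimizer to adamw, that is, try my config.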
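
The suggested change can be sketched as a small transformation on the deepspeed config, turning something like config #2 into something like config #1. This is only an illustration: `simplify_ds_config` is a hypothetical helper, and the dictionary used in the usage lines is a trimmed stand-in for the real file.

```python
import copy


def simplify_ds_config(config: dict) -> dict:
    """Return a copy without the fp16 and scheduler sections, with AdamW as optimizer."""
    new_config = copy.deepcopy(config)   # leave the caller's dict untouched
    new_config.pop("fp16", None)         # drop fp16 and its dynamic loss scaling
    new_config.pop("scheduler", None)    # let the trainer create its own scheduler
    if "optimizer" in new_config:
        new_config["optimizer"]["type"] = "AdamW"
    return new_config


# Trimmed stand-in for deepspeed config #2.
config_2 = {
    "optimizer": {"type": "Adam", "params": {"lr": "auto"}},
    "fp16": {"enabled": True, "min_loss_scale": 1},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": "auto"}},
    "zero_optimization": {"stage": 3},
}
print(simplify_ds_config(config_2))
```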

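@hao-xyz

hao-xyz commented Apr 12, 2023

The learning rate keeps getting larger as training goes on?

That is warmup_lr.

In my case the lr stays at 0 the whole time. What could cause that? The warning keeps appearing.

Try removing the fp16 and lr scheduler settings from the deepspeed config and changing the optimizer to adamw, that is, try my config.

I have tried those settings and the same problem occurs. I do not even have warmup enabled, and I am using bf16 on multiple nodes with multiple GPUs. The problem right now is that it is unpredictable how many steps it takes for the lr to leave 0: sometimes it leaves 0 quickly, sometimes it takes a few hundred steps, and sometimes it never leaves 0. I also never ran into this when fine-tuning other models before... Could it be a hardware or environment problem?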

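@ZeyuTeng96
Author

The learning rate keeps getting larger as training goes on?

That is warmup_lr.

In my case the lr stays at 0 the whole time. What could cause that? The warning keeps appearing.

Try removing the fp16 and lr scheduler settings from the deepspeed config and changing the optimizer to adamw, that is, try my config.

I have tried those settings and the same problem occurs. I do not even have warmup enabled, and I am using bf16 on multiple nodes with multiple GPUs. The problem right now is that it is unpredictable how many steps it takes for the lr to leave 0: sometimes it leaves 0 quickly, sometimes it takes a few hundred steps, and sometimes it never leaves 0. I also never ran into this when fine-tuning other models before... Could it be a hardware or environment problem?

I do not think it is a hardware or environment problem. I linked a transformers issue earlier in this thread. The problem may come from the parameters used when bloom was pretrained. That is only a possibility and I am not sure about it; I hope the maintainers can verify it when they have time and find the cause.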

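@hao-xyz

hao-xyz commented Apr 13, 2023

The learning rate keeps getting larger as training goes on?

That is warmup_lr.

In my case the lr stays at 0 the whole time. What could cause that? The warning keeps appearing.

Try removing the fp16 and lr scheduler settings from the deepspeed config and changing the optimizer to adamw, that is, try my config.

I have tried those settings and the same problem occurs. I do not even have warmup enabled, and I am using bf16 on multiple nodes with multiple GPUs. The problem right now is that it is unpredictable how many steps it takes for the lr to leave 0: sometimes it leaves 0 quickly, sometimes it takes a few hundred steps, and sometimes it never leaves 0. I also never ran into this when fine-tuning other models before... Could it be a hardware or environment problem?

I do not think it is a hardware or environment problem. I linked a transformers issue earlier in this thread. The problem may come from the parameters used when bloom was pretrained. That is only a possibility and I am not sure about it; I hope the maintainers can verify it when they have time and find the cause.

I am using llama and see the same problem.
Below is the comment at the place where huggingface emits this warning. However, I am using bf16 with zero2 and still get the warning: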
# with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may
# not run for the first few dozen steps while loss scale is too large, and thus during
# that time get_last_lr will fail if called during that warm up stage, so work around it:

@ZeyuTeng96
Author

Did you find a solution? @HalcyonLiang

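@hao-xyz

hao-xyz commented Apr 17, 2023

Did you find a solution? @HalcyonLiang

I did not dig into the root cause. I only compared different configurations and avoided the problem by switching to ones that work:
7B on 8 A100s: trains without zero, and the problem does not occur.
7B on 16 A100s, zero2, optimizer offload disabled: the problem does not occur.
13B on 16 A100s, zero3, optimizer and params offload disabled: the problem does not occur.
13B on 24 A100s, zero2, optimizer offload disabled: the problem occurs. (From what I could observe, when the gradients are partitioned across GPUs, memory usage differs quite a lot between cards, and the LR only starts its gradual warmup once the allocation has become roughly even.)
I may run more tests when I have time. This is for reference only.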

@ZeyuTeng96
Author

You really have a lot of GPUs. Impressive.

@WalterSumbon

I recently ran into this problem too while fine-tuning llama-7b-hf with peft lora. It turned out to be a library version issue: downgrading transformers to 4.28.0 and deepspeed to 0.8.3 fixed it.
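
Since the reported fix depends on library versions, it helps to record which versions are installed before and after a downgrade. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and packages that are not installed are reported as None.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version string, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The versions reported as working in this comment are transformers 4.28.0
# and deepspeed 0.8.3.
print(installed_versions(["transformers", "deepspeed"]))
```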

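@Neo-Zhangjiajie

I recently ran into this problem too while fine-tuning llama-7b-hf with peft lora. It turned out to be a library version issue: downgrading transformers to 4.28.0 and deepspeed to 0.8.3 fixed it.

Thanks! Your method worked for me!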

@chenhuixi-1995

Perhaps the batch size is set so large that it leads to "CUDA out of memory", but the program does not report an error. Try making the "train_micro_batch_size_per_gpu" parameter smaller. Here's what I tried:

train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 1
It failed, returning lr=0.

train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 4
It worked.
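
Both settings feed the same number of samples into each optimizer step; they differ only in how many samples go through a single forward/backward pass, which is what drives peak activation memory. A minimal sketch of that arithmetic; `effective_batch_size` is a hypothetical helper, and the single-GPU default is an assumption since the comment does not say how many GPUs were used.

```python
def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Samples consumed per optimizer step (num_gpus=1 is an assumption)."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


failed_setting = effective_batch_size(micro_batch_per_gpu=4, grad_accum_steps=1)
working_setting = effective_batch_size(micro_batch_per_gpu=1, grad_accum_steps=4)
print(failed_setting, working_setting)
```

Because the effective batch size is unchanged, swapping micro batch size for accumulation steps should not alter the optimization setup itself, only the memory needed per pass.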

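@Lui-16

Lui-16 commented Feb 25, 2024

I recently ran into this problem too while fine-tuning llama-7b-hf with peft lora. It turned out to be a library version issue: downgrading transformers to 4.28.0 and deepspeed to 0.8.3 fixed it.

Thanks. I downgraded transformers to 4.28.0 and kept deepspeed at 0.12.6, and that also solved the problem.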

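@ZyangLee

I ran into this problem while fine-tuning flant5-family models with deepspeed; the lr stayed at 0. Of the methods above, only downgrading Transformers worked, and deepspeed did not need to be downgraded: transformers==4.40 --> 4.28.1, deepspeed=0.9.3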

@shihanmax

Not sure whether this is a feature or a bug [/doge]

https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#+L912

def _get_learning_rate(self):
    if self.is_deepspeed_enabled:
        # with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may
        # not run for the first few dozen steps while loss scale is too large, and thus during
        # that time `get_last_lr` will fail if called during that warm up stage, so work around it:
        try:
            last_lr = self.lr_scheduler.get_last_lr()[0]
        except AssertionError as e:
            if "need to call step" in str(e):
                logger.warning("tried to get lr value before scheduler/optimizer started stepping, returning lr=0")
                last_lr = 0
            else:
                raise
    else:
        if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            last_lr = self.optimizer.param_groups[0]["lr"]
        else:
            last_lr = self.lr_scheduler.get_last_lr()[0]
        if torch.is_tensor(last_lr):
            last_lr = last_lr.item()
    return last_lr
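
The comment in this function says that, with fp16 and dynamic loss scaling, the optimizer and scheduler may not step while the loss scale is still too large. The toy simulation below illustrates that mechanism only; it is not DeepSpeed's implementation. The initial scale, the halve-on-overflow rule and `max_safe_scale` (a made-up threshold above which gradients are treated as overflowing) are all illustrative assumptions.

```python
def simulate_skipped_steps(initial_scale: float = 2.0 ** 16,
                           max_safe_scale: float = 2.0 ** 13,
                           total_steps: int = 6):
    """Toy model of dynamic loss scaling; returns (overflow, scheduler_steps) per step."""
    scale = initial_scale
    scheduler_steps = 0
    history = []
    for _ in range(total_steps):
        overflow = scale > max_safe_scale  # hypothetical overflow condition
        if overflow:
            scale /= 2            # back off the loss scale and skip this update
        else:
            scheduler_steps += 1  # optimizer and scheduler actually step
        history.append((overflow, scheduler_steps))
    return history


# While scheduler_steps is still 0, a call to get_last_lr() would hit the
# "need to call step" assertion and the trainer would log lr=0.
for overflow, steps in simulate_skipped_steps():
    print("overflow" if overflow else "stepped ", steps)
```

In this toy model the number of skipped steps depends only on how far the initial scale sits above the safe range, which would explain a short, fixed run of warnings at the start of fp16 training. It does not explain the bf16 reports in this thread.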

10 participants