[Bug]: llama model raises "Tensor need be reduced must not empty [Hint: Expected x.numel() > 0, but received x.numel():0 <= 0:0.]" when loss = 0 #8299
Comments
Thanks for the feedback. I looked into it, and the problem was introduced by this PR: 93e78c2#diff-99e104eff4c095428aa1cd5d186107ae22737297e8ec3b5c12cd138e69a79cb5 Please check whether the implementation below resolves your issue:
@w5688414 OK. From the look of it, as long as the dataset is processed correctly, this should guarantee it.
When running llama pretraining with pipeline parallel = 2 and sharding stage1, I hit this pit again. I traced it to the current loss function returning loss = float(0), which triggers the assert in paddle/distributed/fleet/meta_parallel/pipeline_parallel.py (log below). After applying the fix from #8459, the type check in PP is bypassed, but the program hangs at step = 81 and cannot proceed. My guess is that constructing a new tensor breaks the gradient chain, so some of the communication logic under the PP configuration no longer executes. [2024-05-15 16:26:28,733] [ INFO] - loss: 7.44834805, learning_rate: 2.4e-06, global_step: 79, current_memory_allocated: 42.891517996788025, current_memory_reserved: 0.0, max_memory_allocated: 82.25603437423706, max_memory_reserved: 0.0, interval_runtime: 29.755, interval_samples_per_second: 4.3018, interval_tokens_per_second_per_device: 2202.5182, interval_steps_per_second: 0.0336, progress_or_epoch: 0.0008
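To illustrate that suspicion, here is a toy example of my own (not the PaddleNLP code): a tensor created from scratch, e.g. with `paddle.to_tensor(0.0)`, is a fresh leaf with no connection to the autograd graph, so nothing can flow backward through it, which could plausibly stall the send/recv logic between pipeline stages:

```python
import paddle

x = paddle.randn([4], dtype="float32")
x.stop_gradient = False
y = (x * 2).sum()

# Still attached to the graph: backward populates x.grad (with zeros here).
connected_loss = y * 0.0
connected_loss.backward()
print(x.grad)  # Tensor of zeros - gradients still flow

# Detached: a brand-new tensor carries no grad history, so backward through
# it can never reach x - analogous to returning a freshly built loss in PP.
detached_loss = paddle.to_tensor(0.0)
print(detached_loss.stop_gradient)  # True: the graph is cut here
```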
Software environment
Duplicate issues
Error description
Stable reproduction steps & code
The error comes from these two lines:
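The two lines were shown as a screenshot in the original comment; the following is my paraphrased reconstruction of the pattern in PaddleNLP's llama pretraining criterion (not a verbatim quote):

```python
# Boolean indexing drops every element equal to 0; when ALL per-token losses
# are exactly 0, the result is an empty tensor.
masked_lm_loss = masked_lm_loss[masked_lm_loss > 0].astype("float32")
# Reducing an empty tensor raises:
# "Tensor need be reduced must not empty [Hint: Expected x.numel() > 0 ...]"
loss = paddle.mean(masked_lm_loss)
```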
Since `masked_lm_loss.numel() == 0`, applying `paddle.mean` to it raises the error above. The loss being 0 is likely because softmax produces a one-hot tensor: the value at the target label's position is 1 and every other position is 0, so the per-token cross-entropy is -log(1) = 0. This happens when the exponent fed to exp is small enough (below about -1000) that the result underflows to 0.
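A minimal sketch of one possible workaround, assuming the goal is to avoid reducing an empty tensor while keeping the loss attached to the autograd graph (my suggestion, not the fix that was actually merged):

```python
import paddle

def safe_masked_mean(masked_lm_loss):
    # Reduce through a multiplicative mask instead of boolean indexing, so
    # the reduced tensor is never empty and the result keeps its gradient
    # connection to the model outputs even when every element is 0.
    mask = (masked_lm_loss > 0).astype(masked_lm_loss.dtype)
    denom = paddle.clip(mask.sum(), min=1.0)  # guard against all-zero losses
    return (masked_lm_loss * mask).sum() / denom
```

With this, an all-zero loss yields a zero-valued tensor that still participates in backward, instead of an empty tensor or a graph-detached scalar.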
References: