
Error when using SyncBatchNorm during distributed model training #66162

Open
Linda-Deng opened this issue Jul 18, 2024 · 1 comment
@Linda-Deng

Describe the Bug

【Summary】

Distributed training fails with an error when the model uses SyncBatchNorm; the same code runs fine with regular BatchNorm.

【Code】

Launch script: python -m paddle.distributed.launch --selected_gpus=4,5,6 --log_dir=$log_dir train_large.py

Model definition in train_large.py:

model = TimeSeriesTransformer()
model = paddle.nn.SyncBatchNorm.convert_sync_batchnorm(model)
optim = paddle.optimizer.Adam(parameters=model.parameters())
model = fleet.distributed_model(model)
optim = fleet.distributed_optimizer(optim)
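For context on what the conversion step above does: convert_sync_batchnorm walks the model's layer tree and swaps every BatchNorm sublayer for a SyncBatchNorm layer that shares statistics across GPUs. A toy, framework-free sketch of that traversal (the classes below are hypothetical stand-ins, not Paddle's implementation):

```python
# Conceptual sketch: recursively replace BatchNorm layers in a layer tree.
# Layer/BatchNorm/SyncBatchNorm here are toy stand-ins for paddle.nn types.

class Layer:
    def __init__(self, **children):
        self._children = children  # name -> sub-layer

class BatchNorm(Layer):
    pass

class SyncBatchNorm(Layer):
    @classmethod
    def convert(cls, layer):
        # Replace this layer outright if it is a plain BatchNorm...
        if isinstance(layer, BatchNorm):
            return cls()
        # ...otherwise recurse into its children and rebind each slot.
        for name, child in layer._children.items():
            layer._children[name] = cls.convert(child)
        return layer

model = Layer(conv=Layer(), bn=BatchNorm(), head=Layer(bn=BatchNorm()))
model = SyncBatchNorm.convert(model)
print(type(model._children["bn"]).__name__)                     # SyncBatchNorm
print(type(model._children["head"]._children["bn"]).__name__)   # SyncBatchNorm
```

This is why commenting out the conversion line changes behavior: without it, no layer in the model performs cross-GPU communication during normalization.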
【Error message】

C++ Traceback (most recent call last):

0 paddle::imperative::BasicEngine::Execute()
1 paddle::imperative::PreparedOp::Run(paddle::imperative::NameVariableWrapperMap const&, paddle::imperative::NameVariableWrapperMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
2 std::_Function_handler, paddle::operators::SyncBatchNormGradKernel, paddle::operators::SyncBatchNormGradKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
3 paddle::operators::SyncBatchNormGradKernel::Compute(paddle::framework::ExecutionContext const&) const
4 void paddle::operators::SyncBatchNormGradFunctor(paddle::framework::ExecutionContext const&, paddle::experimental::DataLayout, phi::DenseTensor const*, phi::DenseTensor const*, phi::DenseTensor*, phi::DenseTensor const*, phi::DenseTensor*, phi::DenseTensor*, phi::DenseTensor const*, phi::DenseTensor const*, double)
5 phi::DenseTensor::mutable_data(phi::Place const&, paddle::experimental::DataType, unsigned long)
6 phi::DenseTensor::set_type(paddle::experimental::DataType)

【Debugging notes】

  1. All input data is float32.
  2. If the line model = paddle.nn.SyncBatchNorm.convert_sync_batchnorm(model) is commented out, training runs normally.
  3. Error location: the forward pass completes, and the error is raised at loss.backward(); it reproduces consistently.
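The observations above are consistent with the failure being in SyncBatchNorm's cross-GPU path: plain BatchNorm normalizes each GPU's local mini-batch independently, while SyncBatchNorm must first all-reduce per-GPU statistics so every worker normalizes with the combined global batch. A minimal NumPy sketch of that aggregation (illustrative only, not Paddle's CUDA kernel):

```python
import numpy as np

# Simulate 3 "GPUs", each holding a local mini-batch of shape (8, 4).
rng = np.random.default_rng(0)
per_gpu_batches = [rng.normal(loc=i, size=(8, 4)) for i in range(3)]

# Plain BatchNorm: each worker uses only its own local statistics.
local_means = [b.mean(axis=0) for b in per_gpu_batches]

# SyncBatchNorm: all-reduce count, sum, and sum of squares, then every
# worker derives the same global mean and (biased) variance from them.
count = sum(b.shape[0] for b in per_gpu_batches)
total = sum(b.sum(axis=0) for b in per_gpu_batches)
total_sq = sum((b ** 2).sum(axis=0) for b in per_gpu_batches)
global_mean = total / count
global_var = total_sq / count - global_mean ** 2

# Sanity check: the aggregated statistics equal those of the
# concatenated global batch.
all_data = np.concatenate(per_gpu_batches, axis=0)
assert np.allclose(global_mean, all_data.mean(axis=0))
assert np.allclose(global_var, all_data.var(axis=0))
```

Because only the synchronized variant touches collective communication and its dedicated backward kernel (SyncBatchNormGradKernel in the traceback), a crash that appears solely with convert_sync_batchnorm enabled points at that kernel rather than the model itself.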

Additional Supplementary Information

No response

@Linda-Deng
Author

The Paddle version used previously was 2.3.2; after upgrading to 2.6.1 the error message changed. The new error is:
File "/home/aiusers/dxl/test/depth_based_model_8.2.11.ts1/model_ts5_fix5.py", line 876, in trainTestProcess
    loss.backward()
  File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/wrapped_decorator.py", line 26, in impl
    return wrapped_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/framework.py", line 593, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)
