
Error when using SyncBatchNorm during distributed model training #66162

Open
Linda-Deng opened this issue Jul 18, 2024 · 1 comment
@Linda-Deng

Describe the Bug

【Summary】

Distributed training fails with an error when the model uses SyncBatchNorm; the same code runs fine with regular BatchNorm.

【Code】

Launch script: python -m paddle.distributed.launch --selected_gpus=4,5,6 --log_dir=$log_dir train_large.py

Model definition in train_large.py:

model = TimeSeriesTransformer()
model = paddle.nn.SyncBatchNorm.convert_sync_batchnorm(model)
optim = paddle.optimizer.Adam(parameters=model.parameters())
model = fleet.distributed_model(model)
optim = fleet.distributed_optimizer(optim)
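For context on what the conversion step above does: convert_sync_batchnorm walks the model's layer tree and swaps every BatchNorm sublayer for a SyncBatchNorm layer that shares statistics across GPUs. A toy, framework-free sketch of that traversal (the classes below are hypothetical stand-ins, not Paddle's implementation):

```python
# Conceptual sketch: recursively replace BatchNorm layers in a layer tree.
# Layer/BatchNorm/SyncBatchNorm here are toy stand-ins for paddle.nn types.

class Layer:
    def __init__(self, **children):
        self._children = children  # name -> sub-layer

class BatchNorm(Layer):
    pass

class SyncBatchNorm(Layer):
    @classmethod
    def convert(cls, layer):
        # Replace this layer outright if it is a plain BatchNorm...
        if isinstance(layer, BatchNorm):
            return cls()
        # ...otherwise recurse into its children and rebind each slot.
        for name, child in layer._children.items():
            layer._children[name] = cls.convert(child)
        return layer

model = Layer(conv=Layer(), bn=BatchNorm(), head=Layer(bn=BatchNorm()))
model = SyncBatchNorm.convert(model)
print(type(model._children["bn"]).__name__)                     # SyncBatchNorm
print(type(model._children["head"]._children["bn"]).__name__)   # SyncBatchNorm
```

This is why commenting out the conversion line changes behavior: without it, no layer in the model performs cross-GPU communication during normalization.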
【Error message】

C++ Traceback (most recent call last):

0 paddle::imperative::BasicEngine::Execute()
1 paddle::imperative::PreparedOp::Run(paddle::imperative::NameVariableWrapperMap const&, paddle::imperative::NameVariableWrapperMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
2 std::_Function_handler, paddle::operators::SyncBatchNormGradKernel, paddle::operators::SyncBatchNormGradKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
3 paddle::operators::SyncBatchNormGradKernel::Compute(paddle::framework::ExecutionContext const&) const
4 void paddle::operators::SyncBatchNormGradFunctor(paddle::framework::ExecutionContext const&, paddle::experimental::DataLayout, phi::DenseTensor const*, phi::DenseTensor const*, phi::DenseTensor*, phi::DenseTensor const*, phi::DenseTensor*, phi::DenseTensor*, phi::DenseTensor const*, phi::DenseTensor const*, double)
5 phi::DenseTensor::mutable_data(phi::Place const&, paddle::experimental::DataType, unsigned long)
6 phi::DenseTensor::set_type(paddle::experimental::DataType)

【Debugging notes】

  1. All input data is float32.
  2. If the line model = paddle.nn.SyncBatchNorm.convert_sync_batchnorm(model) is commented out, training runs normally.
  3. Error location: the forward pass completes, and the error is raised at loss.backward(); it reproduces consistently.
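The observations above are consistent with the failure being in SyncBatchNorm's cross-GPU path: plain BatchNorm normalizes each GPU's local mini-batch independently, while SyncBatchNorm must first all-reduce per-GPU statistics so every worker normalizes with the combined global batch. A minimal NumPy sketch of that aggregation (illustrative only, not Paddle's CUDA kernel):

```python
import numpy as np

# Simulate 3 "GPUs", each holding a local mini-batch of shape (8, 4).
rng = np.random.default_rng(0)
per_gpu_batches = [rng.normal(loc=i, size=(8, 4)) for i in range(3)]

# Plain BatchNorm: each worker uses only its own local statistics.
local_means = [b.mean(axis=0) for b in per_gpu_batches]

# SyncBatchNorm: all-reduce count, sum, and sum of squares, then every
# worker derives the same global mean and (biased) variance from them.
count = sum(b.shape[0] for b in per_gpu_batches)
total = sum(b.sum(axis=0) for b in per_gpu_batches)
total_sq = sum((b ** 2).sum(axis=0) for b in per_gpu_batches)
global_mean = total / count
global_var = total_sq / count - global_mean ** 2

# Sanity check: the aggregated statistics equal those of the
# concatenated global batch.
all_data = np.concatenate(per_gpu_batches, axis=0)
assert np.allclose(global_mean, all_data.mean(axis=0))
assert np.allclose(global_var, all_data.var(axis=0))
```

Because only the synchronized variant touches collective communication and its dedicated backward kernel (SyncBatchNormGradKernel in the traceback), a crash that appears solely with convert_sync_batchnorm enabled points at that kernel rather than the model itself.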

Additional Supplementary Information

No response

@Linda-Deng
Author

The Paddle version used previously was 2.3.2; after upgrading to 2.6.1 the error message changed. The new error is:
File "/home/aiusers/dxl/test/depth_based_model_8.2.11.ts1/model_ts5_fix5.py", line 876, in trainTestProcess
    loss.backward()
  File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/wrapped_decorator.py", line 26, in impl
    return wrapped_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/framework.py", line 593, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)
