The Paddle version previously in use was 2.3.2; after upgrading to 2.6.1, the error message changed.
The exact traceback is:
File "/home/aiusers/dxl/test/depth_based_model_8.2.11.ts1/model_ts5_fix5.py", line 876, in trainTestProcess
loss.backward()
File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/wrapped_decorator.py", line 26, in impl
return wrapped_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/framework.py", line 593, in impl
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/local/anaconda3/envs/py312/lib/python3.12/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
core.eager.run_backward([self], grad_tensor, retain_graph)
ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
[Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)
Describe the Bug
[Problem summary]
The error is raised when SyncBatchNorm is used in distributed training; with regular BatchNorm there is no problem.
[Code]
Model launch script: python -m paddle.distributed.launch --selected_gpus=4,5,6 --log_dir=$log_dir train_large.py
Model definition in train_large.py:
model = TimeSeriesTransformer()
model = paddle.nn.SyncBatchNorm.convert_sync_batchnorm(model)
optim = paddle.optimizer.Adam(parameters=model.parameters())
model = fleet.distributed_model(model)
optim = fleet.distributed_optimizer(optim)
[Error message]
C++ Traceback (most recent call last):
0 paddle::imperative::BasicEngine::Execute()
1 paddle::imperative::PreparedOp::Run(paddle::imperative::NameVariableWrapperMap const&, paddle::imperative::NameVariableWrapperMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
2 std::_Function_handler, paddle::operators::SyncBatchNormGradKernel, paddle::operators::SyncBatchNormGradKernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
3 paddle::operators::SyncBatchNormGradKernel::Compute(paddle::framework::ExecutionContext const&) const
4 void paddle::operators::SyncBatchNormGradFunctor(paddle::framework::ExecutionContext const&, paddle::experimental::DataLayout, phi::DenseTensor const*, phi::DenseTensor const*, phi::DenseTensor*, phi::DenseTensor const*, phi::DenseTensor*, phi::DenseTensor*, phi::DenseTensor const*, phi::DenseTensor const*, double)
5 phi::DenseTensor::mutable_data(phi::Place const&, paddle::experimental::DataType, unsigned long)
6 phi::DenseTensor::set_type(paddle::experimental::DataType)
[Debugging attempts so far]
Additional Supplementary Information
No response