
Load model bug in ParallelExecutor. #9830

@qingqing01

Loading a saved model for ParallelExecutor training fails with the following error:

  File "train.py", line 229, in train_parallel_exe
    train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_cost.name, num_threads=cards_num)
  File "/home/users/dangqingqing/.jumbo/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 120, in __init__
    allow_op_delay)
paddle.fluid.core.EnforceNotMet: Not supported at [/home/users/dangqingqing/Paddle/paddle/fluid/platform/nccl_helper.h:36]
PaddlePaddle Call Stacks:
0       0x7f98138e3e5ep paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 558
1       0x7f98139b253dp paddle::platform::ToNCCLDataType(std::type_index) + 253
2       0x7f98139af29dp paddle::framework::ParallelExecutor::BCastParamsToGPUs(std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) const + 1677

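The NCCL helper in Fluid does not currently support the int64 data type, so ToNCCLDataType throws "Not supported" when BCastParamsToGPUs tries to broadcast such a variable. But we do have persistable variables with int64 data, for example: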

  vars {
    name: "@LR_DECAY_COUNTER@"
    type {
      type: LOD_TENSOR
      lod_tensor {
        tensor {
          data_type: INT64
          dims: 1
        }
        lod_level: 0
      }
    }
    persistable: true
  }
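
To make the failure condition concrete, here is a minimal sketch in plain Python that flags which persistable variables would trip the broadcast. It does not call any Paddle API: the dict layout, the `find_unbroadcastable` helper, the `fc_0.w_0` and `tmp_0` entries, and the exact set of supported dtypes are all assumptions for illustration. The only thing taken from this report is that INT64 is not supported and that `@LR_DECAY_COUNTER@` is a persistable INT64 variable.

```python
# Hypothetical model of the failure; none of these names are Paddle APIs.
# Assumption: the dtypes that have an NCCL mapping. The report only confirms
# that INT64 is missing from this set.
SUPPORTED_NCCL_DTYPES = {"FP32", "FP64", "INT32"}


def find_unbroadcastable(var_descs, supported=SUPPORTED_NCCL_DTYPES):
    """Return names of persistable variables whose dtype has no NCCL mapping."""
    return [
        v["name"]
        for v in var_descs
        if v["persistable"] and v["data_type"] not in supported
    ]


# Simplified stand-ins for the variable descriptions in the program dump.
var_descs = [
    # Hypothetical weight: persistable, supported dtype.
    {"name": "fc_0.w_0", "data_type": "FP32", "persistable": True},
    # From the report: persistable INT64 counter.
    {"name": "@LR_DECAY_COUNTER@", "data_type": "INT64", "persistable": True},
    # Hypothetical temporary: INT64 but not persistable, so never broadcast.
    {"name": "tmp_0", "data_type": "INT64", "persistable": False},
]

print(find_unbroadcastable(var_descs))
```

Under these assumptions, only the learning-rate decay counter is reported, which matches the variable shown in the dump above. A fix would presumably either add an int64 mapping to the NCCL helper or skip such variables during the broadcast.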
