benchmark/fluid/resnet.py fails during 8-GPU training with batch_size set to 512 #10857

@kolinwei

Description

Machine: P40 GPUs.
The commands are as follows:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python resnet.py --batch_size=512 --iterations=80
After training for about 11 batches, the following error is raised:
Pass: 0, Iter: 12, Loss: 4.753108, Accuracy: 0.050781
Traceback (most recent call last):
  File "resnet.py", line 317, in <module>
    run_benchmark(model_map[args.model], args)
  File "resnet.py", line 272, in run_benchmark
    avg_cost.name, batch_acc.name, batch_size_tensor.name
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 213, in run
    feed_tensor_dict)
paddle.fluid.core.EnforceNotMet: enforce member_->places_.size() == lod_tensors.size() failed, 8 != 5
The number of samples of current batch is less than the count of devices, currently, it is not allowed. (8 vs 5) at [/paddle/paddle/fluid/framework/parallel_executor.cc:233]
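
The message says the failing batch held 5 samples while 8 devices were in use, which is what happens when the dataset size is not a multiple of batch_size and the leftover final batch is smaller than the device count. Below is a minimal sketch in plain Python of that arithmetic and of one possible workaround, skipping any batch too small to give every device a sample. The helper names (batch_reader, drop_small_batches) and the dataset size of 6149 are hypothetical; the size was chosen only because it leaves 5 samples after 12 full batches of 512. Nothing here is taken from resnet.py or from the Paddle API.

```python
def batch_reader(samples, batch_size):
    """Yield consecutive batches; the last one may be shorter."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def drop_small_batches(batches, num_devices):
    """Skip any batch that cannot give every device at least one sample."""
    for batch in batches:
        if len(batch) >= num_devices:
            yield batch


# Hypothetical sizes (assumption): picked so the final batch has 5 samples.
NUM_SAMPLES = 6149
BATCH_SIZE = 512
NUM_DEVICES = 8

samples = list(range(NUM_SAMPLES))

# Batch sizes as produced by a plain reader: 12 full batches plus a leftover.
all_sizes = [len(b) for b in batch_reader(samples, BATCH_SIZE)]

# Batch sizes after filtering out batches smaller than the device count.
kept_sizes = [
    len(b)
    for b in drop_small_batches(batch_reader(samples, BATCH_SIZE), NUM_DEVICES)
]

print("last unfiltered batch:", all_sizes[-1])
print("batches kept:", len(kept_sizes))
```

Whether dropping the short batch is acceptable depends on the benchmark; padding the last batch or choosing a batch size that divides the dataset evenly are other options.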
