Cannot get psKey when training in fault-tolerant mode #2969

@Yancey0623

I started the master, pserver, and trainer in a Docker container, but the trainer cannot get the pserver address from etcd. The error log is below:

['/work/data/uci_housing_train-*-of-*']
ERRO[0000] Get task failed, sleep 3 seconds and continue, no more available task
I0719 08:12:24.602708   824 Util.cpp:166] commandline:
I0719 08:12:24.607319   824 GradientMachine.cpp:85] Initing parameters..
I0719 08:12:24.607365   824 GradientMachine.cpp:92] Init parameters done.
INFO[0000] Connected to etcd: localhost:2379

I0719 08:12:24.962303   824 NewRemoteParameterUpdater.cpp:68] paddle_begin_init_params start
I0719 08:12:24.962774   824 NewRemoteParameterUpdater.cpp:71] old param config: name: "___fc_layer_0__.w0"
size: 13
initial_mean: 0
initial_std: 0.27735009811261457
dims: 13
dims: 1
initial_strategy: 0
initial_smart: true
para_id: 0
INFO[0000] Get psKey= /ps/0 error, context canceled

ERRO[0003] Get task failed, sleep 3 seconds and continue, no more available task
ERRO[0006] Get task failed, sleep 3 seconds and continue, no more available task
ERRO[0009] Get task failed, sleep 3 seconds and continue, no more available task
INFO[0010] Get psKey= /ps/0 error, context canceled
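
To narrow this down, it may help to check whether anything is actually registered under `/ps/0` at the time the trainer asks for it. Below is a minimal sketch of that kind of lookup with a deadline. It is not Paddle's client code: `wait_for_pserver` is a hypothetical helper, and `store` is any dict-like stand-in for etcd (an assumption made to keep the example self-contained). The only detail taken from the log is the `/ps/<index>` key layout.

```python
import time


def wait_for_pserver(store, index, timeout_s=10.0, interval_s=1.0):
    """Poll a dict-like store for /ps/<index> until it appears or time runs out.

    `store` stands in for etcd here (assumption); the key layout mirrors
    the `/ps/0` key seen in the trainer log.
    """
    key = f"/ps/{index}"
    deadline = time.monotonic() + timeout_s
    while True:
        addr = store.get(key)
        if addr:
            # A pserver has registered its address under this key.
            return addr
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no pserver registered under {key}")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical address, for illustration only.
    fake_etcd = {"/ps/0": "127.0.0.1:8000"}
    print(wait_for_pserver(fake_etcd, 0))
```

If the key is present in etcd but the trainer still reports `context canceled`, the lookup itself is probably being aborted on the client side rather than the pserver failing to register.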
