Description
- This issue records my work debugging the training of a large CTR model with distributed sparse remote parameter updating.
Background
In CTR model training, the LR part of the model can use a very large feature space, so the model may be too large to store on a single trainer even in the "sparse row format". This part of the model therefore has to be stored evenly across the pservers, and each trainer fetches only the rows it needs during prefetch.
Refer to here for some details. This feature should be re-written in the refactored code.
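To make the idea concrete, here is a minimal sketch of how a trainer could group the rows of one mini-batch by the pserver that owns them. It assumes a plain round-robin placement (row id modulo the number of pservers); `shard_of` and `plan_prefetch` are hypothetical names for illustration, and the real placement in Paddle may differ.

```python
from collections import defaultdict


def shard_of(row_id, num_pservers):
    # Assumption: rows are placed round-robin by row id.
    return row_id % num_pservers


def plan_prefetch(batch_row_ids, num_pservers):
    """Group the distinct rows a mini-batch touches by owning pserver."""
    plan = defaultdict(list)
    for row_id in sorted(set(batch_row_ids)):
        plan[shard_of(row_id, num_pservers)].append(row_id)
    return dict(plan)


if __name__ == "__main__":
    # A batch touching 5 distinct rows of a table sharded over 10 pservers.
    for server, rows in sorted(plan_prefetch([8, 18, 2, 94, 8, 12], 10).items()):
        print("pserver", server, "rows", rows)
```

Only the pservers that appear in the plan need to be contacted, and each returns only the listed rows, so the trainer never holds the full table.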
Records
Using the V1 CTR model config (wide part):
def widectr_net():
    signs = data_layer("feasigns", int(1e2))
    lr = fc_layer(input=signs, size=128, act=SigmoidActivation(), param_attr=ParamAttr(sparse_update=True))
    return lr
Start 10 pservers and 20 trainers, trainer command args:
/usr/local/bin/paddle_trainer --port=7164 --nics=eth0 --ports_num=1 --ports_num_for_sparse=1 --num_passes=1 --trainer_count=1 --saving_period=1 --log_period=20 --local=0 --rdma_tcp=tcp --config=train.py --use_gpu=0 --trainer_id=8 --save_dir= --pservers=...... --num_gradient_servers=20 --loadsave_parameters_in_pserver=1 --use_old_updater=1 -v 100
The trainer then gets stuck when calling "add gradient", although the prefetch is OK. After that the trainer fails with "timeout". Some logs:
Tips: updatemode: 3 (PSERVER_UPDATE_MODE_ADD_GRADIENT), 6 (PSERVER_UPDATE_MODE_GET_PARAM_SPARSE)
I1025 01:58:37.992717 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992750 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992755 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992758 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992763 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992766 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992769 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992772 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992776 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992780 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992873 71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 8
I1025 01:58:37.992893 71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 18
I1025 01:58:37.992897 71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 28
I1025 01:58:37.992899 71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 38
I1025 01:58:37.992902 71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 48
...
I1025 01:58:37.993465 77 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 6 tid 6 blockId 84
I1025 01:58:37.993469 77 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 6 tid 6 blockId 94
I1025 01:58:37.993535 71 ParameterClient2.cpp:166] sendParallel, tid: 0 numMyClients 1 numThreads 10
I1025 01:58:37.993538 72 ParameterClient2.cpp:166] sendParallel, tid: 1 numMyClients 1 numThreads 10
I1025 01:58:37.993541 74 ParameterClient2.cpp:166] sendParallel, tid: 3 numMyClients 1 numThreads 10
I1025 01:58:37.993547 71 ParameterClient2.cpp:174] #### before recv, i: 8
I1025 01:58:37.993548 72 ParameterClient2.cpp:174] #### before recv, i: 9
I1025 01:58:37.993541 78 ParameterClient2.cpp:166] sendParallel, tid: 7 numMyClients 1 numThreads 10
I1025 01:58:37.993576 77 ParameterClient2.cpp:166] sendParallel, tid: 6 numMyClients 1 numThreads 10
I1025 01:58:37.993587 79 ParameterClient2.cpp:166] sendParallel, tid: 8 numMyClients 1 numThreads 10
I1025 01:58:37.993553 74 ParameterClient2.cpp:174] #### before recv, i: 1
I1025 01:58:37.993597 78 ParameterClient2.cpp:174] #### before recv, i: 5
I1025 01:58:37.993599 79 ParameterClient2.cpp:174] #### before recv, i: 6
I1025 01:58:37.993558 73 ParameterClient2.cpp:166] sendParallel, tid: 2 numMyClients 1 numThreads 10
I1025 01:58:37.993538 75 ParameterClient2.cpp:166] sendParallel, tid: 4 numMyClients 1 numThreads 10
I1025 01:58:37.993597 77 ParameterClient2.cpp:174] #### before recv, i: 4
I1025 01:58:37.993616 75 ParameterClient2.cpp:174] #### before recv, i: 2
I1025 01:58:37.993569 80 ParameterClient2.cpp:166] sendParallel, tid: 9 numMyClients 1 numThreads 10
I1025 01:58:37.993613 73 ParameterClient2.cpp:174] #### before recv, i: 0
I1025 01:58:37.993558 76 ParameterClient2.cpp:166] sendParallel, tid: 5 numMyClients 1 numThreads 10
I1025 01:58:37.993628 80 ParameterClient2.cpp:174] #### before recv, i: 7
I1025 01:58:37.993633 76 ParameterClient2.cpp:174] #### before recv, i: 3
I1025 01:58:38.435159 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435159 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435195 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435205 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435209 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435211 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435214 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435215 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435217 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435220 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435222 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435225 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435227 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435232 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435237 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435241 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435241 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435247 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435252 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435256 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435319 75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 2
I1025 01:58:38.435331 75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 12
I1025 01:58:38.435336 75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 22
I1025 01:58:38.435340 75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 32
I1025 01:58:38.435344 75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 42
I1025 01:58:38.435348 75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 52
...
I1025 01:58:38.437079 74 ParameterClient2.cpp:166] sendParallel, tid: 3 numMyClients 1 numThreads 10
I1025 01:58:38.437093 74 ParameterClient2.cpp:174] #### before recv, i: 1
I1025 01:58:38.437126 77 ParameterClient2.cpp:166] sendParallel, tid: 6 numMyClients 1 numThreads 10
I1025 01:58:38.437077 75 ParameterClient2.cpp:166] sendParallel, tid: 4 numMyClients 1 numThreads 10
I1025 01:58:38.437081 76 ParameterClient2.cpp:166] sendParallel, tid: 5 numMyClients 1 numThreads 10
I1025 01:58:38.437134 77 ParameterClient2.cpp:174] #### before recv, i: 4
I1025 01:58:38.437167 72 ParameterClient2.cpp:166] sendParallel, tid: 1 numMyClients 1 numThreads 10
I1025 01:58:38.437170 78 ParameterClient2.cpp:166] sendParallel, tid: 7 numMyClients 1 numThreads 10
I1025 01:58:38.437081 80 ParameterClient2.cpp:166] sendParallel, tid: 9 numMyClients 1 numThreads 10
I1025 01:58:38.437150 75 ParameterClient2.cpp:174] #### before recv, i: 2
I1025 01:58:38.437180 80 ParameterClient2.cpp:174] #### before recv, i: 7
I1025 01:58:38.437172 72 ParameterClient2.cpp:174] #### before recv, i: 9
I1025 01:58:38.437209 71 ParameterClient2.cpp:166] sendParallel, tid: 0 numMyClients 1 numThreads 10
I1025 01:58:38.437213 71 ParameterClient2.cpp:174] #### before recv, i: 8
I1025 01:58:38.437163 76 ParameterClient2.cpp:174] #### before recv, i: 3
I1025 01:58:38.437249 73 ParameterClient2.cpp:166] sendParallel, tid: 2 numMyClients 1 numThreads 10
I1025 01:58:38.437255 73 ParameterClient2.cpp:174] #### before recv, i: 0
I1025 01:58:38.437178 78 ParameterClient2.cpp:174] #### before recv, i: 5
I1025 01:58:38.437134 79 ParameterClient2.cpp:166] sendParallel, tid: 8 numMyClients 1 numThreads 10
I1025 01:58:38.437306 79 ParameterClient2.cpp:174] #### before recv, i: 6
I1025 01:58:38.636719 87 ParameterClient2.cpp:166] sendParallel, tid: 6 numMyClients 1 numThreads 10
I1025 01:58:38.636740 87 ParameterClient2.cpp:174] #### before recv, i: 4
I1025 01:58:38.644503 89 ParameterClient2.cpp:166] sendParallel, tid: 8 numMyClients 1 numThreads 10
I1025 01:58:38.644520 89 ParameterClient2.cpp:174] #### before recv, i: 6
I1025 01:58:38.649602 90 ParameterClient2.cpp:166] sendParallel, tid: 9 numMyClients 1 numThreads 10
I1025 01:58:38.649615 90 ParameterClient2.cpp:174] #### before recv, i: 7
I1025 01:58:38.650900 85 ParameterClient2.cpp:166] sendParallel, tid: 4 numMyClients 1 numThreads 10
I1025 01:58:38.650910 85 ParameterClient2.cpp:174] #### before recv, i: 2
I1025 01:58:38.659765 83 ParameterClient2.cpp:166] sendParallel, tid: 2 numMyClients 1 numThreads 10
I1025 01:58:38.659776 83 ParameterClient2.cpp:174] #### before recv, i: 0
I1025 01:58:38.669888 88 ParameterClient2.cpp:166] sendParallel, tid: 7 numMyClients 1 numThreads 10
I1025 01:58:38.669898 88 ParameterClient2.cpp:174] #### before recv, i: 5
I1025 01:58:38.703678 82 ParameterClient2.cpp:166] sendParallel, tid: 1 numMyClients 1 numThreads 10
I1025 01:58:38.703691 82 ParameterClient2.cpp:174] #### before recv, i: 9
I1025 01:58:38.715457 84 ParameterClient2.cpp:166] sendParallel, tid: 3 numMyClients 1 numThreads 10
I1025 01:58:38.715476 84 ParameterClient2.cpp:174] #### before recv, i: 1
I1025 01:58:38.758709 81 ParameterClient2.cpp:166] sendParallel, tid: 0 numMyClients 1 numThreads 10
I1025 01:58:38.758720 81 ParameterClient2.cpp:174] #### before recv, i: 8
I1025 01:58:38.780829 86 ParameterClient2.cpp:166] sendParallel, tid: 5 numMyClients 1 numThreads 10
I1025 01:58:38.780840 86 ParameterClient2.cpp:174] #### before recv, i: 3
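A trainer log like the one above is easier to read after a small summary pass. The sketch below counts requests per update mode and pairs each "sendParallel, tid" line with the "before recv, i" line from the same glog thread. The regular expressions are my own and only match the message shapes shown in this issue; `summarize` is a hypothetical helper, not part of Paddle.

```python
import re
from collections import Counter

# Mode names taken from the tip above the log.
UPDATE_MODES = {
    3: "PSERVER_UPDATE_MODE_ADD_GRADIENT",
    6: "PSERVER_UPDATE_MODE_GET_PARAM_SPARSE",
}

# glog prefix: severity+date, time, thread id, "file:line]", then the message.
GLOG_LINE = re.compile(r"^[IWEF]\d{4} \S+\s+(?P<thread>\d+) \S+\] (?P<msg>.*)$")


def summarize(lines):
    """Return (requests per update mode, {tid: i}) for a trainer log."""
    mode_counts = Counter()
    pending_tid = {}  # glog thread id -> tid from its last sendParallel line
    tid_to_i = {}
    for line in lines:
        match = GLOG_LINE.match(line.strip())
        if not match:
            continue
        thread, msg = match.group("thread"), match.group("msg")
        mode = re.search(r"update_mode(\d+)", msg)
        send = re.match(r"sendParallel, tid: (\d+)", msg)
        recv = re.match(r"#### before recv, i: (\d+)", msg)
        if mode:
            mode_counts[int(mode.group(1))] += 1
        elif send:
            pending_tid[thread] = int(send.group(1))
        elif recv and thread in pending_tid:
            tid_to_i[pending_tid.pop(thread)] = int(recv.group(1))
    return mode_counts, tid_to_i


# A few lines copied from the log above.
SAMPLE = [
    "I1025 01:58:37.992717 58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3",
    "I1025 01:58:37.993535 71 ParameterClient2.cpp:166] sendParallel, tid: 0 numMyClients 1 numThreads 10",
    "I1025 01:58:37.993538 72 ParameterClient2.cpp:166] sendParallel, tid: 1 numMyClients 1 numThreads 10",
    "I1025 01:58:37.993547 71 ParameterClient2.cpp:174] #### before recv, i: 8",
    "I1025 01:58:37.993548 72 ParameterClient2.cpp:174] #### before recv, i: 9",
    "I1025 01:58:38.435159 57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3",
]

if __name__ == "__main__":
    counts, mapping = summarize(SAMPLE)
    for mode, n in sorted(counts.items()):
        print(UPDATE_MODES.get(mode, "unknown"), n)
    print(mapping)
```

In the excerpt above, every (tid, i) pair satisfies i = (tid + 8) % 10, where 8 is this trainer's trainer_id and 10 is numThreads. That is only an observation from these logs, not a statement about how the client computes i.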
Some of the pservers fail at:
I1025 01:58:17.467772 82 ParameterServer2.cpp:564] pserver: getParameter
I1025 01:58:18.902704 83 LightNetwork.cpp:326] worker started, peer = 192.168.27.222
I1025 01:58:18.926435 84 LightNetwork.cpp:326] worker started, peer = 192.168.27.222
I1025 01:58:20.928249 84 ParameterServer2.cpp:564] pserver: getParameter
I1025 01:58:35.682245 85 LightNetwork.cpp:326] worker started, peer = 192.168.139.150
I1025 01:58:35.705991 86 LightNetwork.cpp:326] worker started, peer = 192.168.139.150
I1025 01:58:37.707690 86 ParameterServer2.cpp:564] pserver: getParameter
F1025 01:58:52.261445 48 SocketChannel.cpp:101] Check failed: len > 0 peer=192.168.24.151 curIov=22 iovCnt=89 iovs[curIov].base=0x7fa1b09d75ca iovs[curIov].iov_len=10870
*** Check failure stack trace: ***
@ 0xa5904d google::LogMessage::Fail()
@ 0xa5b398 google::LogMessage::SendToLog()
@ 0xa58b5b google::LogMessage::Flush()
@ 0xa5c26e google::LogMessageFatal::~LogMessageFatal()
@ 0x884a04 paddle::SocketChannel::writev()
@ 0x885b98 paddle::SocketChannel::writeMessage()
@ 0x8794cc _ZZZN6paddle11ProtoServer25registerServiceFunctionExINS_15SendDataRequestEEEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvRKT_St10unique_ptrINS_9MsgReaderESt14default_deleteISG_EESB_IFvRKN6google8protobuf11MessageLiteERKSt6vectorI5iovecSaISQ_EEEEEEENKUlSJ_SB_IFvSU_EEE_clESJ_S10_ENKUlSO_SU_E_clESO_SU_
@ 0x86e4ba paddle::ParameterServer2::sendParameter()
@ 0x876c5a std::_Function_handler<>::_M_invoke()
@ 0x87a3de _ZNSt17_Function_handlerIFvSt10unique_ptrIN6paddle9MsgReaderESt14default_deleteIS2_EESt8functionIFvRKSt6vectorI5iovecSaIS8_EEEEEZNS1_11ProtoServer25registerServiceFunctionExINS1_20SendParameterRequestEEEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IFvRKT_S5_S6_IFvRKN6google8protobuf11MessageLiteESC_EEEEEUlS5_SE_E_E9_M_invokeERKSt9_Any_dataOS5_OSE_
@ 0x88648a paddle::ProtoServer::handleRequest()
@ 0x88412f paddle::SocketWorker::run()
@ 0x7fa235f1ac80 (unknown)
@ 0x7fa2363ef6ba start_thread
@ 0x7fa2356803dd clone
@ (nil) (unknown)
/usr/local/bin/paddle: line 96: 27 Aborted ${DEBUGGER} $PADDLE_BIN_PATH/paddle_pserver_main ${@:2}
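The fatal check fires inside paddle::SocketChannel::writev(), called from sendParameter(), and the values curIov=22 iovCnt=89 suggest the write stopped partway through the buffer list. One plausible reading, which I have not verified against the source, is that the write call returned a non-positive length, for example because the peer had already closed the connection. The sketch below is a generic scatter-gather write loop in Python, not Paddle's code; `write_all` is a hypothetical helper that shows where such a check would sit and how a partial write is resumed.

```python
import os
import socket


def write_all(fd, buffers):
    """Write every buffer to fd, resuming after partial writes."""
    # Drop empty buffers so a zero-length write cannot be mistaken for a failure.
    bufs = [memoryview(b) for b in buffers if len(b) > 0]
    cur = 0  # index of the first buffer that is not fully written yet
    while cur < len(bufs):
        # Python raises OSError on errors, so only a zero return needs a check.
        n = os.writev(fd, bufs[cur:cur + 1024])  # stay below typical IOV_MAX
        if n <= 0:
            raise ConnectionError(
                "write returned %d at buffer %d of %d" % (n, cur, len(bufs)))
        # Skip the buffers that were written completely.
        while cur < len(bufs) and n >= len(bufs[cur]):
            n -= len(bufs[cur])
            cur += 1
        # Trim the buffer that was written only in part.
        if n > 0:
            bufs[cur] = bufs[cur][n:]


if __name__ == "__main__":
    left, right = socket.socketpair()
    write_all(left.fileno(), [b"header", b"", b"payload"])
    left.close()
    print(right.recv(64))
    right.close()
```

Raising an error instead of aborting the process lets the caller drop only the broken connection, which may be worth considering for the pserver when this feature is rewritten.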