paddle large model: paddle::GradientMachine::randParameters() Segmentation fault #3199

Closed
@yinyunfeng

The error info:

Thu Aug 3 00:42:46 2017[1,2]:+ ./paddle_trainer --num_gradient_servers=6 --trainer_id=2 --pservers=10.73.218.49,10.73.218.48,10.73.218.45,10.73.226.18,10.73.226.16,10.73.226.21 --rdma_tcp=tcp --nics=xgbe0 --port=7164 --ports_num=1 --dot_period=100 --use_old_updater=1 --test_all_data_in_one_period=true --pserver_num_threads=5 --loadsave_parameters_in_pserver=1 --log_period=100 --trainer_count=16 --ports_num_for_sparse=1 --num_passes=50 --saving_period=1 --local=0 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0
Thu Aug 3 00:43:28 2017[1,2]:*** Aborted at 1501692208 (unix time) try "date -d @1501692208" if you are using GNU date ***
Thu Aug 3 00:43:28 2017[1,2]:PC: @ 0x6ec623 paddle::GradientMachine::randParameters()
Thu Aug 3 00:43:28 2017[1,2]:*** SIGSEGV (@0x30) received by PID 3852 (TID 0x7f1f79428780) from PID 48; stack trace: ***
Thu Aug 3 00:43:28 2017[1,2]: @ 0x7f1f79002160 (unknown)
Thu Aug 3 00:43:28 2017[1,2]: @ 0x6ec623 paddle::GradientMachine::randParameters()
Thu Aug 3 00:43:28 2017[1,2]: @ 0x72a73c paddle::Trainer::init()
Thu Aug 3 00:43:28 2017[1,2]: @ 0x575ed4 main
Thu Aug 3 00:43:28 2017[1,2]: @ 0x7f1f77c0fbd5 __libc_start_main
Thu Aug 3 00:43:28 2017[1,2]: @ 0x584cc1 (unknown)
Thu Aug 3 00:43:32 2017[1,2]:./train.sh: line 207: 3852 Segmentation fault PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Thu Aug 3 00:43:32 2017[1,2]:+ '[' 139 -ne 0 ']'
Thu Aug 3 00:43:32 2017[1,2]:+ kill_pserver2_exit
Thu Aug 3 00:43:32 2017[1,2]:+ ps aux
Thu Aug 3 00:43:32 2017[1,2]:+ grep paddle_pserver2
Thu Aug 3 00:43:32 2017[1,2]:+ grep paddle_cluster_job
Thu Aug 3 00:43:32 2017[1,2]:+ grep -v grep
Thu Aug 3 00:43:32 2017[1,2]:+ cut -c10-14
Thu Aug 3 00:43:32 2017[1,2]:+ xargs kill -9
Thu Aug 3 00:43:32 2017[1,2]:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Thu Aug 3 00:43:32 2017[1,2]:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Thu Aug 3 00:43:32 2017[1,2]:[./common.sh : 399] [kill_pserver2_exit]
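
A side note for reading the log above: the `139` tested by `train.sh` is the shell's encoding of "killed by a signal" (128 + signal number), and 139 - 128 = 11 is SIGSEGV on Linux, which matches the `Segmentation fault` line. The glog abort time `1501692208` can be converted the same way the log suggests with `date -d`. Below is a minimal sketch of both conversions. The helper names are hypothetical, and the UTC+8 offset is an assumption about the cluster's local time zone, chosen because it reproduces the wall-clock stamps in the log.

```python
import signal
from datetime import datetime, timedelta, timezone


def decode_exit_status(status: int) -> str:
    """Describe a shell exit status; values above 128 mean death by signal."""
    if status > 128:
        signum = status - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"signal {signum}"
    return f"exit code {status}"


def abort_time(unix_time: int, utc_offset_hours: int = 8) -> str:
    """Format a unix timestamp in a fixed UTC offset (UTC+8 is an assumption)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(unix_time, tz).strftime("%Y-%m-%d %H:%M:%S")


print(decode_exit_status(139))   # SIGSEGV on Linux
print(abort_time(1501692208))    # 2017-08-03 00:43:28
```

The faulting address in `SIGSEGV (@0x30)` is very small, which is often what a member access through a null pointer looks like, but the trace alone does not confirm which object in `randParameters()` was null.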

The job link is:
http://10.73.218.49:8920/fileview.html?path=/home/normandy/maybach/280706/workspace/log
