test_parallel_executor.py unit test failure #10880

@guochaorong

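On our in-house CI machine (a 4-GPU machine), test_parallel_executor.py passes when run on a single GPU. With 2, 3, or 4 GPUs, test_parallel_executor.py fails every time with the error below: a segmentation fault in ncclCommInitAll, reached while constructing ParallelExecutor.
CI test log: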
https://paddleci.ngrok.io/viewLog.html?tab=buildLog&buildTypeId=GuochaorongPaddleTest_PrCi&buildId=36743

λ 2ddd25d22150 /paddle/python/paddle/fluid/tests/unittests {test_pr} python test_parallel_executor.py
..[155.00066 137.33183]
[93.353775 75.808044]
[81.401886 81.83884 ]
[71.54146 96.385376]
[95.43759 86.08732]
[79.38394 84.6459 ]
[ 73.67308 103.09319]
[84.86516 73.02386]
[80.49448 93.33059]
[66.14317 69.48563]
.[162.05154 108.96857]
[108.84401 105.92093]
[ 73.063446 110.11499 ]
[ 87.91005 105.59059]
[66.78179 81.02496]
[75.85283 70.4716 ]
[76.239716 69.979866]
[82.589294 64.007484]
[88.13671 70.73369]
[71.94618 93.646774]
.[126.612076 120.40341 ]
[106.69983 84.25634]
[91.51135 93.035736]
[99.53558 84.54377]
[ 97.76779 101.992836]
[93.61027 68.643875]
[96.5348 79.67081]
[68.6813 78.9785]
[86.46138 73.84549]
[77.626785 69.92791 ]
.*** Aborted at 1527077531 (unix time) try "date -d @1527077531" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x30) received by PID 26261 (TID 0x7f651ae33700) from PID 48; stack trace: ***
@ 0x7f651aa0e390 (unknown)
@ 0x7f64900c1078 (unknown)
@ 0x7f64900c5405 ncclCommInitAll
@ 0x7f64e8e8049a paddle::platform::NCCLContextMap::NCCLContextMap()
@ 0x7f64e8e7d4b7 paddle::framework::ParallelExecutor::ParallelExecutor()
@ 0x7f64e8e06e4f ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4hashISQ_ESt8equal_toISQ_ESaISQ_EESY_RKNS7_9framework11ProgramDescERKSQ_PNSZ_5ScopeERS4_IS16_SaIS16_EERKNSZ_7details17ExecutionStrategyERKNS1A_13BuildStrategyEmmEE7executeINS_6class_INSZ_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1K_SJ_SY_SY_S12_S14_S16_S19_S1D_S1G_mmE_vIS1S_SJ_SY_SY_S12_S14_S16_S19_S1D_S1G_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOS1M_PFT0_DpT1_EDpRKT2_ENKUlRNS2_13function_callEE1_clES28
@ 0x7f64e8e070fe ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIJRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4hashISQ_ESt8equal_toISQ_ESaISQ_EESY_RKNS7_9framework11ProgramDescERKSQ_PNSZ_5ScopeERS4_IS16_SaIS16_EERKNSZ_7details17ExecutionStrategyERKNS1A_13BuildStrategyEmmEE7executeINS_6class_INSZ_16ParallelExecutorEJEEEJELi0EEEvRT_DpRKT0_EUlPS1K_SJ_SY_SY_S12_S14_S16_S19_S1D_S1G_mmE_vJS1S_SJ_SY_SY_S12_S14_S16_S19_S1D_S1G_mmEJNS_4nameENS_9is_methodENS_7siblingEEEEvOS1M_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4_FUNES28
@ 0x7f64e8dc6474 pybind11::cpp_function::dispatcher()
@ 0x4eebee (unknown)
@ 0x4ee7f6 (unknown)
@ 0x4aa9ab (unknown)
@ 0x4c15bf PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4d55f3 (unknown)
@ 0x4eebee (unknown)
@ 0x4ee7f6 (unknown)
@ 0x4aa9ab (unknown)
@ 0x4c15bf PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4c136f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4d55f3 (unknown)
@ 0x4a577e PyObject_Call
@ 0x4bed3d PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4d54b9 (unknown)
@ 0x4eebee (unknown)
@ 0x4a577e PyObject_Call
@ 0x548253 (unknown)
@ 0x4c15bf PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
Segmentation fault
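
To compare the single-GPU and multi-GPU runs on the same machine, one option is to restrict which devices the process can see. The sketch below assumes the GPU count is controlled through the standard `CUDA_VISIBLE_DEVICES` environment variable; the report does not say how the count was actually selected on the CI machine, and the helper names `env_for_gpus` and `run_test` are hypothetical.

```python
import os
import subprocess
import sys


def env_for_gpus(num_gpus, base_env=None):
    """Return a copy of the environment that exposes only the first num_gpus GPUs.

    Assumption: device visibility is controlled via CUDA_VISIBLE_DEVICES.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    # Devices are listed by index, e.g. "0,1,2" for three GPUs.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(num_gpus))
    return env


def run_test(num_gpus, script="test_parallel_executor.py"):
    """Run the test script in a child process and return its exit code.

    On POSIX, a negative exit code means the child was killed by a signal
    (for example -11 for SIGSEGV).
    """
    return subprocess.call([sys.executable, script], env=env_for_gpus(num_gpus))
```

Calling `run_test(n)` for `n` in 1 through 4 from the unittests directory would show which device counts end in the segmentation fault, without changing the test itself.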
