Description
I am trying to train ImageNet with the default ResNet on a single node with up to 4 P100 GPUs. On the master branch, training hangs. Attaching gdb to the hung process gives the stack trace below. If anyone can suggest what else to look at, I can debug the problem further. The hang occurs only with more than 2 GPUs: with 2 GPUs I can run for several epochs, but with 4 GPUs it hangs within the first epoch.
(gdb) bt
#0 0x00003fffac2cdd60 in pthread_cond_wait@@GLIBC_2.17 () at /lib64/libpthread.so.0
#1 0x00003fff4777608c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () at /lib64/libstdc++.so.6
#2 0x00003fff6a3e236c in std::condition_variable::wait<mxnet::engine::ThreadedEngine::WaitForVar(mxnet::Engine::VarHandle)::__lambda18>(std::unique_lock<std::mutex> &, mxnet::engine::ThreadedEngine::__lambda18) (this=0x3fff2c001198, __lock=..., __p=...) at /usr/include/c++/4.8.2/condition_variable:93
#3 0x00003fff6a3e1d10 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) (this=0x3fff2c001150, var=0x3bff50a6a900) at src/engine/threaded_engine.cc:358
#4 0x00003fff699b6cc8 in mxnet::NDArray::WaitToWrite() const (this=0x3bff49fa0cf0) at include/mxnet/./ndarray.h:330
#5 0x00003fff69be4c88 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const (this=0x3bff49fa0cf0, data=0x3bff9c9862c0, size=32) at src/ndarray/ndarray.cc:1210
#6 0x00003fff6a44d190 in MXNDArraySyncCopyToCPU(NDArrayHandle, void*, size_t) (handle=0x3bff49fa0cf0, data=0x3bff9c9862c0, size=32) at src/c_api/c_api.cc:253
#7 0x00003fffabed7254 in () at /lib64/libffi.so.6
#8 0x00003fffabed5f50 in ffi_call () at /lib64/libffi.so.6
#9 0x00003fffa5247b24 in _ctypes_callproc () at /usr/lib64/python2.7/lib-dynload/_ctypes.so
#10 0x00003fffa523a6ac in PyCFuncPtr_call () at /usr/lib64/python2.7/lib-dynload/_ctypes.so
#11 0x00003fffac361444 in PyObject_Call () at /lib64/libpython2.7.so.1.0
#12 0x00003fffac4669f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#13 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#14 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#15 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#16 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#17 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#18 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#19 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#20 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#21 0x00003fffac468c70 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#22 0x00003fffac468c70 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#23 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#24 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#25 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#26 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
#27 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
#28 0x00003fffac46cc64 in PyEval_EvalCode () at /lib64/libpython2.7.so.1.0
#29 0x00003fffac4a0528 in PyRun_FileExFlags () at /lib64/libpython2.7.so.1.0
#30 0x00003fffac4a274c in PyRun_SimpleFileExFlags () at /lib64/libpython2.7.so.1.0
#31 0x00003fffac4a2e9c in PyRun_AnyFileExFlags () at /lib64/libpython2.7.so.1.0
#32 0x00003fffac4beb7c in Py_Main () at /lib64/libpython2.7.so.1.0
#33 0x0000000010000738 in main ()
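
For anyone reading the trace: frames #0 to #3 show the main thread parked in a condition-variable wait with a predicate inside WaitForVar, reached from a synchronous copy to the CPU. Below is a minimal sketch of that general pattern, not MXNet's actual code. The names ToyVar, complete_one and wait_for_var are hypothetical, and the sketch uses Python 3's threading.Condition.wait_for (the run above was on Python 2.7). It only illustrates that such a wait returns once the pending work is completed and notified, and otherwise never returns unless a timeout is given.

```python
import threading


class ToyVar(object):
    """Hypothetical stand-in for a variable that has pending operations."""

    def __init__(self, pending):
        self.pending = pending
        self.cond = threading.Condition()

    def complete_one(self):
        # Called by a worker thread when one pending operation finishes.
        with self.cond:
            self.pending -= 1
            self.cond.notify_all()

    def wait_for_var(self, timeout):
        # Block until nothing is pending, or until the timeout expires.
        # Returns True if the variable became ready, False on timeout.
        with self.cond:
            return self.cond.wait_for(lambda: self.pending == 0,
                                      timeout=timeout)


def demo():
    # Case 1: a worker completes the pending operation, so the wait returns.
    ready = ToyVar(pending=1)
    worker = threading.Thread(target=ready.complete_one)
    worker.start()
    ok = ready.wait_for_var(timeout=5.0)
    worker.join()

    # Case 2: nobody ever completes the operation. Without the timeout this
    # call would block forever, which is what a hang in such a wait looks like.
    stuck = ToyVar(pending=1)
    hung = stuck.wait_for_var(timeout=0.2)
    return ok, hung
```

Under this reading, the interesting question is which pending operation never completes, so backtraces of the other threads (for example via gdb's `thread apply all bt`) are likely more informative than the main thread alone.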