This repository was archived by the owner on Nov 17, 2023. It is now read-only.
Test failure and possible bug on GPU topology algorithm (test_device.test_device_pushpull) #12994
Open
Description
NVIDIA reported a failure in test_device.test_device_pushpull on a DGX-1V.
I suspect there is a bug in the binary tree creation. I'm looking into this issue.
ERROR: test_device.test_device_pushpull
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/opt/mxnet/tests/python/gpu/test_device.py", line 74, in test_device_pushpull
check_dense_pushpull('device')
File "/opt/mxnet/tests/python/gpu/test_device.py", line 61, in check_dense_pushpull
kv_device.push(cur_key, arr_list)
File "/opt/mxnet/python/mxnet/kvstore.py", line 234, in push
self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
File "/opt/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:44:02] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 2 using backtracking
Environment info (Required)
What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.
Package used (Python/R/Scala/Julia):
(I'm using ...)
For Scala user, please provide:
- Java version: (`java -version`)
- Maven version: (`mvn -version`)
- Scala runtime if applicable: (`scala -version`)
For R user, please provide R `sessionInfo()`:
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
(Paste the output of `git rev-parse HEAD` here.)
Build config:
(Paste the content of config.mk, or the build command.)
Error Message:
(Paste the complete error message, including stack trace.)
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[17:47:41] src/kvstore/././comm_tree.h:392: Using Tree
[17:47:41] src/kvstore/././comm_tree.h:489: Size 10 occurs 1 times
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[17:47:41] src/kvstore/././comm_tree.h:392: Using Tree
[17:47:41] src/kvstore/././comm_tree.h:489: Size 10 occurs 1 times
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
Traceback (most recent call last):
File "test_device.py", line 82, in <module>
test_device_pushpull()
File "test_device.py", line 74, in test_device_pushpull
check_dense_pushpull('device')
File "test_device.py", line 61, in check_dense_pushpull
kv_device.push(cur_key, arr_list)
File "/opt/mxnet/python/mxnet/kvstore.py", line 234, in push
self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
File "/opt/mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:47:41] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 2 using backtracking
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7ffa6698659c]
[bt] (1) /usr/local/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7ffa66987918]
[bt] (2) /usr/local/lib/libmxnet.so(void mxnet::kvstore::ComputeTreesFromRoot<float>(std::vector<float, std::allocator<float> >*, int, int, float, bool, std::vector<unsigned long, std::allocator<unsigned long> >*, std::vector<unsigned long, std::allocator<unsigned long> >*)+0x1a65) [0x7ffa69a59ff5]
[bt] (3) /usr/local/lib/libmxnet.so(void mxnet::kvstore::ComputeTrees<float>(std::vector<float, std::allocator<float> > const&, int, float, bool, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*)+0x5b5) [0x7ffa69a5a815]
[bt] (4) /usr/local/lib/libmxnet.so(mxnet::kvstore::CommDeviceTree::QueryTopology()+0x1609) [0x7ffa69a5d409]
[bt] (5) /usr/local/lib/libmxnet.so(mxnet::kvstore::CommDeviceTree::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x137c) [0x7ffa69a5f0cc]
[bt] (6) /usr/local/lib/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x1b9) [0x7ffa69a60ec9]
[bt] (7) /usr/local/lib/libmxnet.so(mxnet::kvstore::KVStoreLocal::Push(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0xc6) [0x7ffa69a02ee6]
[bt] (8) /usr/local/lib/libmxnet.so(MXKVStorePushEx+0x205) [0x7ffa6993d1d5]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7ffab18e9e20]
Minimum reproducible example
(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
Steps to reproduce
(Paste the commands you ran that produced the error.)
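The peer-to-peer matrices printed by comm.h in the log above can be sanity-checked offline. The following is a minimal sketch, assuming that `v` in row i, column j means direct access from GPU i to GPU j is enabled and `.` means it is not; that reading is an assumption, although it agrees with the "N out of M GPU pairs" counts in the log. The helper names `parse_p2p_matrix` and `summarize` are hypothetical and are not part of MXNet.

```python
# Hypothetical helpers for inspecting the P2P matrices from the log.
# Assumption: 'v' at (row i, column j) means GPU i can directly access GPU j.

def parse_p2p_matrix(rows):
    """Return the set of ordered (src, dst) pairs marked 'v' in the matrix."""
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("matrix must be square")
    return {
        (i, j)
        for i, row in enumerate(rows)
        for j, cell in enumerate(row)
        if cell == "v"
    }


def summarize(rows):
    """Return (enabled pairs, total ordered pairs, whether the matrix is symmetric)."""
    n = len(rows)
    enabled = parse_p2p_matrix(rows)
    total = n * (n - 1)  # ordered pairs, excluding the diagonal
    symmetric = all((j, i) in enabled for (i, j) in enabled)
    return len(enabled), total, symmetric


# The 5-GPU and 8-GPU matrices copied from the log above.
ROWS_5 = [".vvvv", "v.vv.", "vv.v.", "vvv..", "v...."]
ROWS_8 = [
    ".vvvv...",
    "v.vv.v..",
    "vv.v..v.",
    "vvv....v",
    "v....vvv",
    ".v..v.vv",
    "..v.vv.v",
    "...vvvv.",
]

if __name__ == "__main__":
    for rows in (ROWS_5, ROWS_8):
        enabled, total, symmetric = summarize(rows)
        print(f"{enabled} out of {total} GPU pairs enabled, symmetric={symmetric}")
```

For the 8-GPU matrix this reports 32 out of 56 pairs, matching the log line, and shows that every GPU has exactly four direct peers. Under the assumption above, no GPU is directly connected to all others, which is the kind of partially connected topology the tree construction has to handle.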