This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Description
I am working on a bug fix for mxnet master with my horovod branch: https://github.com/eric-haibin-lin/horovod/tree/mx2
I noticed that the example passes if I use mxnet built from source:
# install mxnet
git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
cd mxnet
cp config/linux.cmake config.cmake
rm -rf build
mkdir -p build && cd build
cmake -GNinja ..
cmake --build . --parallel 48
cd ../python; python setup.py develop --user;
cd ./mxnet; ln -s ../../include include; ln -s ../../3rdparty 3rdparty;
# install horovod
cd horovod; python setup.py install --user;
# run example
cd example; horovodrun -np 2 python mxnet2_mnist.py
However, it segfaults immediately after the first broadcast call if I use an mxnet nightly pip wheel from https://repo.mxnet.io/dist/python, such as:
https://repo.mxnet.io/dist/python/cpu/mxnet-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
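To narrow down where the segfault happens, one option is to run the script with Python's built-in faulthandler enabled, which dumps the Python-level traceback to stderr when the process receives a fatal signal such as SIGSEGV. The following is only a sketch: `run_with_faulthandler` is a hypothetical helper name (not part of MXNet or Horovod), and the child command shown is a harmless stand-in for the real example script.

```python
import subprocess
import sys


def run_with_faulthandler(args):
    """Run a Python child process with faulthandler enabled.

    `-X faulthandler` makes the interpreter print the Python traceback
    to stderr on fatal signals such as SIGSEGV.
    `run_with_faulthandler` is a hypothetical helper for illustration.
    """
    cmd = [sys.executable, "-X", "faulthandler"] + list(args)
    return subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.7)
    )


if __name__ == "__main__":
    # Stand-in for the real script: confirm the handler is active in the child.
    result = run_with_faulthandler(
        ["-c", "import faulthandler; print(faulthandler.is_enabled())"]
    )
    print(result.stdout.strip())
```

With horovodrun, the same interpreter flag could presumably be passed in the launched command (`python -X faulthandler mxnet2_mnist.py`). Note that faulthandler reports only Python frames; a native backtrace would still need a debugger or a core dump.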
----------Python Info----------
Version : 3.7.6
Compiler : GCC 7.3.1 20180712 (Red Hat 7.3.1-6)
Build : ('default', 'Feb 26 2020 20:54:15')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 20.1.1
Directory : /home/ec2-user/.local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version : 2.0.0
Directory : /home/ec2-user/src/mxnet/python/mxnet
Num GPUs : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.14.173-137.229.amzn2.x86_64-x86_64-with-glibc2.2.5
system : Linux
node : ip-172-31-81-80.ec2.internal
release : 4.14.173-137.229.amzn2.x86_64
version : #1 SMP Wed Apr 1 18:06:08 UTC 2020
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping: 7
CPU MHz: 1208.761
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
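The MXNet Info section above reports the source-tree directory (/home/ec2-user/src/mxnet/python/mxnet), so these diagnostics were collected with the source build rather than the wheel. When switching between the two, it may help to confirm which install Python actually resolves. A minimal sketch using only the standard library, which locates a package without importing it; `package_location` is a hypothetical helper name introduced here for illustration.

```python
import importlib.util


def package_location(name):
    """Return the path a top-level package would be imported from.

    Returns None if the package cannot be found. The package itself is
    not imported, so a crashing native library is never loaded.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.submodule_search_locations:
        # Regular package: report its directory.
        return list(spec.submodule_search_locations)[0]
    # Plain module: report the file it is loaded from.
    return spec.origin


if __name__ == "__main__":
    for pkg in ("mxnet", "horovod"):
        print(pkg, "->", package_location(pkg))
```

A path under the cloned source tree would indicate the `setup.py develop` install, while a path under site-packages would indicate the pip wheel.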