This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

horovod seg-fault with mxnet pip wheels  #18772

Open
@eric-haibin-lin

Description

I am working on a bug fix for mxnet master with my horovod branch: https://github.com/eric-haibin-lin/horovod/tree/mx2

I noticed that the example passes if I use mxnet built from source:

# install mxnet 
git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
cd mxnet
cp config/linux.cmake config.cmake
rm -rf build
mkdir -p build && cd build
cmake -GNinja ..
cmake --build . --parallel 48
cd ../python; python setup.py develop --user; 
cd ./mxnet; ln -s ../../include include; ln -s ../../3rdparty 3rdparty; 

# install horovod 
cd horovod; python setup.py install --user; 

# run example 
cd example; horovodrun -np 2 mxnet2_mnist.py 

However, it segfaults immediately after the first broadcast call if I use an mxnet nightly pip wheel from https://repo.mxnet.io/dist/python, such as:
https://repo.mxnet.io/dist/python/cpu/mxnet-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
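One way to narrow down where the crash happens (not something tried in this report) is to enable Python's standard `faulthandler` module at the top of the example script, so that each worker process writes its Python-level stack when it receives SIGSEGV. The sketch below is only an illustration: the helper name `enable_crash_traces` and the log file naming are hypothetical, and the dump shows Python frames only, not the C++ frames inside libmxnet or horovod.

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_dir=None):
    """Write Python tracebacks of all threads to a per-process file on a fatal signal.

    Hypothetical helper: the name and the file naming scheme are assumptions.
    """
    log_dir = log_dir or tempfile.gettempdir()
    # One file per worker process, keyed by PID, so the ranks do not interleave.
    path = os.path.join(log_dir, f"segfault-trace-{os.getpid()}.log")
    log_file = open(path, "w")
    # faulthandler keeps using this file descriptor, so the file must stay open.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file, path
```

Calling `enable_crash_traces()` before the first broadcast and keeping the returned file object alive for the life of the process should leave one trace file per worker after a crash. For native frames, a core dump or gdb would still be needed.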

----------Python Info----------
Version      : 3.7.6
Compiler     : GCC 7.3.1 20180712 (Red Hat 7.3.1-6)
Build        : ('default', 'Feb 26 2020 20:54:15')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.1.1
Directory    : /home/ec2-user/.local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /home/ec2-user/src/mxnet/python/mxnet
Num GPUs     : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.14.173-137.229.amzn2.x86_64-x86_64-with-glibc2.2.5
system       : Linux
node         : ip-172-31-81-80.ec2.internal
release      : 4.14.173-137.229.amzn2.x86_64
version      : #1 SMP Wed Apr 1 18:06:08 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             1208.761
BogoMIPS:            4999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
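The MXNet Info section above reports a source-tree directory, so when switching between the source build and the pip wheel it may help to confirm which installation Python actually resolves before each run. A minimal sketch using only the standard library; the helper name `locate_package` is hypothetical, and the lookup does not import the package, so it cannot trigger the crash itself.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be loaded from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # A path under a source checkout indicates the develop install,
    # while a path under site-packages indicates the pip wheel.
    print("mxnet resolves to:", locate_package("mxnet"))
```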
