I'm seeing some very strange intermittent segfaults: if I run the command below 10 times, it crashes on some runs and not on others. I am building Open MPI and PRRTE from source via Spack.
mpirun -n 1000 true
[myhost-157:1301000] *** Process received signal ***
[myhost-157:1301000] Signal: Segmentation fault (11)
[myhost-157:1301000] Signal code: Address not mapped (1)
[myhost-157:1301000] Failing at address: 0x1b8
[myhost-157:1301000] [ 0] /lib64/libc.so.6(+0x3e6f0)[0x7efc71bd66f0]
[myhost-157:1301000] [ 1] /spack/opt/linux-icelake/hwloc-2.13.0-enoxcy27v34v6bjqlcqpxvq22h3aw2ga/lib/libhwloc.so.15(hwloc_topology_get_allowed_cpuset+0x0)[0x7efc71db0ee0]
[myhost-157:1301000] [ 2] /spack/opt/linux-icelake/prrte-4.1.0-ip2ammxivjakk24szc2c6kka2cjc4gte/lib/libprrte.so.3(prte_hwloc_base_filter_cpus+0x8c)[0x7efc729b615c]
[myhost-157:1301000] [ 3] /spack/opt/linux-icelake/prrte-4.1.0-ip2ammxivjakk24szc2c6kka2cjc4gte/lib/libprrte.so.3(prte_plm_base_daemon_callback+0x1b08)[0x7efc72a16da8]
[myhost-157:1301000] [ 4] /spack/opt/linux-icelake/prrte-4.1.0-ip2ammxivjakk24szc2c6kka2cjc4gte/lib/libprrte.so.3(prte_rml_base_process_msg+0x249)[0x7efc729bb7a9]
[myhost-157:1301000] [ 5] /spack/opt/linux-icelake/libevent-2.1.12-cchgrwhqu4eqr4v5stf2x2wklzwv3sn3/lib/libevent_core-2.1.so.7(+0x1dff2)[0x7efc720c4ff2]
[myhost-157:1301000] [ 6] /spack/opt/linux-icelake/libevent-2.1.12-cchgrwhqu4eqr4v5stf2x2wklzwv3sn3/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7efc720c56bf]
[myhost-157:1301000] [ 7] /spack/opt/linux-icelake/prrte-4.1.0-ip2ammxivjakk24szc2c6kka2cjc4gte/lib/libprrte.so.3(prte+0x23d4)[0x7efc72992634]
[myhost-157:1301000] [ 8] /lib64/libc.so.6(+0x29590)[0x7efc71bc1590]
[myhost-157:1301000] [ 9] /lib64/libc.so.6(__libc_start_main+0x80)[0x7efc71bc1640]
[myhost-157:1301000] [10] prterun[0x401075]
[myhost-157:1301000] *** End of error message ***
Segmentation fault (core dumped)
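Since the crash is intermittent, a loop like the following could be used to count how often it happens. This is only a sketch: `count_failures` is a hypothetical helper name (not part of Open MPI or PRRTE), and it treats any non-zero exit status, including the one produced by a segfault, as a failure.

```shell
#!/usr/bin/env bash
# Run a command N times and print how many runs exited non-zero.
# count_failures is a hypothetical helper, not part of Open MPI or PRRTE.
count_failures() {
    local runs=$1; shift
    local failures=0
    local i
    for ((i = 1; i <= runs; i++)); do
        # Discard output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}

# Example usage (commented out so sourcing this file does not launch a job):
# count_failures 10 mpirun -n 1000 true
```

The number printed is the count of failed runs out of the total requested.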
Running with
mpirun --debug-daemons --prtemca plm_base_verbose 100 -n 1000 true
shows that I have quite a bit of heterogeneity:
NUMA[35,43,22,3:0]
NUMA[40,43,17,3:0]
NUMA[2:43,14,3:0]
and
RECEIVED TOPOLOGY SIG ... FROM NODE myhostcn-073
NEW TOPOLOGY - ADDING SIGNATURE
Then another node:
RECEIVED TOPOLOGY SIG ... FROM NODE myhostcn-128
TOPOLOGY SIGNATURE ALREADY RECORDED
and then the segfault. Have you seen this issue before? open-mpi/ompi#13357 seems similar, but PRRTE 4.1 doesn't seem to fully resolve it for me.