Inconsistent segfaults on non-uniform topology #2426

@Chrismarsh

I'm seeing intermittent segfaults: if I run the command below 10 times, it crashes on some runs and succeeds on others. I am building Open MPI and PRRTE from source via Spack.

 mpirun -n 1000 true
[myhost-157:1301000] *** Process received signal ***
[myhost-157:1301000] Signal: Segmentation fault (11)
[myhost-157:1301000] Signal code: Address not mapped (1)
[myhost-157:1301000] Failing at address: 0x1b8
[myhost-157:1301000] [ 0] /lib64/libc.so.6(+0x3e6f0)[0x7efc71bd66f0]
[myhost-157:1301000] [ 1] /spack/opt/linux-icelake/hwloc-2.13.0-enoxcy27v34v6bjqlcqpxvq22h3aw2ga/lib/libhwloc.so.15(hwloc_topology_get_allowed_cpuset+0x0)[0x7efc71db0ee0]
[myhost-157:1301000] [ 2] /spack/opt/linux-icelake/prrte-4.1.0-ip2ammxivjakk24szc2c6kka2cjc4gte/lib/libprrte.so.3(prte_hwloc_base_filter_cpus+0x8c)[0x7efc729b615c]
[myhost-157:1301000] [ 3] /spack/opt/linux-icelake/prrte-4.1.0-ip2ammxivjakk24szc2c6kka2cjc4gte/lib/libprrte.so.3(prte_plm_base_daemon_callback+0x1b08)[0x7efc72a16da8]
[myhost-157:1301000] [ 4] /spack/opt/linux-icelake/prrte-4.1.0-ip2ammxivjakk24szc2c6kka2cjc4gte/lib/libprrte.so.3(prte_rml_base_process_msg+0x249)[0x7efc729bb7a9]
[myhost-157:1301000] [ 5] /spack/opt/linux-icelake/libevent-2.1.12-cchgrwhqu4eqr4v5stf2x2wklzwv3sn3/lib/libevent_core-2.1.so.7(+0x1dff2)[0x7efc720c4ff2]
[myhost-157:1301000] [ 6] /spack/opt/linux-icelake/libevent-2.1.12-cchgrwhqu4eqr4v5stf2x2wklzwv3sn3/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7efc720c56bf]
[myhost-157:1301000] [ 7] /spack/opt/linux-icelake/prrte-4.1.0-ip2ammxivjakk24szc2c6kka2cjc4gte/lib/libprrte.so.3(prte+0x23d4)[0x7efc72992634]
[myhost-157:1301000] [ 8] /lib64/libc.so.6(+0x29590)[0x7efc71bc1590]
[myhost-157:1301000] [ 9] /lib64/libc.so.6(__libc_start_main+0x80)[0x7efc71bc1640]
[myhost-157:1301000] [10] prterun[0x401075]
[myhost-157:1301000] *** End of error message ***
Segmentation fault (core dumped)
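
Since the crash is intermittent, a small loop can put a number on how often it happens. The following is a minimal sketch, not part of the original report: the helper names `classify` and `repeat_run` are hypothetical, and it assumes the launcher is started directly (not through a shell), so a segfault in the launcher shows up as a negative return code on POSIX.

```python
import signal
import subprocess
from collections import Counter


def classify(returncode):
    """Map a subprocess return code to a coarse outcome label."""
    if returncode == 0:
        return "ok"
    # On POSIX, subprocess reports death by signal N as return code -N.
    if returncode == -signal.SIGSEGV:
        return "segfault"
    return "other"


def repeat_run(cmd, runs=10):
    """Run cmd `runs` times and count how each run ended."""
    outcomes = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        outcomes[classify(proc.returncode)] += 1
    return outcomes


# Example call (assumes mpirun is on PATH):
#   repeat_run(["mpirun", "-n", "1000", "true"], runs=10)
```

The resulting counts (for example, how many of 10 runs ended in "segfault") make it easier to compare failure rates across builds or node sets.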

Running with

 mpirun --debug-daemons --prtemca plm_base_verbose 100 -n 1000 true

I see quite a bit of heterogeneity across nodes:

NUMA[35,43,22,3:0]
NUMA[40,43,17,3:0]
NUMA[2:43,14,3:0]

and

RECEIVED TOPOLOGY SIG ... FROM NODE myhostcn-073
NEW TOPOLOGY - ADDING SIGNATURE

Then another node:

RECEIVED TOPOLOGY SIG ... FROM NODE myhostcn-128
TOPOLOGY SIGNATURE ALREADY RECORDED

and then the segfault. Have you seen this issue before? open-mpi/ompi#13357 looks similar, but PRRTE 4.1 doesn't seem to fully resolve it for me.
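
To see which node's topology report was handled last before the crash, the verbose output can be scanned for the messages quoted above. This is a rough sketch under assumptions: the function name `topology_events` is hypothetical, the signature text is elided in the excerpt so only the surrounding substrings are matched, and real log lines may carry extra prefixes.

```python
import re

# Matches the "RECEIVED TOPOLOGY SIG ... FROM NODE <name>" line quoted above.
NODE_RE = re.compile(r"RECEIVED TOPOLOGY SIG.*FROM NODE (\S+)")


def topology_events(lines):
    """Return [(node, status)] in log order.

    status is "new", "known", or None when no follow-up message was seen
    for that node (for example, because the process died first).
    """
    events = []
    pending = None  # node whose follow-up message has not been seen yet
    for line in lines:
        match = NODE_RE.search(line)
        if match:
            if pending is not None:
                events.append((pending, None))
            pending = match.group(1)
        elif pending is not None:
            if "NEW TOPOLOGY" in line:
                events.append((pending, "new"))
                pending = None
            elif "TOPOLOGY SIGNATURE ALREADY RECORDED" in line:
                events.append((pending, "known"))
                pending = None
    if pending is not None:
        events.append((pending, None))
    return events
```

Applied to a captured log, the last entry of the returned list names the node whose report was being processed closest to the segfault, and whether its signature was new or already recorded.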
