Main branch compilation broken with older hwloc & segfault with internal hwloc #11637

Closed
@wenduwan

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

42e577f

We confirmed that the issue is related to this change.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

On main branch

./autogen.pl
./configure --prefix=/home/ec2-user/ompi/install --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --with-libevent=external --with-hwloc=external --enable-cuda --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --disable-builtin-atomics --enable-debug
make -j install

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 10fe4735ee374f5807c2160e61274c4aa53491ae 3rd-party/openpmix (v1.1.3-3847-g10fe4735)
 d8bd12b3ffda4af6918d641f024a6b0118789700 3rd-party/prrte (psrvr-v2.0.0rc1-4624-gd8bd12b3ff)
 c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2/RHEL 7
  • Computer hardware: EC2 p4d.24xlarge instance
  • Network type: EFA

External hwloc

$ yum list installed | grep hwloc
hwloc.x86_64                        1.11.8-4.amzn2                   @amzn2-core
hwloc-devel.x86_64                  1.11.8-4.amzn2                   @amzn2-core
hwloc-gui.x86_64                    1.11.8-4.amzn2                   @amzn2-core
hwloc-libs.x86_64                   1.11.8-4.amzn2                   @amzn2-core
hwloc-plugins.x86_64                1.11.8-4.amzn2                   @amzn2-core

Details of the problem

Problem 1: Compilation error with external hwloc

  CC       common_ofi.lo
  LN_S     libopen-palmca_common_ofi.la
common_ofi.c: In function 'is_near':
common_ofi.c:619:25: error: 'struct hwloc_obj' has no member named 'io_first_child'; did you mean 'first_child'?
     for(osdev = pcidev->io_first_child; osdev != NULL; osdev = osdev->next_sibling) {
                         ^~~~~~~~~~~~~~
                         first_child
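For context, `io_first_child` is a member that `struct hwloc_obj` gained in hwloc 2.0, so it is not available in the 1.11.8 headers installed here. The following is only a minimal sketch of how such an access could be guarded by API version; it is not the actual fix. It uses a mock structure and a hypothetical `DEMO_HWLOC_API_VERSION` macro in place of the real `<hwloc.h>` and `HWLOC_API_VERSION`, and it assumes that with hwloc 1.x the OS devices are reachable through the ordinary `first_child` list. All `demo_*` names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for HWLOC_API_VERSION from <hwloc.h>.
 * 0x00010b00 models a 1.11-series install; 0x00020000 models hwloc 2.0. */
#ifndef DEMO_HWLOC_API_VERSION
#define DEMO_HWLOC_API_VERSION 0x00010b00
#endif

/* Simplified mock of struct hwloc_obj (illustration only). */
struct demo_obj {
    const char *name;
    struct demo_obj *first_child;
    struct demo_obj *next_sibling;
#if DEMO_HWLOC_API_VERSION >= 0x00020000
    struct demo_obj *io_first_child; /* only exists in the 2.x layout */
#endif
};

/* Return the head of the list that holds the OS devices of a PCI device,
 * picking the member that exists for the API version being compiled against. */
static const struct demo_obj *demo_first_osdev(const struct demo_obj *pcidev)
{
    assert(pcidev != NULL);
#if DEMO_HWLOC_API_VERSION >= 0x00020000
    return pcidev->io_first_child;
#else
    return pcidev->first_child; /* assumption: 1.x keeps I/O objects here */
#endif
}

/* Return 1 if an OS device called `name` hangs off `pcidev`, else 0. */
static int demo_has_osdev(const struct demo_obj *pcidev, const char *name)
{
    const struct demo_obj *osdev;

    for (osdev = demo_first_osdev(pcidev); osdev != NULL;
         osdev = osdev->next_sibling) {
        if (osdev->name != NULL && 0 == strcmp(osdev->name, name)) {
            return 1;
        }
    }
    return 0;
}
```

In real code the same `#if` would test `HWLOC_API_VERSION` directly, so the loop in `is_near` compiles against both hwloc 1.x and 2.x headers.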

Problem 2: Segfault with internal libevent & hwloc with OSU microbenchmark

In this case we configured with ./configure ... --with-libevent=internal --with-hwloc=internal ...

Then we ran OMB:

/home/ec2-user/ompi/install/bin/mpirun --hostfile hostfile --map-by ppr:2:node --bind-to none -x PATH=/home/ec2-user/ompi/install/bin:$PATH /home/ec2-user/omb/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr

Note: OMB was built against this Open MPI install, and the hostfile lists 2 p4d.24xlarge instances.

The segfault happens here (paths redacted for conciseness):

#0  0x00007ff9c81e0707 in __strncasecmp_l_avx () from /lib64/libc.so.6
#1  0x00007ff9c712e57f in pmix_hwloc_destruct_topology (src=0x7ff9c7a40a7a) at hwloc/pmix_hwloc_datatype.c:504
#2  0x00007ff9c728ff09 in pmix_bfrops_base_tma_topology_destruct (t=0x7ff9c7a40a7a, tma=0x0) at .../ompi/3rd-party/openpmix/src/mca/bfrops/base/bfrop_base_tma.h:901
#3  0x00007ff9c7290081 in pmix_bfrops_base_tma_topology_free (t=0x7ff9c7a40a7a, n=1, tma=0x0) at .../ompi/3rd-party/openpmix/src/mca/bfrops/base/bfrop_base_tma.h:950
#4  0x00007ff9c72995b1 in PMIx_Topology_free (t=0x7ff9c7a40a7a, n=1) at base/bfrop_base_macro_backers.c:327
#5  0x00007ff9c79d85f5 in compute_dev_distances (distances=0x7ffea0685be0, ndist=0x7ffea0685e20) at common_ofi.c:487
#6  0x00007ff9c79d875d in get_nearest_nics (num_distances=0x7ffea0685e7c, valin=0x7ffea0685e88) at common_ofi.c:535
#7  0x00007ff9c79d94f1 in opal_common_ofi_select_provider (provider_list=0x1e97d40, process_info=0x7ff9c7c78bc0 <opal_process_info>) at common_ofi.c:804
#8  0x00007ff9c8b19751 in select_ofi_provider (providers=0x1e97d40, include_list=0x0, exclude_list=0x1d3c310) at mtl_ofi_component.c:357
#9  0x00007ff9c8b1ab20 in ompi_mtl_ofi_component_init (enable_progress_threads=false, enable_mpi_threads=false, accelerator_support=0x7ff9c8fd9770 <mca_mtl_ofi_component+272>) at mtl_ofi_component.c:780
#10 0x00007ff9c8b0fb70 in ompi_mtl_base_select (enable_progress_threads=false, enable_mpi_threads=false, priority=0x7ffea06860fc) at base/mtl_base_frame.c:78
#11 0x00007ff9c8c77f42 in mca_pml_cm_component_init (priority=0x7ffea06860fc, enable_progress_threads=false, enable_mpi_threads=false) at pml_cm_component.c:146
#12 0x00007ff9c8c4a67f in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at base/pml_base_select.c:127
#13 0x00007ff9c892b8f1 in ompi_mpi_instance_init_common (argc=1, argv=0x7ffea0686f08) at instance/instance.c:508
#14 0x00007ff9c892c2da in ompi_mpi_instance_init (ts_level=0, info=0x62e9e0 <ompi_mpi_info_null>, errhandler=0x7ff9c8fea820 <ompi_mpi_errors_are_fatal>, instance=0x7ff9c8ffb900 <ompi_mpi_instance_default>, argc=1, argv=0x7ffea0686f08) at instance/instance.c:814
#15 0x00007ff9c891ba74 in ompi_mpi_init (argc=1, argv=0x7ffea0686f08, requested=0, provided=0x7ffea0686d7c, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#16 0x00007ff9c89819cf in PMPI_Init (argc=0x7ffea0686dbc, argv=0x7ffea0686db0) at init.c:67
#17 0x000000000040269e in main (argc=<optimized out>, argv=<optimized out>) at osu_mbw_mr.c:49
