Closed
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
We confirmed that the issue is related to this change.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
On main branch
./autogen.pl
./configure --prefix=/home/ec2-user/ompi/install --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --with-libevent=external --with-hwloc=external --enable-cuda --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --disable-builtin-atomics --enable-debug
make -j install
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
$ git submodule status
10fe4735ee374f5807c2160e61274c4aa53491ae 3rd-party/openpmix (v1.1.3-3847-g10fe4735)
d8bd12b3ffda4af6918d641f024a6b0118789700 3rd-party/prrte (psrvr-v2.0.0rc1-4624-gd8bd12b3ff)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)
Please describe the system on which you are running
- Operating system/version: Amazon Linux 2/RHEL 7
- Computer hardware: EC2 p4d.24xlarge instance
- Network type: EFA
External hwloc
$ yum list installed | grep hwloc
hwloc.x86_64 1.11.8-4.amzn2 @amzn2-core
hwloc-devel.x86_64 1.11.8-4.amzn2 @amzn2-core
hwloc-gui.x86_64 1.11.8-4.amzn2 @amzn2-core
hwloc-libs.x86_64 1.11.8-4.amzn2 @amzn2-core
hwloc-plugins.x86_64 1.11.8-4.amzn2 @amzn2-core
Details of the problem
Problem 1: Compilation error with external hwloc
CC common_ofi.lo
LN_S libopen-palmca_common_ofi.la
common_ofi.c: In function 'is_near':
common_ofi.c:619:25: error: 'struct hwloc_obj' has no member named 'io_first_child'; did you mean 'first_child'?
for(osdev = pcidev->io_first_child; osdev != NULL; osdev = osdev->next_sibling) {
                    ^~~~~~~~~~~~~~
                    first_child
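The error above is consistent with the hwloc API split between major versions: the `io_first_child` member was introduced in hwloc 2.0, where I/O objects moved to their own child list, while the system hwloc here is 1.11.8, where I/O objects sit in the regular `first_child` list. Below is a minimal sketch of a version guard on `HWLOC_API_VERSION`. It is only an illustration and not necessarily the fix adopted upstream. `struct demo_obj`, `demo_first_io_child`, and `demo_count_osdevs` are hypothetical stand-ins so the snippet compiles without hwloc installed; with the real library, the object type and `HWLOC_API_VERSION` come from `<hwloc.h>`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-in for the macro normally provided by <hwloc.h>.
 * The default is a 1.11-series value, matching the hwloc 1.11.8 packages
 * listed in this report. hwloc 2.0 corresponds to 0x00020000. */
#ifndef HWLOC_API_VERSION
#define HWLOC_API_VERSION 0x00010b00
#endif

/* Hypothetical, minimal stand-in for struct hwloc_obj. Only the members
 * needed for the traversal are modeled. */
struct demo_obj {
    struct demo_obj *first_child;    /* normal children (all children in 1.x) */
    struct demo_obj *next_sibling;
#if HWLOC_API_VERSION >= 0x00020000
    struct demo_obj *io_first_child; /* separate I/O child list, 2.0 and later */
#endif
};

/* Return the head of the list that holds the OS devices below a PCI device,
 * selected at compile time from the hwloc API version. */
static struct demo_obj *demo_first_io_child(struct demo_obj *pcidev)
{
    assert(pcidev != NULL);
#if HWLOC_API_VERSION >= 0x00020000
    return pcidev->io_first_child;
#else
    return pcidev->first_child;
#endif
}

/* Count the objects in that list, using the same loop shape as the failing
 * line in common_ofi.c. */
static int demo_count_osdevs(struct demo_obj *pcidev)
{
    int count = 0;
    for (struct demo_obj *osdev = demo_first_io_child(pcidev);
         osdev != NULL; osdev = osdev->next_sibling) {
        count++;
    }
    return count;
}
```

With real hwloc 1.x the `first_child` list can also contain objects that are not OS devices, so a real traversal would likely also need to check each object's type before using it.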
Problem 2: Segfault with internal libevent & hwloc with OSU microbenchmark
In this case we configured with internal libevent and hwloc: ./configure ... --with-libevent=internal --with-hwloc=internal ...
Then we ran OMB:
/home/ec2-user/ompi/install/bin/mpirun --hostfile hostfile --map-by ppr:2:node --bind-to none -x PATH=/home/ec2-user/ompi/install/bin:$PATH /home/ec2-user/omb/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
Note: OMB was built against this Open MPI install, and the hostfile lists 2 p4d.24xlarge instances.
The segfault happens here (paths redacted for conciseness):
#0 0x00007ff9c81e0707 in __strncasecmp_l_avx () from /lib64/libc.so.6
#1 0x00007ff9c712e57f in pmix_hwloc_destruct_topology (src=0x7ff9c7a40a7a) at hwloc/pmix_hwloc_datatype.c:504
#2 0x00007ff9c728ff09 in pmix_bfrops_base_tma_topology_destruct (t=0x7ff9c7a40a7a, tma=0x0) at .../ompi/3rd-party/openpmix/src/mca/bfrops/base/bfrop_base_tma.h:901
#3 0x00007ff9c7290081 in pmix_bfrops_base_tma_topology_free (t=0x7ff9c7a40a7a, n=1, tma=0x0) at .../ompi/3rd-party/openpmix/src/mca/bfrops/base/bfrop_base_tma.h:950
#4 0x00007ff9c72995b1 in PMIx_Topology_free (t=0x7ff9c7a40a7a, n=1) at base/bfrop_base_macro_backers.c:327
#5 0x00007ff9c79d85f5 in compute_dev_distances (distances=0x7ffea0685be0, ndist=0x7ffea0685e20) at common_ofi.c:487
#6 0x00007ff9c79d875d in get_nearest_nics (num_distances=0x7ffea0685e7c, valin=0x7ffea0685e88) at common_ofi.c:535
#7 0x00007ff9c79d94f1 in opal_common_ofi_select_provider (provider_list=0x1e97d40, process_info=0x7ff9c7c78bc0 <opal_process_info>) at common_ofi.c:804
#8 0x00007ff9c8b19751 in select_ofi_provider (providers=0x1e97d40, include_list=0x0, exclude_list=0x1d3c310) at mtl_ofi_component.c:357
#9 0x00007ff9c8b1ab20 in ompi_mtl_ofi_component_init (enable_progress_threads=false, enable_mpi_threads=false, accelerator_support=0x7ff9c8fd9770 <mca_mtl_ofi_component+272>) at mtl_ofi_component.c:780
#10 0x00007ff9c8b0fb70 in ompi_mtl_base_select (enable_progress_threads=false, enable_mpi_threads=false, priority=0x7ffea06860fc) at base/mtl_base_frame.c:78
#11 0x00007ff9c8c77f42 in mca_pml_cm_component_init (priority=0x7ffea06860fc, enable_progress_threads=false, enable_mpi_threads=false) at pml_cm_component.c:146
#12 0x00007ff9c8c4a67f in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at base/pml_base_select.c:127
#13 0x00007ff9c892b8f1 in ompi_mpi_instance_init_common (argc=1, argv=0x7ffea0686f08) at instance/instance.c:508
#14 0x00007ff9c892c2da in ompi_mpi_instance_init (ts_level=0, info=0x62e9e0 <ompi_mpi_info_null>, errhandler=0x7ff9c8fea820 <ompi_mpi_errors_are_fatal>, instance=0x7ff9c8ffb900 <ompi_mpi_instance_default>, argc=1, argv=0x7ffea0686f08) at instance/instance.c:814
#15 0x00007ff9c891ba74 in ompi_mpi_init (argc=1, argv=0x7ffea0686f08, requested=0, provided=0x7ffea0686d7c, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#16 0x00007ff9c89819cf in PMPI_Init (argc=0x7ffea0686dbc, argv=0x7ffea0686db0) at init.c:67
#17 0x000000000040269e in main (argc=<optimized out>, argv=<optimized out>) at osu_mbw_mr.c:49