Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
The current master branch: 65ca64f
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a git clone, as follows:
$ git clone https://github.com/open-mpi/ompi.git
$ cd ompi/
$ git submodule update --init --recursive
$ ./autogen.pl
$ mkdir build
$ cd build/
$ ../configure --prefix=<install_path> --with-ucx=<path_to_ucx> --disable-man-pages
$ make -j
$ make install
UCX v1.10.1 was built from a tarball.
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
$ git submodule status
256b1f5dec15386990b57c7fc4c7ecd67a6f1e27 3rd-party/openpmix (v1.1.3-3014-g256b1f5)
53e80245ad007550aee18c3fd176e030a173a16b 3rd-party/prrte (dev-31257-g53e8024)
Please describe the system on which you are running
- Operating system/version: Red Hat Enterprise Linux 7 (3.10.0-957.21.3.el7.x86_64)
- Computer hardware: Intel Xeon Platinum 8280 (Cascadelake)
- Network type: Intel Omni-Path
Details of the problem
Calling MPI_Compare_and_swap() in a "flat MPI" model, where multiple nodes are used and multiple processes run on each node, causes a segfault with the rdma osc component.
The segfault did not occur with a single node or with multiple nodes running one process per node.
Minimal code example to reproduce the segfault:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <mpi.h>
int main(int argc, char** argv) {
MPI_Init(&argc, &argv);
uint64_t* lock;
MPI_Win win;
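// Collectively allocate one uint64_t of window memory on each rank; lock points to the local copy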
MPI_Win_allocate(sizeof(uint64_t), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &lock, &win);
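// Begin a passive-target access epoch to all ranks in the window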
MPI_Win_lock_all(0, win);
*lock = 0;
MPI_Barrier(MPI_COMM_WORLD);
const uint64_t one = 1;
const uint64_t zero = 0;
uint64_t result;
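// Atomically replace rank 0's lock with 1 if it still equals 0; the previous value is returned in result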
MPI_Compare_and_swap(&one, &zero, &result, MPI_UINT64_T, 0, 0, win);
MPI_Win_flush(0, win);
printf("%ld\n", result);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Win_unlock_all(win);
MPI_Finalize();
return 0;
}
This program first initializes lock to 0, and then all processes issue MPI_Compare_and_swap() on lock at rank 0.
The expected behavior is that exactly one process gets result = 0.
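With 4 processes, correct behavior would therefore produce exactly one 0 and three 1s, in an arbitrary order, for example:
0
1
1
1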
Running the above program with 4 processes on 2 nodes:
$ mpirun --mca osc rdma -n 4 -N 2 ./a.out
Output:
[cx0001:24799:0:24799] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[cx0001:24800:0:24800] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
==== backtrace (tid: 24800) ====
0 0x00000000000587b5 ucs_debug_print_backtrace() <HOME>/ucx-1.10.1/build/src/ucs/../../../src/ucs/debug/debug.c:656
1 0x00000000000b9e05 mca_btl_ofi_afop() ???:0
2 0x000000000023f176 ompi_osc_rdma_lock_all_atomic() ???:0
3 0x00000000000f81c6 MPI_Win_lock_all() ???:0
4 0x00000000004009f1 main() test_cas.c:13
5 0x00000000000223d5 __libc_start_main() ???:0
6 0x00000000004008e9 _start() ???:0
=================================
a.out:24800 terminated with signal 11 at PC=2b59ed535e05 SP=7fffcfd3bb00. Backtrace:
<ompi_install_path>/lib/libopen-pal.so.0(mca_btl_ofi_afop+0x105)[0x2b59ed535e05]
<ompi_install_path>/lib/libmpi.so.0(ompi_osc_rdma_lock_all_atomic+0x326)[0x2b59ecb77176]
<ompi_install_path>/lib/libmpi.so.0(PMPI_Win_lock_all+0x96)[0x2b59eca301c6]
./a.out[0x4009f1]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b59ed0d13d5]
./a.out[0x4008e9]
==== backtrace (tid: 24799) ====
0 0x00000000000587b5 ucs_debug_print_backtrace() <HOME>/ucx-1.10.1/build/src/ucs/../../../src/ucs/debug/debug.c:656
1 0x00000000000b9e05 mca_btl_ofi_afop() ???:0
2 0x000000000023f176 ompi_osc_rdma_lock_all_atomic() ???:0
3 0x00000000000f81c6 MPI_Win_lock_all() ???:0
4 0x00000000004009f1 main() test_cas.c:13
5 0x00000000000223d5 __libc_start_main() ???:0
6 0x00000000004008e9 _start() ???:0
=================================
a.out:24799 terminated with signal 11 at PC=2b3c2c582e05 SP=7ffe75a10190. Backtrace:
<ompi_install_path>/lib/libopen-pal.so.0(mca_btl_ofi_afop+0x105)[0x2b3c2c582e05]
<ompi_install_path>/lib/libmpi.so.0(ompi_osc_rdma_lock_all_atomic+0x326)[0x2b3c2bbc4176]
<ompi_install_path>/lib/libmpi.so.0(PMPI_Win_lock_all+0x96)[0x2b3c2ba7d1c6]
./a.out[0x4009f1]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3c2c11e3d5]
./a.out[0x4008e9]
Running with -n 4 -N 1 (one process per node) or -n 4 -N 4 (a single node) did not cause a segfault.
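For reference, the non-crashing runs would presumably look like the following (assuming the same mpirun options as above):
$ mpirun --mca osc rdma -n 4 -N 1 ./a.out
$ mpirun --mca osc rdma -n 4 -N 4 ./a.out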