
Is UCX working with MPI-Sessions? #12566

Closed
@TimEllersiek

Description

UCX and MPI-Sessions

When I try to use Open MPI with UCX on our small university cluster, I get an error message
saying that MPI Sessions features are not supported by UCX (the cluster uses an InfiniBand interconnect).
However, when I install it on my local machine (Arch Linux),
everything seems to work fine. So I'm wondering: are MPI Sessions supported by UCX or not?
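One way to compare the two installations is to list which PML components each Open MPI build actually provides (a sketch; `ompi_info` ships with Open MPI, and the exact output format may vary between versions):

```shell
# List all MCA components and keep only the pml (point-to-point messaging
# layer) entries, to see whether e.g. ucx and ob1 are both available.
ompi_info | grep -i "pml"
```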

Source Code (main.c):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void function_my_session_errhandler(MPI_Session *foo, int *bar, ...) {
    fprintf(stderr, "my error handler called here with error %d\n", *bar);
}

void function_check_print_error(char *format, int rc) {
    if (MPI_SUCCESS != rc) {
        fprintf(stderr, format, rc);
        abort();
    }
}

int main(int argc, char *argv[]) {
    MPI_Session session;
    MPI_Errhandler errhandler;
    MPI_Group group;
    MPI_Comm comm_world, comm_self;
    MPI_Info info;
    int rc, npsets, one = 1, sum;

    rc = MPI_Session_create_errhandler(function_my_session_errhandler, &errhandler);
    function_check_print_error("Error handler creation failed with rc = %d\n", rc);


    rc = MPI_Info_create(&info);
    function_check_print_error("Info creation failed with rc = %d\n", rc);

    rc = MPI_Info_set(info, "thread_level", "MPI_THREAD_MULTIPLE");
    function_check_print_error("Info key/val set failed with rc = %d\n", rc);

    rc = MPI_Session_init(info, errhandler, &session);
    function_check_print_error("Session initialization failed with rc = %d\n", rc);

    rc = MPI_Session_get_num_psets(session, MPI_INFO_NULL, &npsets);
    function_check_print_error("Could not get the number of psets. rc = %d\n", rc);

    for (int i = 0; i < npsets; i++) {
        int psetlen = 0;
        char pset_name[256];

        /* The first call only queries the required name length; the
           second call retrieves the pset name itself. */
        MPI_Session_get_nth_pset(session, MPI_INFO_NULL, i, &psetlen, NULL);
        MPI_Session_get_nth_pset(session, MPI_INFO_NULL, i, &psetlen, pset_name);
        fprintf(stderr, "  PSET %d: %s (len: %d)\n", i, pset_name, psetlen);
    }


   
    rc = MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    function_check_print_error("Could not get a group for mpi://WORLD. rc = %d\n", rc);

    rc = MPI_Comm_create_from_group(group, "my_world", MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm_world);
    function_check_print_error("Could not create Communicator my_world. rc = %d\n", rc);

    MPI_Group_free(&group);

    MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, comm_world);

    fprintf(stderr, "World Comm Sum (1): %d\n", sum);

    rc = MPI_Group_from_session_pset(session, "mpi://SELF", &group);
    function_check_print_error("Could not get a group for mpi://SELF. rc = %d\n", rc);

    rc = MPI_Comm_create_from_group(group, "myself", MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm_self);
    function_check_print_error("Could not create Communicator myself. rc = %d\n", rc);
    MPI_Group_free(&group);

    MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, comm_self);

    fprintf(stderr, "Self Comm Sum (1): %d\n", sum);


    MPI_Errhandler_free(&errhandler);
    MPI_Info_free(&info);
    MPI_Comm_free(&comm_world);
    MPI_Comm_free(&comm_self);
    MPI_Session_finalize(&session);

    return 0;
}
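The cluster run below fails with a bare rc value (52), which is hard to interpret on its own. As a sketch, assuming the same MPI toolchain, the checker could additionally decode the code with the standard MPI_Error_string routine before aborting:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Variant of function_check_print_error that also prints the
 * human-readable message associated with the MPI error code. */
void function_check_print_error_verbose(char *format, int rc) {
    if (MPI_SUCCESS != rc) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        fprintf(stderr, format, rc);
        if (MPI_SUCCESS == MPI_Error_string(rc, msg, &len)) {
            fprintf(stderr, "  -> %s\n", msg);
        }
        abort();
    }
}
```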

Commands used to compile and run

mpicc -o main main.c
mpirun -np 1 -mca osc ucx out/main
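To narrow down whether the UCX PML itself is the limiting factor, one diagnostic (assumption: the ob1 PML is built into this Open MPI installation) is to force a non-UCX PML for the same binary:

```shell
# Force the ob1 PML instead of ucx; if the session calls then succeed,
# the limitation lies in the UCX PML rather than in the program.
mpirun -np 1 --mca pml ob1 ./main
```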

Console Output Uni-Cluster:

$ mpirun -np 1 -mca pml ucx main
  PSET 0: mpi://WORLD (len: 12)
  PSET 1: mpi://SELF (len: 11)
  PSET 2: mpix://SHARED (len: 14)
Could not create Communicator my_world. rc = 52
[nv46:97180] *** Process received signal ***
[nv46:97180] Signal: Aborted (6)
[nv46:97180] Signal code:  (-6)
--------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_from_group/MPI_Intercomm_from_groups
  Reason:       The PML being used - ucx - does not support MPI sessions related features
--------------------------------------------------------------------------
[nv46:97180] [ 0] /usr/lib/libc.so.6(+0x3c770)[0x72422de41770]
[nv46:97180] [ 1] /usr/lib/libc.so.6(+0x8d32c)[0x72422de9232c]
[nv46:97180] [ 2] /usr/lib/libc.so.6(gsignal+0x18)[0x72422de416c8]
[nv46:97180] [ 3] /usr/lib/libc.so.6(abort+0xd7)[0x72422de294b8]
[nv46:97180] [ 4] main(+0x12f4)[0x6239e33802f4]
[nv46:97180] [ 5] main(+0x1585)[0x6239e3380585]
[nv46:97180] [ 6] /usr/lib/libc.so.6(+0x25cd0)[0x72422de2acd0]
[nv46:97180] [ 7] /usr/lib/libc.so.6(__libc_start_main+0x8a)[0x72422de2ad8a]
[nv46:97180] [ 8] main(+0x1165)[0x6239e3380165]
[nv46:97180] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 97180 on node nv46 exited on
signal 6 (Aborted).

Console Output Local:

$ mpirun -np 1 -mca osc ucx main
  PSET 0: mpi://WORLD (len: 12)
  PSET 1: mpi://SELF (len: 11)
  PSET 2: mpix://SHARED (len: 14)
  World Comm Sum (1): 1
  Self Comm Sum (1): 1

Installation

Small Uni-Cluster

UCX Output

Output of configure-release:

configure:           ASAN check:   no
configure:         Multi-thread:   disabled
configure:            MPI tests:   disabled
configure:          VFS support:   yes
configure:        Devel headers:   no
configure: io_demo CUDA support:   no
configure:             Bindings:   < >
configure:          UCS modules:   < fuse >
configure:          UCT modules:   < ib rdmacm cma >
configure:         CUDA modules:   < >
configure:         ROCM modules:   < >
configure:           IB modules:   < >
configure:          UCM modules:   < >
configure:         Perf modules:   < >

Version check after make install:

$UCXFOLDER/myinstall/bin/ucx_info -v
# Library version: 1.17.0
# Library path: ${HOME}/itoyori/ucx/myinstall/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'master', revision a48ad8f
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=${HOME}/itoyori/ucx/myinstall --without-go

OpenMPI

Output of configure:

Open MPI configuration:
-----------------------
Version: 5.0.3
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)
 
Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: installing packaged docs
hwloc: external
libevent: external
Open UCC: no
pmix: external
PRRTE: external
Threading Package: pthreads
 
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: yes
OpenFabrics OFI Libfabric: yes (pkg-config: default search paths)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
 
Accelerators
-----------------------
CUDA support: no
ROCm support: no
 
OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no

Local

UCX Output

Output of configure-release:

configure: =========================================================
configure: UCX build configuration:
configure:         Build prefix:   ${HOME}/ucx/myinstall
configure:    Configuration dir:   ${prefix}/etc/ucx
configure:   Preprocessor flags:   -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler:   gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler:   g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread:   disabled
configure:            MPI tests:   disabled
configure:          VFS support:   yes
configure:        Devel headers:   no
configure: io_demo CUDA support:   no
configure:             Bindings:   < >
configure:          UCS modules:   < fuse >
configure:          UCT modules:   < cma >
configure:         CUDA modules:   < >
configure:         ROCM modules:   < >
configure:           IB modules:   < >
configure:          UCM modules:   < >
configure:         Perf modules:   < >
configure: =========================================================

Version check after make install:

$UCXFOLDER/myinstall/bin/ucx_info -v
# Library version: 1.16.0
# Library path: ${HOME}/ucx/myinstall/lib/libucs.so.0
# API headers version: 1.16.0
# Git branch '', revision e4bb802
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=${HOME}/ucx/myinstall --without-go

OpenMPI Output

Output of configure:

Open MPI configuration:
-----------------------
Version: 5.0.3
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: no
Build MPI Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)
 
Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: installing packaged docs
hwloc: internal
libevent: external
Open UCC: no
pmix: internal
PRRTE: internal
Threading Package: pthreads
 
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: yes
OpenFabrics OFI Libfabric: no (not found)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
 
Accelerators
-----------------------
CUDA support: no
ROCm support: no
 
OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no

MPI and UCX Installation

Folder structure:

${HOME}/ucx
${HOME}/openmpi-5.0.3

Install OpenUCX

cd ${HOME}
git clone https://github.com/openucx/ucx.git
cd ucx
git checkout v1.16.0
export UCXFOLDER=${HOME}/ucx
./autogen.sh
./contrib/configure-release --prefix=$UCXFOLDER/myinstall --without-go

Install:

make -j32
make install

OpenMPI

cd ${HOME}
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz
tar xfvz openmpi-5.0.3.tar.gz
export MPIFOLDER=${HOME}/openmpi-5.0.3
cd $MPIFOLDER
./configure --disable-io-romio --with-io-romio-flags=--without-ze --disable-sphinx --prefix="$MPIFOLDER/myinstall" --with-ucx="$UCXFOLDER/myinstall" 2>&1 | tee config.out

Install:

make -j32 all 2>&1 | tee make.out
make install 2>&1 | tee install.out
export OMPI="${MPIFOLDER}/myinstall"
export PATH=$OMPI/bin:$PATH
export LD_LIBRARY_PATH=$OMPI/lib:$LD_LIBRARY_PATH
