
Expand CUDA support and fix documentation to account for all CUDA-dependent components #12279


Description

@christgau

Background information

What version of Open MPI are you using?

v5.0.1

Describe how Open MPI was installed

Open MPI was installed from a GitHub release tarball. Configuration was done with this command line:

../configure \
        --prefix="${prefix_dir}" \
        --without-psm2 \
        --without-ofi \
        --with-lustre \
        --with-slurm \
        --with-pmix \
        --with-ucx="${UCX_DIR}" \
        --with-cuda="${CUDA_ROOT}" \
        --with-cuda-libdir="${CUDA_ROOT}/lib64/stubs" \
        --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda,coll-cuda

Note that I added coll-cuda to the list of MCA DSOs; I am not sure whether it is intentionally missing from the documentation. I also tried without coll-cuda first, but with the same outcome.

CUDA Toolkit version 12.3 was installed in CUDA_ROOT. UCX was built against that CUDA toolkit. On cluster nodes with the drivers installed, ucx_info -d reports the relevant CUDA and gdrcopy transports.

Remark: The host used for compilation has the CUDA toolkit and runtime installed, but not the driver, so linking against the stub libraries appears to be the way to go in that case (see #12264).

Please describe the system on which you are running

  • Operating system/version: Rocky Linux 8.8
  • Computer hardware: Intel Xeon
  • Network type: InfiniBand

Details of the problem

With Open MPI 4.1.4, I was able to build it such that one could compile and run binaries without the CUDA toolkit, runtime, or driver being available on the node in use. However, with 5.0.1 configured as shown above, the linker warns about a missing libcudart when building a binary (even a basic MPI_Init/MPI_Finalize program):

#include <stdio.h>
#include <stdlib.h>

#include "mpi.h"

int main(int argc, char* argv[])
{
        MPI_Init(&argc, &argv);
        MPI_Finalize();

        return EXIT_SUCCESS;
}
$ mpicc -show hw.c -o hw
gcc hw.c -o hw -I/path/to/openmpi/include -pthread -L/path/to/openmpi/lib -Wl,-rpath -Wl,/path/to/openmpi/lib -Wl,--enable-new-dtags -lmpi
$ mpicc hw.c -o hw
/usr/bin/ld: warning: libcudart.so.12, needed by /path/to/openmpi/lib/libmpi.so, not found (try using -rpath or -rpath-link)
$ mpirun -n1 ./hw
./hw: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory
$ ldd hw
        linux-vdso.so.1 (0x00007ffc747da000)
        libmpi.so.40 => /path/to/openmpi/lib/libmpi.so.40 (0x000014ae23df9000)
        [...]
        libcudart.so.12 => not found

With 4.1.4 I am able to compile and launch without those warnings/errors while still having a CUDA-aware MPI. With 4.1.4, libmpi did not depend on libcudart, even though it was configured using --with-cuda=....

If I understood the SC'23 BoF slides correctly, Open MPI 5.x intends to integrate (link?) plugins directly into libmpi. With the --enable-mca-dso configure option, however, I tried to move all CUDA-related components into DSOs and thus out of libmpi. Nevertheless, libmpi has libcudart as a shared library dependency (see above). I also checked the symbols that libmpi needs, but it does not appear to require anything from libcudart:

$ nm -D /path/to/openmpi/lib/libmpi.so.40 | grep -i cuda
000000000029cdb0 T mca_pml_ob1_rdma_cuda_btls
00000000002c7e20 T MPIX_Query_cuda_support
                 U opal_built_with_cuda_support
                 U opal_cuda_support

So it appears to me that libmpi unnecessarily depends on libcudart. Is there a bug in the configure/compilation process, or is it no longer possible to build the Open MPI libraries such that applications can be compiled without the CUDA runtime libraries being available? Given libmpi's dependency on libcudart, the statement from the documentation

Open MPI supports building with CUDA libraries and running on systems without CUDA libraries or hardware.

does not appear to apply here. Or is there something wrong on my side?

By the way: the test program from the documentation may also deserve a call to MPI_Init when one follows the DSO approach (a sketch with that change follows the outputs below). Otherwise, the run time check reports that there is no CUDA support (using OMPI v5.0.1 with CUDA toolkit 12.3 available for compilation/execution):

$ ./check  # with MPI_Init
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library has CUDA-aware support.
$ ./check-no-init # without MPI_Init
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library does not have CUDA-aware support.
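For reference, here is a minimal sketch of such a check program with MPI_Init/MPI_Finalize added around the run time query. It assumes Open MPI's mpi-ext.h with the CUDA extension (the MPIX_CUDA_AWARE_SUPPORT macro and MPIX_Query_cuda_support()); the print structure follows the documentation's example, and only the init/finalize calls are the suggested addition:

#include <stdio.h>
#include <stdlib.h>

#include "mpi.h"
#include "mpi-ext.h"  /* Open MPI extensions: MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support() */

int main(int argc, char* argv[])
{
        /* Initialize MPI first so that components built as DSOs
         * (e.g. accelerator-cuda) are loaded before the run time query. */
        MPI_Init(&argc, &argv);

        printf("Compile time check:\n");
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        printf("This MPI library has CUDA-aware support.\n");
#elif defined(MPIX_CUDA_AWARE_SUPPORT) && !MPIX_CUDA_AWARE_SUPPORT
        printf("This MPI library does not have CUDA-aware support.\n");
#else
        printf("This MPI library cannot determine if there is CUDA-aware support.\n");
#endif

        printf("Run time check:\n");
#if defined(MPIX_CUDA_AWARE_SUPPORT)
        if (1 == MPIX_Query_cuda_support()) {
                printf("This MPI library has CUDA-aware support.\n");
        } else {
                printf("This MPI library does not have CUDA-aware support.\n");
        }
#else
        printf("This MPI library cannot determine if there is CUDA-aware support.\n");
#endif

        MPI_Finalize();

        return EXIT_SUCCESS;
}

With MPI_Init in place, the run time check matches the ./check output above even when the CUDA components are built as DSOs.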
