improve reduction tests and fix reduction #139
base: develop
Conversation
Using the workflow described in the Testing section of our docs.
Environment:
Force-pushed from fec0675 to 5d48ec8
Ah, this is an intermittent failure on my machine.
I think something funky is/was going on in our test framework. If I break them out into separate binaries, all the tests pass under OpenMPI, but some of the Send/Recv tests fail under MPICH. I wonder if something is subtly wrong with some of our implementation and it was getting MPI into a bad state.
@dssgabriel can you try again? We were missing a fence in one of our send implementations, and I think it was getting MPI into a bad state. To help avoid this kind of problem in the future, I broke categories of tests out into different binaries. However, on my machine, with CUDA 12.2 and OpenMPI 5.0.6 / MPICH 4.2.0, all the tests pass now regardless of whether they're in a single binary or multiple binaries.
I also discovered that GPU-aware Send/Recv seemed to be broken with an MPICH that used ch4 + libfabric. It works under MPICH configured with ch4 + ucx.
Sure @cwpearson, checking out the latest push now 👍
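For context on the missing fence, here is a minimal sketch of the kind of synchronization being described, assuming a Kokkos + MPI setup (the function name and View type are illustrative, not the project's actual API):

```c++
#include <Kokkos_Core.hpp>
#include <mpi.h>

// If the send buffer is filled by an asynchronous kernel, MPI can read it
// before the kernel finishes unless we fence (synchronize) first.
void send_view(Kokkos::View<double*> buf, int dest, int tag, MPI_Comm comm) {
  Kokkos::fence();  // make sure any kernels writing `buf` have completed
  MPI_Send(buf.data(), static_cast<int>(buf.extent(0)), MPI_DOUBLE, dest, tag,
           comm);
}
```

Without such a fence, failures tend to show up intermittently because they depend on whether the kernel happens to finish before MPI touches the buffer.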
All tests are passing with OpenMPI 5.0.6 + CUDA 12.6 ✔️ However, I have some failing unit tests with MPICH 4.2.3 (built with libfabric; see the `spack spec` below).
Test 5 output:
Test 6 output: I am not sure why it reports that this test is failing, as 32/48 pass correctly and the remaining 16 are just skipped.
Perf tests are also failing with MPICH:
`spack spec` for MPICH on my machine:
That's kind of similar to what I was seeing.
This reminds me of the sorts of things that were going wrong when Kokkos enabled async CUDA malloc by default, which was reverted in 4.5.00.
That error is in the contiguous case, where we really don't do anything beyond calling MPI directly on the raw pointer.
I can reproduce the MPICH problems in pure CUDA + MPI:

main.cu:

```c++
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <iostream>
#include <cstdint>      // std::uintptr_t, int32_t, int64_t
#include <cstddef>      // std::size_t
#include <type_traits>  // std::is_same_v, std::is_void_v
/*
Author: Carl Pearson (cwpears@sandia.gov)
Basically a send/recv test with two ranks.
Send-side sets buf(i) = i
Recv side checks that that's what it gets.
Optionally, the buffers can be offset from the start of the allocation by a configurable
alignment amount, i.e. the allocated buffer is larger, and we don't use offset 0 as the
beginning of the send/recv buffer
There's a bit of asymmetry that shouldn't matter: both ranks allocate the recv buffer,
only the send side allocates the send buffer.
The reason this is done is because the MPICH errors seem sensitive to how many
CUDA allocations are in the code, and this triggers the error case.
In MPICH 4.2.3 + ofi, we get some errors:
* With a 0 alignment offset, eventually the recv side gets garbage.
* With a 128 alignment offset, we get IPC handle mapping errors.
This works in Open MPI 5.0.5
*/
// Macro to check for CUDA errors
#define CUDA(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
std::cerr << "CUDA error in file '" << __FILE__ \
<< "' in line " << __LINE__ << ": " \
<< cudaGetErrorString(err) << " (" << err << ")" \
<< std::endl; \
exit(EXIT_FAILURE); \
} \
} while (0)
// a(i) = i
template <typename Scalar>
__global__ void init_array(Scalar* a, int sz) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < sz) {
a[i] = Scalar(i);
}
}
// check a(i) = i
template <typename Scalar>
__global__ void check_array(const Scalar* a, int sz, int* errs) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < sz && a[i] != Scalar(i)) {
atomicAdd(errs, 1);
}
}
// get a builtin MPI_Datatype for int32_t, int64_t, float
template <typename Scalar>
MPI_Datatype mpi_type() {
if constexpr (std::is_same_v<Scalar, int32_t>) {
return MPI_INT;
} else if constexpr (std::is_same_v<Scalar, int64_t>) {
return MPI_LONG_LONG;
} else if constexpr (std::is_same_v<Scalar, float>) {
return MPI_FLOAT;
} else {
static_assert(std::is_void_v<Scalar>, "unsupported type");
}
}
// if alignment is 0, return ptr
// else, return the next aligned version of ptr+1
void* align_next(void* ptr, std::size_t alignment) {
if (0 == alignment) return ptr;
std::uintptr_t p = reinterpret_cast<std::uintptr_t>(ptr);
// would be p + alignment - 1 if we weren't getting the next one
std::uintptr_t aligned_p = (p + alignment) & ~(alignment - 1);
return reinterpret_cast<void*>(aligned_p);
}
template <typename Scalar>
void run_test(int num_elements, int alignment, bool use_ssend) {
// get a string name of the Scalar type
const char *name;
if constexpr (std::is_same_v<Scalar, int32_t>) {
name = "int32_t";
} else if constexpr (std::is_same_v<Scalar, float>) {
name = "float";
} else if constexpr (std::is_same_v<Scalar, int64_t>) {
name = "int64_t";
} else {
static_assert(std::is_void_v<Scalar>, "unsupported type");
}
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (2 != size) {
MPI_Abort(MPI_COMM_WORLD, 1);
}
if (0 == rank) {
std::cerr << __FILE__ << ":" << __LINE__ << " test: " << num_elements << " " << name << "\n";
}
Scalar* d_recv_buf;
int* d_errs;
size_t buffer_size = num_elements * sizeof(Scalar) + alignment;
CUDA(cudaMalloc(&d_recv_buf, buffer_size));
CUDA(cudaMalloc(&d_errs, sizeof(int)));
CUDA(cudaMemset(d_errs, 0, sizeof(int)));
Scalar* recv_buf = reinterpret_cast<Scalar*>(align_next(d_recv_buf, alignment));
if (rank == 0) {
Scalar* d_send_buf;
CUDA(cudaMalloc(&d_send_buf, buffer_size));
Scalar* send_buf = reinterpret_cast<Scalar*>(align_next(d_send_buf, alignment));
init_array<<<(num_elements + 255) / 256, 256>>>(send_buf, num_elements);
CUDA(cudaDeviceSynchronize());
std::cerr << __FILE__ << ":" << __LINE__ << " send: " << d_send_buf << " " << send_buf << "\n";
if (use_ssend) {
MPI_Ssend(send_buf, num_elements, mpi_type<Scalar>(), 1, 0, MPI_COMM_WORLD);
} else {
MPI_Send(send_buf, num_elements, mpi_type<Scalar>(), 1, 0, MPI_COMM_WORLD);
}
CUDA(cudaFree(d_send_buf));
} else if (rank == 1) {
int h_errs = 0;
std::cerr << __FILE__ << ":" << __LINE__ << " recv: " << d_recv_buf << " " << recv_buf << "\n";
MPI_Recv(recv_buf, num_elements, mpi_type<Scalar>(), 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
check_array<<<(num_elements + 255) / 256, 256>>>(recv_buf, num_elements, d_errs);
CUDA(cudaMemcpy(&h_errs, d_errs, sizeof(int), cudaMemcpyDeviceToHost));
if (h_errs) {
std::cerr << __FILE__ << ":" << __LINE__ << " h_errs=" << h_errs << "\n";
MPI_Abort(MPI_COMM_WORLD, 1);
}
}
CUDA(cudaFree(d_recv_buf));
CUDA(cudaFree(d_errs));
}
template<typename Scalar>
void run_test() {
const int alignment = 128;
for (size_t _ : {0,1,2}) {
for (size_t n : {113, 16, 8, 4, 2, 1}) {
MPI_Barrier(MPI_COMM_WORLD);
run_test<Scalar>(n, alignment, false /* MPI_Send */);
MPI_Barrier(MPI_COMM_WORLD);
}
// for (size_t n : {113, 16, 8, 4, 2, 1}) {
// run_test<Scalar>(n, alignment, true /*MPI_Ssend*/);
// }
}
}
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
run_test<int>();
run_test<int64_t>();
run_test<float>();
MPI_Finalize();
return 0;
}
```

CMakeLists.txt:

```cmake
cmake_minimum_required(VERSION 3.23 FATAL_ERROR)
project(MpichOfi LANGUAGES CXX CUDA)
find_package(MPI REQUIRED)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)
enable_language(CUDA)
add_executable(main main.cu)
target_link_libraries(main MPI::MPI_CXX)
set_target_properties(main PROPERTIES
CUDA_ARCHITECTURES "86"
)
```

@dssgabriel can you reproduce this CUDA+MPI case as well with that same MPICH you used before?
@cwpearson I can reproduce and obtain the same crash output as you do (using the same CUDA + MPICH configuration I tested last week):
Full output:
Launch command is
This is a possible bug in MPICH: pmodels/mpich#7304. Either way, that issue may help shed some light.
As Ken Raffenetti mentioned in the answer to your issue, the failure goes away if I run with the environment variable he suggested (which disables the IPC cache). I am not sure what this means for us and what needs fixing, though... Output:
I don't have any particular knowledge about MPICH's development, but if they do conclude it's a bug, they might be able to point us at an older version of MPICH that doesn't have it for us to test again. This will also be relevant for standing up GPU CI builds. We could also test with that environment variable set, but disabling the IPC cache is probably not what you want if you care about latency.
In any case, let me tidy up the CUDA/MPI reproducer into a unit test, and we can maybe get this merged in.
Sure! I might only approve tomorrow morning though, depending on when this is ready (almost 11pm here 😆)
Okay, the final conclusion is a combination of MPICH requiring dynamic linking against cudart, and also needing to link MPI before CUDA. I've incorporated both of those changes into this PR.
# gh/pmodels/mpich#7304: MPICH 4.2.3 and 4.3.0 (and maybe earlier) rely on being
# linked before CUDA and friends, so link MPI before Kokkos.
# TODO_CMAKE: in CMake 3.31, we can use LINK_LIBRARIES_STRATEGY to explicitly ask CMake not to
# reorder how libraries are linked. In the meantime, we have to just hope (?).
Is this behavior MPICH-specific or does it also apply to Open MPI?
Open MPI worked either way in my tests.
Looks good to me! Everything is passing on my end.
Do we want to merge #130 first, or should we drop it and merge this directly instead (as it includes the changes)?
Is the order important with the dynamic linking? I am not totally sure that we can control the order in which the dynamic loader will proceed.
I would rather merge the two PRs.
I think I can split this one up into some individual pieces too.
Unfortunately, it seems the order matters. This is discussed somewhat in pmodels/mpich#7304. CMake 3.31 seems to have an option where we can prevent libraries from being reordered. Prior to that, CMake generally seems to respect the order in which you attach link options to the target, but as you identify, I don't think it promises to do so.
Alternatively, we could require CMake 3.31? I believe this is the default version on Ubuntu 24.04, and CMake is pretty easy to install in userspace anyway.
I don't remember who pointed it out, but someone had recalled that Kokkos doesn't require such a new version, and perhaps we wanted to align with Kokkos there. I'm not endorsing that, but I wish I remembered who had that concern.
I'd prefer to stick with what Kokkos requires unless it's an absolute crisis. |
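For reference, a minimal sketch of how those two conclusions (MPI linked before Kokkos/CUDA, and a dynamically linked cudart) could look in CMake; the project, target, and source names here are made up for illustration, and this is not the PR's actual build code:

```cmake
cmake_minimum_required(VERSION 3.23 FATAL_ERROR)
project(LinkOrderSketch LANGUAGES CXX CUDA)

find_package(MPI REQUIRED)
find_package(Kokkos REQUIRED)

add_executable(app app.cu)

# List MPI before Kokkos (which pulls in CUDA) so MPICH ends up earlier on the
# link line; CMake >= 3.31 can pin this ordering via the LINK_LIBRARIES_STRATEGY
# target property, while earlier versions generally preserve the order given here.
target_link_libraries(app PRIVATE MPI::MPI_CXX Kokkos::kokkos)

# Link the CUDA runtime dynamically rather than statically, since MPICH expects
# a shared cudart.
set_target_properties(app PROPERTIES CUDA_RUNTIME_LIBRARY Shared)
```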
Force-pushed from 1bc50a6 to f5be308
Force-pushed from f5be308 to 8c958bd
Breaking this up into multiple PRs