
improve reduction tests and fix reduction #139

Open · wants to merge 22 commits into develop

Conversation

@dssgabriel
Collaborator

dssgabriel commented Feb 10, 2025

Using the workflow described in the Testing section of our docs, the Reduce test still fails (although with a different value than previously: was 10, now 65536).

Environment:

  • GCC 11.5
  • CUDA 12.6
  • OpenMPI 5.0.6
  • Kokkos develop branch (fresh pull & install)
3: [----------] 4 tests from Reduce/0, where TypeParam = int
3: [ RUN      ] Reduce/0.1D_contig_contig
3: /tmp/kokkos-comm/unit_tests/mpi/test_reduce.cpp:68: Failure
3: Expected equality of these values:
3:   errs
3:     Which is: 65536
3:   0
3:
3: (rank 0 failed)
3: [UN00317885-UNAL:00000] *** An error occurred in MPI_Reduce
3: [UN00317885-UNAL:00000] *** reported by process [3643342849,0]
3: [UN00317885-UNAL:00000] *** on communicator MPI_COMM_WORLD
3: [UN00317885-UNAL:00000] *** MPI_ERR_TRUNCATE: message truncated
3: [UN00317885-UNAL:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
3: [UN00317885-UNAL:00000] ***    and MPI will try to terminate your MPI job as well)
3: --------------------------------------------------------------------------
3: prterun has exited due to process rank 0 with PID 0 on node UN00317885-UNAL calling
3: "abort". This may have caused other processes in the application to be
3: terminated by signals sent by prterun (as reported here).
3: --------------------------------------------------------------------------
3:
3/3 Test #3: test-main ........................***Failed    0.62 sec

@cwpearson
Collaborator Author

Ah, this is an intermittent failure on my machine.

@cwpearson
Collaborator Author

I think something funky is/was going on in our test framework. If I break the tests out into separate binaries, they all pass under OpenMPI, but some of the Send/Recv tests fail under MPICH. I wonder if something is subtly wrong in our implementation and it was getting MPI into a bad state.

@cwpearson
Collaborator Author

cwpearson commented Feb 14, 2025

@dssgabriel can you try again? We were missing a fence in one of our send implementations:

6d692f8 (#139)

and I think it was getting MPI into a bad state.

To help avoid this kind of problem in the future, I broke categories of tests out into different binaries. However, on my machine, with CUDA 12.2, OpenMPI 5.0.6 / MPICH 4.2.0, all the tests pass now regardless of whether they're in a single binary or multiple binaries.
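For context, the hazard a missing fence creates looks roughly like this (pseudocode sketch of the ordering problem, not the actual KokkosComm source):

```text
launch_kernel(pack_into(send_buf))   # asynchronous on a stream
# MISSING: fence() -- wait for the kernel to finish writing send_buf
MPI_Send(send_buf, ...)              # MPI may read send_buf before the kernel wrote it

# with the fix:
launch_kernel(pack_into(send_buf))
fence()                              # e.g. Kokkos::fence()
MPI_Send(send_buf, ...)
```

Once MPI has read a partially written (or stale) buffer, later matching behavior can go wrong in confusing, non-local ways, which would be consistent with the intermittent failures above.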

@cwpearson
Collaborator Author

cwpearson commented Feb 14, 2025

I also discovered that GPU-aware Send/Recv seemed to be broken with an MPICH that used ch4 + libfabric. It works under MPICH configured with ch4 + ucx.

Intermittently the MPI_Recv would complete too early (e.g., before the matching MPI_Send was even called) and return some garbage data.

@dssgabriel
Collaborator

Sure @cwpearson, checking out the latest push now 👍

@dssgabriel
Collaborator

dssgabriel commented Feb 14, 2025

All tests are passing with OpenMPI 5.0.6 + CUDA 12.6 ✔️

However, I have some failing unit tests with MPICH 4.2.3 (built with libfabric, see spack spec at the bottom) + CUDA 12.6:

78% tests passed, 2 tests failed out of 9

Total Test time (real) =   4.18 sec

The following tests FAILED:
          5 - test-sendrecv (Failed)
          6 - test-isendrecv (Failed)
Test 5 output:
5: [ RUN      ] MpiSendRecv/2.1D_contig_standard
5: Abort(273299215) on node 1 (rank 1 in comm 0): Fatal error in internal_Recv: Other MPI error, error stack:
5: internal_Recv(62556)...........: MPI_Recv(buf=0x7f822d0b0080, count=113, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, status=0x1) failed
5: MPIR_Wait(785).................:
5: MPIDI_IPCI_handle_lmt_recv(230):
5: MPIDI_GPU_ipc_handle_map(355)..:  gpu_ipc_handle_map failed
5/9 Test #5: test-sendrecv ....................***Failed    0.50 sec
Test 6 output:

I am not sure why this test is reported as failing, since 32/48 tests pass and the remaining 16 are simply skipped:

6: [----------] Global test environment tear-down
6: [==========] 48 tests from 8 test suites ran. (21 ms total)
6: [  PASSED  ] 32 tests.
6: [  SKIPPED ] 16 tests, listed below:
6: [  SKIPPED ] IsendRecv/0.1D_contig_ready
6: [  SKIPPED ] IsendRecv/0.1D_noncontig_ready
6: [  SKIPPED ] IsendRecv/1.1D_contig_ready
6: [  SKIPPED ] IsendRecv/1.1D_noncontig_ready
6: [  SKIPPED ] IsendRecv/2.1D_contig_ready
6: [  SKIPPED ] IsendRecv/2.1D_noncontig_ready
6: [  SKIPPED ] IsendRecv/3.1D_contig_ready
6: [  SKIPPED ] IsendRecv/3.1D_noncontig_ready
6: [  SKIPPED ] IsendRecv/4.1D_contig_ready
6: [  SKIPPED ] IsendRecv/4.1D_noncontig_ready
6: [  SKIPPED ] IsendRecv/5.1D_contig_ready
6: [  SKIPPED ] IsendRecv/5.1D_noncontig_ready
6: [  SKIPPED ] IsendRecv/6.1D_contig_ready
6: [  SKIPPED ] IsendRecv/6.1D_noncontig_ready
6: [  SKIPPED ] IsendRecv/7.1D_contig_ready
6: [  SKIPPED ] IsendRecv/7.1D_noncontig_ready
6/9 Test #6: test-isendrecv ...................***Failed    0.50 sec

Perf tests are also failing with MPICH:

1: benchmark_osu_latency_Kokkos_Comm_mpi_sendrecv/2097152/manual_time         17.3 us         20.2 us        40831 bytes=4.1943M
1: Abort(944387855) on node 1 (rank 1 in comm 0): Fatal error in internal_Recv: Other MPI error, error stack:
1: internal_Recv(62556)...........: MPI_Recv(buf=0x7f17ee200080, count=16777216, MPI_CHAR, 0, 0, MPI_COMM_WORLD, status=0x1) failed
1: MPIR_Wait(785).................:
1: MPIDI_IPCI_handle_lmt_recv(230):
1: MPIDI_GPU_ipc_handle_map(355)..:  gpu_ipc_handle_map failed
1/1 Test #1: perf-test-main ...................***Failed   21.91 sec

0% tests passed, 1 tests failed out of 1
`spack spec` for MPICH on my machine:
[+]  mpich@4.2.3%gcc@11.5.0~argobots+cuda+fortran+hwloc+hydra~level_zero+libxml2+pci~rocm+romio~slurm~vci~verbs+wrapperrpath~xpmem build_system=autotools cuda_arch=89 datatype-engine=auto device=ch4 netmod=ofi pmi=default arch=linux-almalinux9-skylake
[+]      ^cuda@12.6.3%gcc@11.5.0~allow-unsupported-compilers~dev build_system=generic arch=linux-almalinux9-skylake
[e]      ^findutils@4.8.0%gcc@11.5.0 build_system=autotools patches=440b954 arch=linux-almalinux9-skylake
[+]      ^gcc-runtime@11.5.0%gcc@11.5.0 build_system=generic arch=linux-almalinux9-skylake
[e]      ^glibc@2.34%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[e]      ^gmake@4.3%gcc@11.5.0~guile build_system=generic patches=599f134 arch=linux-almalinux9-skylake
[+]      ^hwloc@2.11.1%gcc@11.5.0~cairo~cuda~gl~level_zero~libudev+libxml2~nvml~opencl+pci~rocm build_system=autotools libs=shared,static arch=linux-almalinux9-skylake
[+]          ^ncurses@6.5%gcc@11.5.0~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-almalinux9-skylake
[+]      ^libfabric@1.22.0%gcc@11.5.0~cuda~debug~kdreg~level_zero~uring build_system=autotools fabrics=sockets,tcp,udp arch=linux-almalinux9-skylake
[+]      ^libpciaccess@0.17%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[+]          ^util-macros@1.20.1%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[+]      ^libxml2@2.13.5%gcc@11.5.0~http+pic~python+shared build_system=autotools arch=linux-almalinux9-skylake
[+]          ^libiconv@1.17%gcc@11.5.0 build_system=autotools libs=shared,static arch=linux-almalinux9-skylake
[+]          ^xz@5.4.6%gcc@11.5.0~pic build_system=autotools libs=shared,static arch=linux-almalinux9-skylake
[+]          ^zlib-ng@2.2.3%gcc@11.5.0+compat+new_strategies+opt+pic+shared build_system=autotools arch=linux-almalinux9-skylake
[e]      ^pkgconf@1.7.3%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[+]      ^yaksa@0.3%gcc@11.5.0+cuda~level_zero~rocm build_system=autotools cuda_arch=none arch=linux-almalinux9-skylake
[e]          ^autoconf@2.69%gcc@11.5.0 build_system=autotools patches=7793209 arch=linux-almalinux9-skylake
[e]          ^automake@1.16.2%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[e]          ^libtool@2.4.6%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[e]          ^m4@1.4.19%gcc@11.5.0+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-almalinux9-skylake
[+]          ^python@3.13.1%gcc@11.5.0+bz2+ctypes+dbm~debug+libxml2+lzma~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic arch=linux-almalinux9-skylake
[+]              ^bzip2@1.0.8%gcc@11.5.0~debug~pic+shared build_system=generic arch=linux-almalinux9-skylake
[e]                  ^diffutils@3.7%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[+]              ^expat@2.6.4%gcc@11.5.0+libbsd build_system=autotools arch=linux-almalinux9-skylake
[+]                  ^libbsd@0.12.2%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[+]                      ^libmd@1.0.4%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[+]              ^gdbm@1.24%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[e]              ^gettext@0.21%gcc@11.5.0+bzip2+curses+git~libunistring+libxml2+pic+shared+tar+xz build_system=autotools arch=linux-almalinux9-skylake
[+]              ^libffi@3.4.6%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake
[e]              ^openssl@3.2.2%gcc@11.5.0~docs+shared build_system=generic certs=mozilla arch=linux-almalinux9-skylake
[+]              ^readline@8.2%gcc@11.5.0 build_system=autotools patches=1ea4349,24f587b,3d9885e,5911a5b,622ba38,6c8adf8,758e2ec,79572ee,a177edc,bbf97f1,c7b45ff,e0013d9,e065038 arch=linux-almalinux9-skylake
[+]              ^sqlite@3.46.0%gcc@11.5.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-almalinux9-skylake
[+]              ^util-linux-uuid@2.40.2%gcc@11.5.0 build_system=autotools arch=linux-almalinux9-skylake

@cwpearson
Collaborator Author

That's kind of similar to what I was seeing.

@cwpearson
Collaborator Author

cwpearson commented Feb 14, 2025

This reminds me of the sorts of things that were going wrong when Kokkos enabled async CUDA malloc by default, which was reverted in 4.5.00.

@cwpearson
Collaborator Author

cwpearson commented Feb 14, 2025

That error is in the contiguous case where we really don't do anything beyond calling MPI_Send and MPI_Recv. I'm a bit stumped.

@cwpearson
Collaborator Author

cwpearson commented Feb 17, 2025

I can reproduce the MPICH problems in pure CUDA + MPI:

main.cu
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <iostream>
#include <cstddef>      // std::size_t
#include <cstdint>      // std::int32_t, std::int64_t, std::uintptr_t
#include <type_traits>  // std::is_same_v, std::is_void_v

/*
Author: Carl Pearson (cwpears@sandia.gov)

Basically a send/recv test with two ranks.
Send-side sets buf(i) = i
Recv side checks that that's what it gets.

Optionally, the buffers can be offset from the start of the allocation by a configurable
alignment amount, i.e. the allocated buffer is larger, and we don't use offset 0 as the
beginning of the send/recv buffer

There's a bit of asymmetry that shouldn't matter: both ranks allocate the recv buffer,
only the send side allocates the send buffer.
The reason this is done is because the MPICH errors seem sensitive to how many
CUDA allocations are in the code, and this triggers the error case.

In MPICH 4.2.3 + ofi, we get some errors:
* With a 0 alignment offset, eventually the recv side gets garbage.
* With a 128 alignment offset, we get IPC handle mapping errors.
This works in Open MPI 5.0.5
*/

// Macro to check for CUDA errors
#define CUDA(call)                                           \
    do {                                                                 \
        cudaError_t err = call;                                          \
        if (err != cudaSuccess) {                                        \
            std::cerr << "CUDA error in file '" << __FILE__              \
                      << "' in line " << __LINE__ << ": "                \
                      << cudaGetErrorString(err) << " (" << err << ")"   \
                      << std::endl;                                      \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// a(i) = i
template <typename Scalar>
__global__ void init_array(Scalar* a, int sz) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < sz) {
    a[i] = Scalar(i);
  }
}

// check a(i) = i
template <typename Scalar>
__global__ void check_array(const Scalar* a, int sz, int* errs) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < sz && a[i] != Scalar(i)) {
    atomicAdd(errs, 1);
  }
}

// get a builtin MPI_Datatype for int32_t, int64_t, float
template <typename Scalar>
MPI_Datatype mpi_type() {
  if constexpr (std::is_same_v<Scalar, int32_t>) {
    return MPI_INT;
  } else if constexpr (std::is_same_v<Scalar, int64_t>) {
    return MPI_LONG_LONG;
  } else if constexpr (std::is_same_v<Scalar, float>) {
    return MPI_FLOAT; 
  } else {
    static_assert(std::is_void_v<Scalar>, "unsupported type");
  }
}

// if alignment is 0, return ptr
// else, return the next aligned version of ptr+1
void* align_next(void* ptr, std::size_t alignment) {
  if (0 == alignment) return ptr;
  std::uintptr_t p = reinterpret_cast<std::uintptr_t>(ptr);
  // would be p + alignment - 1 if we weren't getting the next one
  std::uintptr_t aligned_p = (p + alignment) & ~(alignment - 1); 
  return reinterpret_cast<void*>(aligned_p);
}

template <typename Scalar>
void run_test(int num_elements, int alignment, bool use_ssend) {

  // get a string name of the Scalar type
  const char *name;
  if constexpr (std::is_same_v<Scalar, int32_t>) {
    name = "int32_t";
  } else if constexpr (std::is_same_v<Scalar, float>) {
    name = "float";
  } else if constexpr (std::is_same_v<Scalar, int64_t>) {
    name = "int64_t";
  } else {
    static_assert(std::is_void_v<Scalar>, "unsupported type");
  }

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (2 != size) {
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  
  if (0 == rank) {
    std::cerr << __FILE__ << ":" << __LINE__ << " test: " << num_elements << " " << name << "\n";
  }

  Scalar* d_recv_buf;
  int* d_errs;

  size_t buffer_size = num_elements * sizeof(Scalar) + alignment;
  
  CUDA(cudaMalloc(&d_recv_buf, buffer_size));
  CUDA(cudaMalloc(&d_errs, sizeof(int)));
  CUDA(cudaMemset(d_errs, 0, sizeof(int)));
  Scalar* recv_buf = reinterpret_cast<Scalar*>(align_next(d_recv_buf, alignment));
    
  if (rank == 0) {
    Scalar* d_send_buf;
    CUDA(cudaMalloc(&d_send_buf, buffer_size));
    Scalar* send_buf = reinterpret_cast<Scalar*>(align_next(d_send_buf, alignment));
    init_array<<<(num_elements + 255) / 256, 256>>>(send_buf, num_elements);
    CUDA(cudaDeviceSynchronize());

    std::cerr << __FILE__ << ":" << __LINE__ << " send: " << d_send_buf << " " << send_buf << "\n";
    if (use_ssend) {
      MPI_Ssend(send_buf, num_elements, mpi_type<Scalar>(), 1, 0, MPI_COMM_WORLD);
    } else {
      MPI_Send(send_buf, num_elements, mpi_type<Scalar>(), 1, 0, MPI_COMM_WORLD);
    }

    CUDA(cudaFree(d_send_buf));
  } else if (rank == 1) {

    int h_errs = 0;

    std::cerr << __FILE__ << ":" << __LINE__ << " recv: " << d_recv_buf << " " << recv_buf << "\n";
    MPI_Recv(recv_buf, num_elements, mpi_type<Scalar>(), 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    check_array<<<(num_elements + 255) / 256, 256>>>(recv_buf, num_elements, d_errs);
    CUDA(cudaMemcpy(&h_errs, d_errs, sizeof(int), cudaMemcpyDeviceToHost));
    if (h_errs) {
      std::cerr << __FILE__ << ":" << __LINE__ << " h_errs=" << h_errs << "\n";
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
  }

  CUDA(cudaFree(d_recv_buf));
  CUDA(cudaFree(d_errs));
}

template<typename Scalar>
void run_test() {
  const int alignment = 128;
  
  for (int rep = 0; rep < 3; ++rep) {  // repeat: the failures are intermittent
    for (size_t n : {113, 16, 8, 4, 2, 1}) {
        MPI_Barrier(MPI_COMM_WORLD);
        run_test<Scalar>(n, alignment, false /* MPI_Send */);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    // for (size_t n : {113, 16, 8, 4, 2, 1}) {
    //   run_test<Scalar>(n, alignment, true /*MPI_Ssend*/);
    // }
  }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    run_test<int>();
    run_test<int64_t>();
    run_test<float>();
    MPI_Finalize();
    return 0;
}
CMakeLists.txt
cmake_minimum_required(VERSION 3.23 FATAL_ERROR)

project(MpichOfi LANGUAGES CXX CUDA)

find_package(MPI REQUIRED)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

enable_language(CUDA)

add_executable(main main.cu)

target_link_libraries(main MPI::MPI_CXX)

set_target_properties(main PROPERTIES
    CUDA_ARCHITECTURES "86"
)

With alignment=0, the recv side eventually just gets garbage.
With alignment=128, we get internal MPI errors in MPI_Recv:

Abort(206190351) on node 1 (rank 1 in comm 0): Fatal error in internal_Recv: Other MPI error, error stack:
internal_Recv(62556)...........: MPI_Recv(buf=0x70239de00080, count=113, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, status=0x1) failed
MPIR_Wait(785).................: 
MPIDI_IPCI_handle_lmt_recv(230): 
MPIDI_GPU_ipc_handle_map(355)..:  gpu_ipc_handle_map failed

@dssgabriel can you reproduce this CUDA+MPI case as well with that same MPICH you used before?

@dssgabriel
Collaborator

@cwpearson I can reproduce it and get the same crash output as you (using the same CUDA + MPICH configuration I tested last week):

Abort(810170127) on node 1 (rank 1 in comm 0): Fatal error in internal_Recv: Other MPI error, error stack:
internal_Recv(62556)...........: MPI_Recv(buf=0x7fbaeb000080, count=113, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, status=0x1) failed
MPIR_Wait(785).................:
MPIDI_IPCI_handle_lmt_recv(230):
MPIDI_GPU_ipc_handle_map(355)..:  gpu_ipc_handle_map failed
Full output

Launch command is mpiexec -np 2 ./build/main:

/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000600 0x7ff6a3000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000600 0x7ff6a3000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000600 0x7ff6a3000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000800 0x7ff6a3000880
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000800 0x7ff6a3000880
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000800 0x7ff6a3000880
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000400 0x7ff6a3000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7fbaeb000000 0x7fbaeb000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7ff6a3000600 0x7ff6a3000680
Abort(810170127) on node 1 (rank 1 in comm 0): Fatal error in internal_Recv: Other MPI error, error stack:
internal_Recv(62556)...........: MPI_Recv(buf=0x7fbaeb000080, count=113, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, status=0x1) failed
MPIR_Wait(785).................:
MPIDI_IPCI_handle_lmt_recv(230):
MPIDI_GPU_ipc_handle_map(355)..:  gpu_ipc_handle_map failed

@cwpearson
Collaborator Author

This is a possible bug in MPICH:

pmodels/mpich#7304

Either way, that issue may help shed some light.

@dssgabriel
Collaborator

As Ken Raffenetti mentioned in his reply to your issue, if I run with MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=disabled, the MPICH internal error disappears.

I am not sure what this means for us or what needs fixing, though...

Output
$ MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=disabled mpirun -n 2 build/main

/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000600 0x7fdd43000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000600 0x7fdd43000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000600 0x7fdd43000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int32_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000800 0x7fdd43000880
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000800 0x7fdd43000880
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000800 0x7fdd43000880
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 int64_t
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000600 0x7fdd43000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000600 0x7fdd43000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 113 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000600 0x7fdd43000680
/tmp/cuda-mpich-reproducer/main.cu:107 test: 16 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 8 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 4 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 2 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480
/tmp/cuda-mpich-reproducer/main.cu:107 test: 1 float
/tmp/cuda-mpich-reproducer/main.cu:139 recv: 0x7f3447000000 0x7f3447000080
/tmp/cuda-mpich-reproducer/main.cu:127 send: 0x7fdd43000400 0x7fdd43000480

@cwpearson
Collaborator Author

cwpearson commented Feb 17, 2025

I don't have any particular knowledge about MPICH's development, but if they do conclude it's a bug they might be able to point us at an older version of MPICH that doesn't have it for us to test again. This will also be relevant for standing up GPU CI builds.

We could also test with that environment variable set, but disabling the IPC cache is probably not what you want if you care about latency.

@cwpearson
Collaborator Author

In any case, let me tidy up the CUDA/MPI reproducer into a unit test, and we can maybe get this merged in.

@dssgabriel
Collaborator

Sure! I might approve only tomorrow morning though, depending on when this is ready (almost 11pm here 😆)

@cwpearson
Collaborator Author

Okay, the final conclusion is a combination of MPICH requiring dynamic linking against cudart, and also we need to link MPI before CUDA. I've incorporated both those changes into this PR.
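Roughly, the two fixes can be sketched in CMake as follows. This is a minimal sketch, not the actual build files from this PR: the target name `test-main` is illustrative, and only the imported targets (`MPI::MPI_CXX`, `Kokkos::kokkos`, `CUDA::cudart`) are standard CMake/Kokkos names.

```cmake
# Sketch only: the executable name is hypothetical, not taken from this PR.
find_package(MPI REQUIRED)
find_package(CUDAToolkit REQUIRED)
find_package(Kokkos REQUIRED)

add_executable(test-main main.cu)

# 1. List MPI *before* Kokkos/CUDA so MPI comes first on the link line
#    (works around gh/pmodels/mpich#7304 with MPICH 4.2.3/4.3.0).
# 2. Use the shared CUDA runtime (CUDA::cudart) rather than
#    CUDA::cudart_static, since MPICH's CUDA support relies on dynamic
#    linking against cudart.
target_link_libraries(test-main PRIVATE MPI::MPI_CXX Kokkos::kokkos CUDA::cudart)
```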

@cwpearson cwpearson requested a review from dssgabriel February 18, 2025 17:12
Comment on lines +116 to +119
# gh/pmodels/mpich#7304: MPICH 4.2.3 and 4.3.0 (and maybe earlier) rely on being
# linked before CUDA and friends, so link MPI before Kokkos.
# TODO_CMAKE: in CMake 3.31, we can use LINK_LIBRARIES_STRATEGY to explicitly ask CMake not to
# reorder how libraries are linked. In the mean time, we have to just hope (?).
Collaborator

Is this behavior MPICH-specific or does it also apply to Open MPI?

Collaborator Author

Open MPI worked either way in my tests.

dssgabriel
dssgabriel previously approved these changes Feb 18, 2025
Collaborator

@dssgabriel dssgabriel left a comment

Looks good to me! Everything is passing on my end.

@dssgabriel
Collaborator

Do we want to merge #130 first, or should we drop it and merge this directly instead? (as it includes the changes)

@cedricchevalier19
Member

Okay, the final conclusion is a combination of MPICH requiring dynamic linking against cudart, and also we need to link MPI before CUDA. I've incorporated both those changes into this PR.

Is the order important with the dynamic linking? I am not totally sure that we can control the order in which the dynamic loader will proceed.

@cedricchevalier19
Member

Do we want to merge #130 first, or should we drop it and merge this directly instead? (as it includes the changes)

I would rather merge the 2 PRs.

@cwpearson
Collaborator Author

I think I can split this one up into some individual pieces too.

@cwpearson
Collaborator Author

Okay, the final conclusion is a combination of MPICH requiring dynamic linking against cudart, and also we need to link MPI before CUDA. I've incorporated both those changes into this PR.

Is the order important with the dynamic linking? I am not totally sure that we can control the order in which the dynamic loader will proceed.

Unfortunately, it seems the order matters. This is discussed somewhat in pmodels/mpich#7304. CMake 3.31 seems to have an option where we can prevent libraries from being reordered. Prior to that, CMake generally seems to respect the order in which you attach link options to the target, but as you identify, I don't think it promises to do so.
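For reference, a hedged sketch of the CMake 3.31 knob mentioned above. The property and value names are taken from my reading of the CMake 3.31 release notes, and `test-main` is an illustrative target name, not one from this PR:

```cmake
# Requires CMake >= 3.31. Ask CMake not to de-duplicate/reorder link items,
# so MPI stays ahead of CUDA on the final link line.
set_target_properties(test-main PROPERTIES
  LINK_LIBRARIES_STRATEGY PRESERVE_ORDER)

# Or apply it project-wide via the corresponding variable:
# set(CMAKE_LINK_LIBRARIES_STRATEGY PRESERVE_ORDER)
```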

@dssgabriel
Collaborator

dssgabriel commented Feb 18, 2025

CMake 3.31 seems to have an option where we can prevent libraries from being reordered.

Alternatively we could require CMake 3.31? I believe this is the default version on Ubuntu 24.04 and CMake is pretty easy to install in userspace anyway.
Any reasons for us not to require its most recent version?

@olivier-snl
Collaborator

olivier-snl commented Feb 18, 2025

Alternatively we could require CMake 3.31? I believe this is the default version on Ubuntu 24.04 and CMake is pretty easy to install in userspace anyway.
Any reasons for us not to require its most recent version?

I don't remember who pointed it out, but someone recalled that Kokkos doesn't require such a new version, and perhaps we wanted to align with Kokkos there. I'm not endorsing that, but I wish I remembered who had that concern.

@cwpearson
Collaborator Author

I'd prefer to stick with what Kokkos requires unless it's an absolute crisis.

Signed-off-by: Carl Pearson <cwpears@sandia.gov>