improve reduction tests and fix reduction #139
base: develop
Conversation
Using the workflow described in the Testing section of our docs.
Environment:
Force-pushed from fec0675 to 5d48ec8
Ah, this is an intermittent failure on my machine.
I think something funky is/was going on in our test framework. If I break them out into separate binaries, all the tests pass under OpenMPI, but some of the Send/Recv tests fail under MPICH. I wonder if something is subtly wrong with some of our implementation and it was getting MPI into a bad state.
@dssgabriel can you try again? We were missing a fence in one of our send implementations, and I think it was getting MPI into a bad state. To help avoid this kind of problem in the future, I broke categories of tests out into different binaries. However, on my machine, with CUDA 12.2 and OpenMPI 5.0.6 / MPICH 4.2.0, all the tests pass now regardless of whether they're in a single binary or multiple binaries.
I also discovered that GPU-aware Send/Recv seemed to be broken with an MPICH that used ch4 + libfabric. It works under MPICH configured with ch4 + ucx.
Sure @cwpearson, checking out the latest push now 👍
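For context on the missing fence, here is a minimal sketch of the kind of synchronization being described, assuming a Kokkos + MPI setup (the function name and View type are illustrative, not the project's actual API):

```c++
#include <Kokkos_Core.hpp>
#include <mpi.h>

// If the send buffer is filled by an asynchronous kernel, MPI can read it
// before the kernel finishes unless we fence (synchronize) first.
void send_view(Kokkos::View<double*> buf, int dest, int tag, MPI_Comm comm) {
  Kokkos::fence();  // make sure any kernels writing `buf` have completed
  MPI_Send(buf.data(), static_cast<int>(buf.extent(0)), MPI_DOUBLE, dest, tag,
           comm);
}
```

Without such a fence, failures tend to show up intermittently because they depend on whether the kernel happens to finish before MPI touches the buffer.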
All tests are passing with OpenMPI 5.0.6 + CUDA 12.6 ✔️ However, I have some failing unit tests with MPICH 4.2.3 (built with libfabric; see the `spack spec` below).
Test 5 output:
Test 6 output: I am not sure why it reports that this test is failing, as 32/48 pass correctly and the remaining 16 are just skipped.
Perf tests are also failing with MPICH:
`spack spec` for MPICH on my machine:
That's kind of similar to what I was seeing.
This reminds me of the sorts of things that were going wrong when Kokkos enabled async CUDA malloc by default, which was reverted in 4.5.00.
That error is in the contiguous case, where we really don't do anything beyond calling MPI directly on the raw pointer.
I can reproduce the MPICH problems in pure CUDA + MPI:

main.cu:

```c++
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <iostream>
#include <cstdint>      // std::uintptr_t, int32_t, int64_t
#include <cstddef>      // std::size_t
#include <type_traits>  // std::is_same_v, std::is_void_v
/*
Author: Carl Pearson (cwpears@sandia.gov)
Basically a send/recv test with two ranks.
Send-side sets buf(i) = i
Recv side checks that that's what it gets.
Optionally, the buffers can be offset from the start of the allocation by a configurable
alignment amount, i.e. the allocated buffer is larger, and we don't use offset 0 as the
beginning of the send/recv buffer
There's a bit of asymmetry that shouldn't matter: both ranks allocate the recv buffer,
only the send side allocates the send buffer.
The reason this is done is because the MPICH errors seem sensitive to how many
CUDA allocations are in the code, and this triggers the error case.
In MPICH 4.2.3 + ofi, we get some errors:
* With a 0 alignment offset, eventually the recv side gets garbage.
* With a 128 alignment offset, we get IPC handle mapping errors.
This works in Open MPI 5.0.5
*/
// Macro to check for CUDA errors
#define CUDA(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
std::cerr << "CUDA error in file '" << __FILE__ \
<< "' in line " << __LINE__ << ": " \
<< cudaGetErrorString(err) << " (" << err << ")" \
<< std::endl; \
exit(EXIT_FAILURE); \
} \
} while (0)
// a(i) = i
template <typename Scalar>
__global__ void init_array(Scalar* a, int sz) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < sz) {
a[i] = Scalar(i);
}
}
// check a(i) = i
template <typename Scalar>
__global__ void check_array(const Scalar* a, int sz, int* errs) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < sz && a[i] != Scalar(i)) {
atomicAdd(errs, 1);
}
}
// get a builtin MPI_Datatype for int32_t, int64_t, float
template <typename Scalar>
MPI_Datatype mpi_type() {
if constexpr (std::is_same_v<Scalar, int32_t>) {
return MPI_INT;
} else if constexpr (std::is_same_v<Scalar, int64_t>) {
return MPI_LONG_LONG;
} else if constexpr (std::is_same_v<Scalar, float>) {
return MPI_FLOAT;
} else {
static_assert(std::is_void_v<Scalar>, "unsupported type");
}
}
// if alignment is 0, return ptr
// else, return the next aligned version of ptr+1
void* align_next(void* ptr, std::size_t alignment) {
if (0 == alignment) return ptr;
std::uintptr_t p = reinterpret_cast<std::uintptr_t>(ptr);
// would be p + alignment - 1 if we weren't getting the next one
std::uintptr_t aligned_p = (p + alignment) & ~(alignment - 1);
return reinterpret_cast<void*>(aligned_p);
}
template <typename Scalar>
void run_test(int num_elements, int alignment, bool use_ssend) {
// get a string name of the Scalar type
const char *name;
if constexpr (std::is_same_v<Scalar, int32_t>) {
name = "int32_t";
} else if constexpr (std::is_same_v<Scalar, float>) {
name = "float";
} else if constexpr (std::is_same_v<Scalar, int64_t>) {
name = "int64_t";
} else {
static_assert(std::is_void_v<Scalar>, "unsupported type");
}
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (2 != size) {
MPI_Abort(MPI_COMM_WORLD, 1);
}
if (0 == rank) {
std::cerr << __FILE__ << ":" << __LINE__ << " test: " << num_elements << " " << name << "\n";
}
Scalar* d_recv_buf;
int* d_errs;
size_t buffer_size = num_elements * sizeof(Scalar) + alignment;
CUDA(cudaMalloc(&d_recv_buf, buffer_size));
CUDA(cudaMalloc(&d_errs, sizeof(int)));
CUDA(cudaMemset(d_errs, 0, sizeof(int)));
Scalar* recv_buf = reinterpret_cast<Scalar*>(align_next(d_recv_buf, alignment));
if (rank == 0) {
Scalar* d_send_buf;
CUDA(cudaMalloc(&d_send_buf, buffer_size));
Scalar* send_buf = reinterpret_cast<Scalar*>(align_next(d_send_buf, alignment));
init_array<<<(num_elements + 255) / 256, 256>>>(send_buf, num_elements);
CUDA(cudaDeviceSynchronize());
std::cerr << __FILE__ << ":" << __LINE__ << " send: " << d_send_buf << " " << send_buf << "\n";
if (use_ssend) {
MPI_Ssend(send_buf, num_elements, mpi_type<Scalar>(), 1, 0, MPI_COMM_WORLD);
} else {
MPI_Send(send_buf, num_elements, mpi_type<Scalar>(), 1, 0, MPI_COMM_WORLD);
}
CUDA(cudaFree(d_send_buf));
} else if (rank == 1) {
int h_errs = 0;
std::cerr << __FILE__ << ":" << __LINE__ << " recv: " << d_recv_buf << " " << recv_buf << "\n";
MPI_Recv(recv_buf, num_elements, mpi_type<Scalar>(), 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
check_array<<<(num_elements + 255) / 256, 256>>>(recv_buf, num_elements, d_errs);
CUDA(cudaMemcpy(&h_errs, d_errs, sizeof(int), cudaMemcpyDeviceToHost));
if (h_errs) {
std::cerr << __FILE__ << ":" << __LINE__ << " h_errs=" << h_errs << "\n";
MPI_Abort(MPI_COMM_WORLD, 1);
}
}
CUDA(cudaFree(d_recv_buf));
CUDA(cudaFree(d_errs));
}
template<typename Scalar>
void run_test() {
const int alignment = 128;
for (size_t _ : {0,1,2}) {
for (size_t n : {113, 16, 8, 4, 2, 1}) {
MPI_Barrier(MPI_COMM_WORLD);
run_test<Scalar>(n, alignment, false /* MPI_Send */);
MPI_Barrier(MPI_COMM_WORLD);
}
// for (size_t n : {113, 16, 8, 4, 2, 1}) {
// run_test<Scalar>(n, alignment, true /*MPI_Ssend*/);
// }
}
}
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
run_test<int>();
run_test<int64_t>();
run_test<float>();
MPI_Finalize();
return 0;
}
```

CMakeLists.txt:

```cmake
cmake_minimum_required(VERSION 3.23 FATAL_ERROR)
project(MpichOfi LANGUAGES CXX CUDA)
find_package(MPI REQUIRED)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)
enable_language(CUDA)
add_executable(main main.cu)
target_link_libraries(main MPI::MPI_CXX)
set_target_properties(main PROPERTIES
CUDA_ARCHITECTURES "86"
)
```

@dssgabriel can you reproduce this CUDA+MPI case as well with that same MPICH you used before?
@cwpearson I can reproduce and obtain the same crash output as you do (using the same CUDA + MPICH configuration I tested last week):
Full output:
Launch command is
This is a possible bug in MPICH: pmodels/mpich#7304. Either way, that issue may help shed some light.
As Ken Raffenetti mentioned in the answer to your issue, the failure goes away if I run with the environment variable he suggested (which disables the IPC cache). I am not sure what this means for us and what needs fixing, though... Output:
I don't have any particular knowledge about MPICH's development, but if they do conclude it's a bug, they might be able to point us at an older version of MPICH that doesn't have it for us to test again. This will also be relevant for standing up GPU CI builds. We could also test with that environment variable set, but disabling the IPC cache is probably not what you want if you care about latency.
In any case, let me tidy up the CUDA/MPI reproducer into a unit test, and we can maybe get this merged in.
Sure! I might only approve tomorrow morning though, depending on when this is ready (almost 11pm here 😆)
Okay, the final conclusion is a combination of MPICH requiring dynamic linking against cudart, and also needing to link MPI before CUDA. I've incorporated both of those changes into this PR.
# gh/pmodels/mpich#7304: MPICH 4.2.3 and 4.3.0 (and maybe earlier) rely on being
# linked before CUDA and friends, so link MPI before Kokkos.
# TODO_CMAKE: in CMake 3.31, we can use LINK_LIBRARIES_STRATEGY to explicitly ask CMake not to
# reorder how libraries are linked. In the meantime, we have to just hope (?).
Is this behavior MPICH-specific or does it also apply to Open MPI?
Open MPI worked either way in my tests.
Looks good to me! Everything is passing on my end.
Do we want to merge #130 first, or should we drop it and merge this directly instead (as it includes the changes)?
Is the order important with the dynamic linking? I am not totally sure that we can control the order in which the dynamic loader will proceed.
I would rather merge the two PRs.
I think I can split this one up into some individual pieces too.
Unfortunately, it seems the order matters. This is discussed somewhat in pmodels/mpich#7304. CMake 3.31 seems to have an option where we can prevent libraries from being reordered. Prior to that, CMake generally seems to respect the order in which you attach link options to the target, but as you identify, I don't think it promises to do so.
Alternatively, we could require CMake 3.31? I believe this is the default version on Ubuntu 24.04, and CMake is pretty easy to install in userspace anyway.
I don't remember who pointed it out, but someone had recalled that Kokkos doesn't require such a new version, and perhaps we wanted to align with Kokkos there. I'm not endorsing that, but I wish I remembered who had that concern.
I'd prefer to stick with what Kokkos requires unless it's an absolute crisis. |
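For reference, a minimal sketch of how those two conclusions (MPI linked before Kokkos/CUDA, and a dynamically linked cudart) could look in CMake; the project, target, and source names here are made up for illustration, and this is not the PR's actual build code:

```cmake
cmake_minimum_required(VERSION 3.23 FATAL_ERROR)
project(LinkOrderSketch LANGUAGES CXX CUDA)

find_package(MPI REQUIRED)
find_package(Kokkos REQUIRED)

add_executable(app app.cu)

# List MPI before Kokkos (which pulls in CUDA) so MPICH ends up earlier on the
# link line; CMake >= 3.31 can pin this ordering via the LINK_LIBRARIES_STRATEGY
# target property, while earlier versions generally preserve the order given here.
target_link_libraries(app PRIVATE MPI::MPI_CXX Kokkos::kokkos)

# Link the CUDA runtime dynamically rather than statically, since MPICH expects
# a shared cudart.
set_target_properties(app PROPERTIES CUDA_RUNTIME_LIBRARY Shared)
```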
Force-pushed from 1bc50a6 to f5be308
Force-pushed from f5be308 to 8c958bd
Breaking this up into multiple PRs