Description
KokkosComm currently exposes communication functions with an ExecSpace
parameter whose purpose is not semantically clear, neither in the code nor in the documentation, and that doesn't map to the behavior we expect.
The documentation for, e.g. KokkosComm::isend
, states:
template<CommunicationMode SendMode, KokkosExecutionSpace ExecSpace, KokkosView SendView> Req isend(const ExecSpace &space, const SendView &sv, int dest, int tag, MPI_Comm comm)Parameters:
space
: The execution space to operate in
...Template Parameters:
ExecSpace
: A Kokkos execution space to operate
...
This lets users rightfully think that the space
parameter is used for specifying in which execution space the communication has to happen, not the one they must sync with for their communication to be correctly processed.
However, the actual implementation of KokkosComm::isend
(for contiguous views that don't require packing) does something like:
space.fence(); // can't issue isend until work in space is complete
MPI_Isend(/*...*/);
Why can't we submit more work in the space? What does this fence has to do with the execution space in which the communication operates?
Let's demonstrate these unclear semantics with a code example:
// Partition the execution space into two
auto instances = Kokkos::partition_space(space, 1, 1);
auto compute_space = instances[0];
auto comm_space = instances[1];
// Dispatch some parallel work that prepares our data `v` on the compute exec space
Kokkos::parallel_for(Kokkos::RangePolicy(compute_space, 0, v.extent(0)), [=](int const i) {
v(i) = i;
});
// Send the prepared data using the communication exec space
KokkosComm::isend(comm_space, v, dest, tag, comm).wait(); // WARNING: `v` may not have been ready!
In this snippet, we split an execution space into two. Because we're specifying the compute_space
execution space to the parallel_for
, it is asynchronous and we must fence on it so that v
is done initializing before calling KokkosComm::isend
.
However, the implementation of the latter fences on the given execution space (here, comm_space
) before doing the actual send. In this example, this is the wrong space to fence on, and the fence is useless.
Passing an execution space can be useful — e.g. for specifying where to pack a non-contiguous view, or use a particular CUDA stream once we have the NCCL backend — but it shouldn't be just for us to manually fence on an execution space that may have nothing to do with the view we operate on.
I propose that we:
- Clarify the purpose and semantics of the execution space parameter;
- Explain to users that is up to them to guarantee that their data is ready to be used by KokkosComm (i.e. in the example above, they must call
compute_space.fence()
themselves before calling us); - Remove the call to
space.fence()
from our functions that take an execution space (in the contiguous case).
Given that we would not fence on the execution space anymore, the plans
proposed in #100 won't be needed, as we remove all pointless fences from our implementations.
Activity