Description
We need to design and integrate an efficient way to detect an error state in a CUDA kernel and capture it on the CPU. In particular, we need to detect when a particle has more neighbours than the maximum allowed, replacing the current in-kernel assertion, which leaves the CUDA context in an invalid state and requires a full context reset.
Additionally, the strategy should be compatible with CUDA graphs.
This is related to PR #70.
The main difficulty here is that there is no way to communicate information between a kernel and the CPU that does not involve a synchronization barrier and a memory copy.
I think we should approach this the way native CUDA error reporting does, by building an error-checking function into the interface that is allowed to synchronize and memcpy.
The class building the list could own a device array/value storing error states (maybe an enum, or a simple integer). The function building the neighbour list would then atomically set this error state instead of triggering the assertion above.
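A minimal sketch of that idea, not the actual NNPOps code: the kernel is a toy stand-in that receives each particle's neighbour count directly, and every name here (`NeighbourError`, `buildListSketch`, `fetchAndResetError`, `runSketch`) is hypothetical. CUDA return codes are left unchecked for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical error codes; the proposal only says "an enum, or a simple integer".
enum NeighbourError : int { ERR_NONE = 0, ERR_TOO_MANY_NEIGHBOURS = 1 };

// Toy stand-in for the list-building kernel: one thread per particle, with the
// neighbour count supplied as input instead of being computed.
__global__ void buildListSketch(const int* numNeighbours, int numParticles,
                                int maxNumNeighbours, int* errorState) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;
    if (numNeighbours[i] > maxNumNeighbours) {
        // Record the failure instead of asserting, so the context stays valid.
        atomicMax(errorState, static_cast<int>(ERR_TOO_MANY_NEIGHBOURS));
        return;
    }
    // ... this particle's neighbours would be written to the list here ...
}

// Host-side check, allowed to synchronize and memcpy. Returns the stored
// error code and clears it on the device if one was set.
int fetchAndResetError(int* errorState, cudaStream_t stream = 0) {
    int host = ERR_NONE;
    cudaMemcpyAsync(&host, errorState, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    if (host != ERR_NONE) cudaMemsetAsync(errorState, 0, sizeof(int), stream);
    return host;
}

// Small driver for the sketch: upload counts, launch, query the error state.
int runSketch(const std::vector<int>& counts, int maxNumNeighbours) {
    const int n = static_cast<int>(counts.size());
    int* dCounts = nullptr;
    int* dError = nullptr;
    cudaMalloc(&dCounts, n * sizeof(int));
    cudaMalloc(&dError, sizeof(int));
    cudaMemcpy(dCounts, counts.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dError, 0, sizeof(int));
    buildListSketch<<<(n + 127) / 128, 128>>>(dCounts, n, maxNumNeighbours, dError);
    const int error = fetchAndResetError(dError);
    cudaFree(dCounts);
    cudaFree(dError);
    return error;
}
```

Keeping the flag in a device allocation owned by the list-building class means the kernel itself never needs the host. Only the explicit check pays for the synchronization and the copy.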
Then, checking this error state on the CPU should be delayed as much as possible. For instance, before constructing a CUDA graph, a series of error-checking calls to the code in NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu (lines 102 to 106 in 8b2d427) could be made with increasing max_num_neighbours to determine an upper bound for it. The graph is then constructed in such a way that this error state is no longer checked automatically.
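The probing step could look roughly like the sketch below. It assumes a doubling search, which is one possible reading of "increasing", and uses a toy kernel in place of the real list builder. All names (`buildListChecked`, `probeFits`, `findUpperBound`, `captureUnchecked`, `boundForCounts`) are hypothetical, and CUDA return codes are left unchecked for brevity. `cudaGraphInstantiateWithFlags` requires CUDA 11.4 or newer.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Toy stand-in for the list-building kernel: sets the flag on overflow.
__global__ void buildListChecked(const int* numNeighbours, int numParticles,
                                 int maxNumNeighbours, int* errorState) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numParticles && numNeighbours[i] > maxNumNeighbours)
        atomicExch(errorState, 1);
}

// Eager probe: launch once, then synchronize and read the flag back.
bool probeFits(const int* dCounts, int n, int maxNumNeighbours,
               int* dError, cudaStream_t stream) {
    cudaMemsetAsync(dError, 0, sizeof(int), stream);
    buildListChecked<<<(n + 127) / 128, 128, 0, stream>>>(dCounts, n, maxNumNeighbours, dError);
    int hError = 0;
    cudaMemcpyAsync(&hError, dError, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    return hError == 0;
}

// Double max_num_neighbours until the probe reports no error.
int findUpperBound(const int* dCounts, int n, int* dError, cudaStream_t stream) {
    int maxNumNeighbours = 1;
    while (!probeFits(dCounts, n, maxNumNeighbours, dError, stream))
        maxNumNeighbours *= 2;
    return maxNumNeighbours;
}

// Capture the launch into a graph; no flag readback is recorded, so replays
// never synchronize to check the error state.
cudaGraphExec_t captureUnchecked(const int* dCounts, int n, int maxNumNeighbours,
                                 int* dError, cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    buildListChecked<<<(n + 127) / 128, 128, 0, stream>>>(dCounts, n, maxNumNeighbours, dError);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    cudaGraphDestroy(graph);
    return exec;
}

// Driver for the sketch: probe for a bound, then build and replay the graph.
int boundForCounts(const std::vector<int>& counts) {
    const int n = static_cast<int>(counts.size());
    int* dCounts = nullptr;
    int* dError = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);  // capture needs a non-default stream
    cudaMalloc(&dCounts, n * sizeof(int));
    cudaMalloc(&dError, sizeof(int));
    cudaMemcpy(dCounts, counts.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    const int bound = findUpperBound(dCounts, n, dError, stream);
    cudaGraphExec_t exec = captureUnchecked(dCounts, n, bound, dError, stream);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
    cudaGraphExecDestroy(exec);
    cudaFree(dCounts);
    cudaFree(dError);
    cudaStreamDestroy(stream);
    return bound;
}
```

The kernel inside the graph still writes the flag, so a caller could read it back by hand at a convenient point. Nothing in the graph itself does so.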
This of course has the downside that errors would go silent during a simulation, with the code crashing in an uncontrolled way.