Efficient error reporting in CUDA

We need to design and integrate an efficient way to detect an error state in a CUDA kernel and report it to the CPU. In particular, we need to detect when some particle has more neighbours than the maximum allowed, replacing this assertion:
https://github.com/openmm/NNPOps/blob/8b2d427888e548dbb655b6233060e3e8038e92c2/src/pytorch/neighbors/getNeighborPairsCUDA.cu#L67
which currently leaves the CUDA context in an invalid state, requiring a full context reset.

Additionally, the strategy should be compatible with CUDA graphs.
This is related to this PR: https://github.com/openmm/NNPOps/pull/70

The main difficulty here is that there is no way to communicate information between a kernel and the CPU that does not involve a synchronization barrier and a memory copy.

I think we should follow the same approach as native CUDA error reporting: build an error-checking function into the interface that is allowed to synchronize and memcpy.
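As a rough illustration of such an error-checking function, here is a minimal sketch. The name `fetchAndClearErrorFlag` and the single-integer flag are assumptions made for this example and are not part of NNPOps. The helper is the only place that synchronizes and copies, so it would have to be called outside any CUDA graph capture.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an NNPOps function): copies the device-side error
// flag to the host, resets it on the device, and waits for the stream.
// Returns the flag value that was stored on the device (0 means "no error").
inline int fetchAndClearErrorFlag(int* d_errorFlag, cudaStream_t stream = 0) {
    int h_flag = 0;
    // Both operations are enqueued on the same stream, so the reset happens
    // only after the copy has read the flag.
    cudaMemcpyAsync(&h_flag, d_errorFlag, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaMemsetAsync(d_errorFlag, 0, sizeof(int), stream);
    // The synchronization cost is paid here, and only when the caller asks.
    const cudaError_t status = cudaStreamSynchronize(stream);
    if (status != cudaSuccess) {
        throw std::runtime_error(std::string("CUDA failure: ") + cudaGetErrorString(status));
    }
    return h_flag;
}
```

Keeping the check in a separate call leaves the decision of when to pay for the synchronization to the caller, much like `cudaGetLastError` does for native errors.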

The class building the list, here
https://github.com/openmm/NNPOps/blob/8b2d427888e548dbb655b6233060e3e8038e92c2/src/pytorch/neighbors/getNeighborPairsCUDA.cu#L100
could own a device array or value storing error states (maybe an enum, or a simple integer). The function building the neighbour list would then atomically set this error state instead of triggering the assertion above.
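The idea could look roughly like the sketch below. All names here (`NeighborError`, `ErrorState`, `flagOverflow`) are hypothetical, and the kernel only checks precomputed per-particle counts; the real NNPOps kernel is structured differently. The point is just that the kernel records the error with an atomic write and keeps running, so the context stays valid.

```cuda
#include <cuda_runtime.h>

// Hypothetical error codes, for illustration only.
enum NeighborError : int { NEIGHBOR_OK = 0, NEIGHBOR_TOO_MANY = 1 };

// Hypothetical owner of the device-side error flag; the class building the
// neighbour list could hold one of these as a member.
struct ErrorState {
    int* d_flag = nullptr;
    ErrorState() {
        cudaMalloc(&d_flag, sizeof(int));
        cudaMemset(d_flag, 0, sizeof(int));
    }
    ~ErrorState() { cudaFree(d_flag); }
    ErrorState(const ErrorState&) = delete;
    ErrorState& operator=(const ErrorState&) = delete;
};

// One thread per particle. Instead of assert(), a thread that sees too many
// neighbours records the error code and returns normally.
__global__ void flagOverflow(const int* neighborCounts, int numParticles,
                             int maxNumNeighbors, int* errorFlag) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;
    if (neighborCounts[i] > maxNumNeighbors) {
        // atomicMax keeps the largest code if several threads report at once.
        atomicMax(errorFlag, static_cast<int>(NEIGHBOR_TOO_MANY));
    }
}
```

With more error codes, `atomicMax` gives the highest-numbered one priority; `atomicOr` on a bit mask would be an alternative if all errors should be preserved.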

Checking this error state on the CPU should then be delayed as much as possible. For instance, before constructing a CUDA graph, a series of error-checking calls to
https://github.com/openmm/NNPOps/blob/8b2d427888e548dbb655b6233060e3e8038e92c2/src/pytorch/neighbors/getNeighborPairsCUDA.cu#L102-L106
with increasing max_num_neighbors could be made to determine an upper bound for it. The graph is then constructed in such a way that this error state is no longer checked automatically.
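The probing step before graph construction might be sketched as follows. `findMaxNumNeighbors` and the `tryBuild` callable are assumptions for this example: `tryBuild(n)` stands for "build the neighbour list with max_num_neighbors = n, synchronize, and return the error flag (0 if no error)". Doubling the bound on each failure is just one possible schedule.

```cuda
#include <stdexcept>

// Hypothetical probing loop, run once before capturing a CUDA graph.
// Assumes hardLimit is small enough that doubling cannot overflow an int.
template <class TryBuild>
int findMaxNumNeighbors(TryBuild tryBuild, int initialGuess, int hardLimit) {
    if (initialGuess <= 0) {
        throw std::invalid_argument("initialGuess must be positive");
    }
    for (int maxNumNeighbors = initialGuess; maxNumNeighbors <= hardLimit;
         maxNumNeighbors *= 2) {
        // Each attempt pays for one synchronization; that is acceptable here
        // because it happens outside the captured graph.
        if (tryBuild(maxNumNeighbors) == 0) {
            return maxNumNeighbors;
        }
    }
    throw std::runtime_error("no max_num_neighbors up to hardLimit was sufficient");
}
```

The returned value is only an upper bound for the configurations seen during probing; it does not guarantee that later simulation steps stay below it.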

This of course has the downside that errors would go undetected during a simulation, with the code crashing in an uncontrolled way.


	static tensor_list forward(AutogradContext* ctx,
	                           const Tensor& positions,
	                           const Scalar& cutoff,
	                           const Scalar& max_num_neighbors,
	                           const Tensor& box_vectors) {
