getNeighborPairs() supports periodic boundary conditions #70
Conversation
I added the CUDA implementation. It mostly works, but …
Since I added …
Just return an empty tensor.
Thanks! I made the change and the test now passes. When I run the complete test suite, though, another failure appears.
After a while of debugging, I figured out it doesn't really have anything to do with that test. It's actually caused by …
Any suggestions about what to do with it? The obvious solution is not to run that test on CUDA.
One option is to call cudaDeviceReset().
If PyTorch doesn't provide a safe way to reset the device, going behind its back and calling a CUDA function directly is likely to cause errors as well: it invalidates all of PyTorch's existing handles to resources on the GPU, but PyTorch doesn't know they've been invalidated. For the moment I've limited that test to the CPU. It's not ideal, but I don't have a better solution.
Each test is run in a separate process, so after the device is reset, PyTorch will go through its normal initialization for the next test.
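The process-per-test isolation described above can be sketched in plain Python. The helper below is purely illustrative and is not the project's actual test harness; it just shows why a child process that corrupts CUDA state (e.g. by resetting the device) can't take the parent down with it.

```python
import subprocess
import sys

def run_isolated(code: str) -> int:
    """Run `code` in a fresh Python interpreter and report its exit status.

    Anything that poisons process-wide state, such as resetting the CUDA
    device, only affects the child; the parent's state stays untouched,
    and the next test starts from a clean interpreter.
    """
    return subprocess.run([sys.executable, "-c", code]).returncode

if __name__ == "__main__":
    assert run_isolated("assert 1 + 1 == 2") == 0   # a passing test
    assert run_isolated("raise SystemExit(1)") == 1  # a failing test, parent survives
    print("ok")
```

A real test runner (e.g. pytest with per-test subprocesses) applies the same idea; the point is only that device resets stay confined to one process.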
You're welcome to see if you can figure out a way to get it to work. But in the meantime, let's not hold up a useful feature over a broken unit test that isn't even related to the new feature.
OK! Let's disable the test for now. What else is missing to finish this PR?
It's all ready for review.
Great! I'll look at it.
I think all the requirements on box_vectors can be checked. It would prevent some invalid simulations, and the performance impact would be minimal or none (if CUDA graphs are used).
```diff
@@ -25,6 +28,11 @@ static tuple<Tensor, Tensor, Tensor> forward(const Tensor& positions,
     TORCH_CHECK(cutoff.to<double>() > 0, "Expected \"cutoff\" to be positive");

     if (box_vectors.size(0) != 0) {
```
It could check here whether all the requirements are satisfied.
I added the checks in the CPU version.
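Those CPU-side checks might look something like the following standalone sketch. The specific requirements here are assumptions modeled on common periodic-box conventions (a reduced-form, lower-triangular box with bounded off-diagonal elements and a cutoff no larger than half the shortest box width), not necessarily the PR's exact rules:

```python
def check_box_vectors(box, cutoff):
    """Illustrative validity checks for a 3x3 periodic box (row vectors a, b, c).

    Assumed requirements (hypothetical, for illustration):
      - reduced form: a lies along x, b lies in the xy-plane
      - positive diagonal elements
      - off-diagonal elements small enough for minimum-image wrapping
      - cutoff at most half the shortest box width
    """
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = box
    if not (ay == 0 and az == 0 and bz == 0):
        raise ValueError("box vectors must be in reduced (lower-triangular) form")
    if not (ax > 0 and by > 0 and cz > 0):
        raise ValueError("diagonal elements of the box must be positive")
    if not (abs(bx) <= 0.5 * ax and abs(cx) <= 0.5 * ax and abs(cy) <= 0.5 * by):
        raise ValueError("off-diagonal box elements are too large")
    if cutoff > 0.5 * min(ax, by, cz):
        raise ValueError("cutoff is too large for this box")

# A cubic 10x10x10 box with a cutoff of 3 satisfies all the checks.
check_box_vectors([[10.0, 0, 0], [0, 10.0, 0], [0, 0, 10.0]], cutoff=3.0)
```

Because these conditions only involve shapes and scalars already on the host when run on the CPU path, they cost essentially nothing there.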
```diff
@@ -100,6 +114,12 @@ public:
     TORCH_CHECK(max_num_neighbors_ > 0 || max_num_neighbors_ == -1,
                 "Expected \"max_num_neighbors\" to be positive or equal to -1");

     const bool use_periodic = (box_vectors.size(0) != 0);
     if (use_periodic) {
```
It could check here too whether all the requirements are satisfied.
Is there any way to check it efficiently? The shape of the tensor is known to the CPU, but the values of its elements are stored on the GPU. If we want to throw an exception based on the values of the elements, that will require device-to-host data transfers and add significant latency every time it's called.
Yes, you are right, it won't be efficient, so the checks have to be done on the GPU. The only problem is how to abort a kernel elegantly. assert does the job, but afterwards the GPU needs a reset. I don't see anything better in the CUDA docs.
GPU asserts aren't really effective as a way of catching user errors. In addition to the fact that you can't recover from them and have to reset the whole device, they don't provide any useful information to the user. They just get a cryptic "device-side assert triggered" message that tells them nothing about what the problem was or how to fix it. The user will usually conclude that your library is broken.
I agree, we need to choose between two evils:
- Users get cryptic messages and need to reset the GPU.
- Users get incorrect results silently.
@RaulPPelaez how do you handle kernel errors in your code? We need something to be: …
AFAIK there is no clean way to assert with CUDA. As you mentioned, a device-side assert leaves the CUDA context in an unusable state. If you think about it, this is just how errors work in CUDA: you need to manually synchronize and then query the current error state.
What is the intended way of using this functionality? If something like that is the case here, an extra parameter could be passed to choose whether or not to synchronize and check for a tooManyNeighbors error flag, e.g. to find the maximum number of neighbors as a precomputation. When constructing the CUDA graph, this check would be omitted.
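That two-pass idea can be sketched in pure Python. The function name, the check_errors parameter, and the flag below are all hypothetical; a real implementation would keep the flag on the device and only pay for a synchronization when checking is requested:

```python
def count_neighbors(positions, cutoff, max_num_neighbors, check_errors=True):
    """Count neighbors within `cutoff` and record, rather than assert,
    whether any particle exceeded `max_num_neighbors`.

    With check_errors=False (e.g. while a CUDA graph is being captured),
    the host never stops to inspect the flag, so the call stays
    synchronization-free. With check_errors=True the flag is examined
    once after the pass and turned into a normal, descriptive exception.
    """
    counts = [0] * len(positions)
    too_many_neighbors = False  # on a GPU this would be a device-side flag
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            if i != j and sum((a - b) ** 2 for a, b in zip(pi, pj)) <= cutoff ** 2:
                counts[i] += 1
        if counts[i] > max_num_neighbors:
            too_many_neighbors = True  # keep going; report after the pass
    if check_errors and too_many_neighbors:
        raise RuntimeError(
            "Some particle has more than max_num_neighbors neighbors; "
            "rerun with a larger max_num_neighbors")
    return counts

# Precomputation pass: measure the real neighbor counts so later calls
# (and any captured CUDA graph) can size their buffers safely.
counts = count_neighbors([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],
                         cutoff=1.0, max_num_neighbors=8)  # → [1, 1]
```

The user-facing benefit over a device assert is that the failure surfaces as an ordinary exception with an actionable message, and the context stays usable.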
That sounds like a good approach. Let's merge this now and add error checking along those lines in a separate PR. That's going to require significant design work to figure out an efficient mechanism for error reporting.
@peastman let's merge this! @RaulPPelaez could you open a dedicated issue to discuss and design the error check?
I've implemented the CPU version but not the CUDA version so far. Please take a look and see if the API and implementation look OK.