GPU performance improvements #488
Conversation
Looks good to me, just left some minor comments in the files.
int *d_idxnupts = d_plan->idxnupts;
thrust::sequence(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M);
RETURN_IF_CUDA_ERROR
thrust::sort(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M,
Will the thrust sort also be faster than the current bin sort in 2D and 3D? Though sorting only takes a few percent of the time in 2D and 3D.
One thing to note is that thrust sort (which most likely calls cub sort under the hood) creates a workspace during sorting, so GPU memory may show a small spike, while the current bin sort's memory is all managed by ourselves.
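For context, here is a minimal sketch of the index-sort pattern from the diff above, with a hypothetical 1D bin comparator standing in for the real one (this is not the PR's code). The transient spike mentioned in the comment comes from `thrust::sort` allocating its own temporary device workspace internally.

```cpp
#include <cuda_runtime.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Hypothetical sketch: sort point indices by a 1D bin on a given stream.
// Requires nvcc --extended-lambda for the __device__ lambda.
void sort_indices_by_bin(int *d_idxnupts, const float *d_x, int M,
                         float bin_width, cudaStream_t stream) {
  // fill d_idxnupts with 0, 1, ..., M-1
  thrust::sequence(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M);
  // reorder indices so points falling in the same bin become contiguous;
  // thrust::sort allocates temporary device storage for this call
  thrust::sort(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M,
               [d_x, bin_width] __device__(int a, int b) {
                 return int(d_x[a] / bin_width) < int(d_x[b] / bin_width);
               });
}
```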
throw std::runtime_error(cudaGetErrorString(err));
}
// use 1/6 of the shared memory for the binsize
shared_mem_per_block /= 6;
Is this 1/6 heuristic from perf-test experiments, or is there some theory behind it?
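For reference, a sketch of what the budget computation above amounts to, assuming the per-block limit is queried with `cudaDeviceGetAttribute` (the actual query and error-handling macro in the PR may differ); the 1/6 factor is taken directly from the diff.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// Sketch only (assumed query, not necessarily the PR's exact code):
// compute the shared-memory budget reserved for the bin-size table.
static std::size_t bin_shared_mem_budget(int device) {
  int shared_mem_per_block = 0;
  cudaError_t err = cudaDeviceGetAttribute(
      &shared_mem_per_block, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
  // use 1/6 of the shared memory for the binsize (factor from the diff above)
  return static_cast<std::size_t>(shared_mem_per_block) / 6;
}
```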
src/cuda/2d/spreadinterp2d.cuh (outdated)
const T *x, const T *y, const cuda_complex<T> *c, cuda_complex<T> *fw, int M, int ns,
int nf1, int nf2, T es_c, T es_beta, T sigma, const int *idxnupts) {
#if ALLOCA_SUPPORTED
auto ker = (T *)alloca(sizeof(T) * ns * 3);
I need to fix the `*3` here.
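For what it's worth, a hedged sketch of what the fix presumably looks like: in the 2D spreader only the x and y kernel slices are needed, so `ns * 2` values should suffice (the `* 3` above looks like a carry-over from the 3D variant). The surrounding kernel skeleton, `MAX_NSPREAD` bound, and fallback branch below are illustrative assumptions, not the actual code.

```cpp
#include <alloca.h> // Linux header; alloca also works in CUDA device code on recent toolkits

// Hypothetical 2D spread kernel skeleton showing only the buffer setup.
template<typename T>
__global__ void spread_2d_sketch(const T *x, const T *y, int M, int ns) {
#if ALLOCA_SUPPORTED
  // dynamic stack allocation sized to the runtime kernel width ns
  auto ker = (T *)alloca(sizeof(T) * ns * 2);
#else
  constexpr int MAX_NSPREAD = 16; // assumed compile-time upper bound on ns
  T ker[2 * MAX_NSPREAD];
#endif
  T *kerx = ker;      // kernel values along x
  T *kery = ker + ns; // kernel values along y
  // ... evaluate kerx/kery for each nonuniform point and accumulate to fw ...
}
```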
What a beast of a PR. LGTM. A lot to take in, but everything seems OK on the surface.
Did you notice any significant improvements from the reduced memory pressure with `alloca`? I don't especially love dealing with VLAs, but it's probably OK here if there's an obvious advantage, especially if `alloca` remains supported in CUDA for future specs.
Sorry, I haven't had a chance to look at this yet. Will go through it tomorrow.
Hi Robert, thanks for the review. `alloca` makes a small difference, but I think it is worth having in, since registers/stack are quite precious on the GPU. We are limited by shared memory more than registers at the moment, so it is not a huge improvement. If it becomes unmaintainable we can pull it out, but NVIDIA will likely not drop support for it.
I added a 1.25 upsampfact unit test since the review; no new features.
Looks great! Thanks for doing this. Just have a few questions and comments here and there.
Good here as far as I'm concerned. Nice work!
Possible improvements to GPU performance are:
#481 summarizes the achieved performance.