GPU performance improvements #488
Conversation
Looks good to me, just left some minor comments in the files.
int *d_idxnupts = d_plan->idxnupts;
thrust::sequence(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M);
RETURN_IF_CUDA_ERROR
thrust::sort(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M,
Will the thrust sort also be faster than the current bin sort in 2D and 3D? Though sorting only takes a few percent of the time in 2D and 3D.
One thing to note is that thrust sort (which most likely calls cub sort under the hood) creates a workspace during sorting, so GPU memory may show a small spike, while the current bin sort's memory is all managed by ourselves.
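For context, here is a minimal sketch of the index-sort pattern from the diff above, with a hypothetical 1D bin comparator standing in for the real one (this is not the PR's code). The transient spike mentioned in the comment comes from `thrust::sort` allocating its own temporary device workspace internally.

```cpp
#include <cuda_runtime.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Hypothetical sketch: sort point indices by a 1D bin on a given stream.
// Requires nvcc --extended-lambda for the __device__ lambda.
void sort_indices_by_bin(int *d_idxnupts, const float *d_x, int M,
                         float bin_width, cudaStream_t stream) {
  // fill d_idxnupts with 0, 1, ..., M-1
  thrust::sequence(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M);
  // reorder indices so points falling in the same bin become contiguous;
  // thrust::sort allocates temporary device storage for this call
  thrust::sort(thrust::cuda::par.on(stream), d_idxnupts, d_idxnupts + M,
               [d_x, bin_width] __device__(int a, int b) {
                 return int(d_x[a] / bin_width) < int(d_x[b] / bin_width);
               });
}
```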
throw std::runtime_error(cudaGetErrorString(err));
}
// use 1/6 of the shared memory for the binsize
shared_mem_per_block /= 6;
Is this 1/6 heuristic from perf-test experiments, or is there some theory behind it?
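For reference, a sketch of what the budget computation above amounts to, assuming the per-block limit is queried with `cudaDeviceGetAttribute` (the actual query and error-handling macro in the PR may differ); the 1/6 factor is taken directly from the diff.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// Sketch only (assumed query, not necessarily the PR's exact code):
// compute the shared-memory budget reserved for the bin-size table.
static std::size_t bin_shared_mem_budget(int device) {
  int shared_mem_per_block = 0;
  cudaError_t err = cudaDeviceGetAttribute(
      &shared_mem_per_block, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
  // use 1/6 of the shared memory for the binsize (factor from the diff above)
  return static_cast<std::size_t>(shared_mem_per_block) / 6;
}
```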
src/cuda/2d/spreadinterp2d.cuh (outdated)
const T *x, const T *y, const cuda_complex<T> *c, cuda_complex<T> *fw, int M, int ns,
int nf1, int nf2, T es_c, T es_beta, T sigma, const int *idxnupts) {
#if ALLOCA_SUPPORTED
auto ker = (T *)alloca(sizeof(T) * ns * 3);
I need to fix the `*3` here.
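For what it's worth, a hedged sketch of what the fix presumably looks like: in the 2D spreader only the x and y kernel slices are needed, so `ns * 2` values should suffice (the `* 3` above looks like a carry-over from the 3D variant). The surrounding kernel skeleton, `MAX_NSPREAD` bound, and fallback branch below are illustrative assumptions, not the actual code.

```cpp
#include <alloca.h> // Linux header; alloca also works in CUDA device code on recent toolkits

// Hypothetical 2D spread kernel skeleton showing only the buffer setup.
template<typename T>
__global__ void spread_2d_sketch(const T *x, const T *y, int M, int ns) {
#if ALLOCA_SUPPORTED
  // dynamic stack allocation sized to the runtime kernel width ns
  auto ker = (T *)alloca(sizeof(T) * ns * 2);
#else
  constexpr int MAX_NSPREAD = 16; // assumed compile-time upper bound on ns
  T ker[2 * MAX_NSPREAD];
#endif
  T *kerx = ker;      // kernel values along x
  T *kery = ker + ns; // kernel values along y
  // ... evaluate kerx/kery for each nonuniform point and accumulate to fw ...
}
```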
What a beast of a PR. LGTM. A lot to take in, but everything seems OK on the surface.
Did you notice any significant improvements from the reduced memory pressure with `alloca`? I don't especially love dealing with VLAs, but it's probably OK here if there's an obvious advantage, especially if `alloca` remains supported in CUDA for future specs.
Sorry, I haven't had a chance to look at this yet. Will go through it tomorrow.
Hi Robert, thanks for the review. `alloca` makes a small difference, but I think it is worth having in, since registers/stack are quite precious on the GPU. We are limited by shared memory more than registers at the moment, so it is not a huge improvement. If it becomes unmaintainable we can pull it out, but NVIDIA will likely not drop support for it.
I added a 1.25 upsampfact unit test since the review; no new features.
Looks great! Thanks for doing this. Just have a few questions and comments here and there.
Good here as far as I'm concerned. Nice work!
Possible improvements to GPU performance are:
#481 summarizes the achieved performance.