Harness CUDA's shared memory to speed up GPU calculations #517
Description
We'll soon have a GPU implementation of the pairwise distance functionality: https://github.com/pystatgen/sgkit/pull/498
Primer on GPU and the Problem
The architecture of a GPU is divided into grids; each grid contains blocks, and each block contains threads, which is where all the calculation happens.
During a calculation, each thread reads from DRAM unless it finds the item in some cache, and reading from DRAM is slow (read: very slow).
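As a frame of reference, here is a minimal Numba CUDA kernel (a sketch only; `double_all` and the launch parameters are illustrative, not sgkit code) showing the hierarchy: the launch configuration picks the number of blocks and threads per block, each thread derives its own index, and each `x[i]` read goes to global (DRAM-backed) memory unless a cache happens to hold it.

```python
from numba import cuda
import numpy as np

@cuda.jit
def double_all(x, out):
    # Absolute index of this thread across the whole grid:
    # blockIdx.x * blockDim.x + threadIdx.x
    i = cuda.grid(1)
    if i < x.shape[0]:       # guard threads past the end of the array
        out[i] = x[i] * 2.0  # this read of x[i] hits global (DRAM-backed) memory

x = np.arange(1_000_000, dtype=np.float32)
out = np.zeros_like(x)
threads_per_block = 128
blocks_per_grid = (x.shape[0] + threads_per_block - 1) // threads_per_block
double_all[blocks_per_grid, threads_per_block](x, out)
```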
Example:
Imagine a situation where we calculate the pairwise distance between the vectors in a 2D chunk:
[
v0,
v1,
v2,
v3,
]
Each of these pairwise calculations is done by a separate thread:
- Thread 1 will calculate (v0, v1)
- Thread 2 will calculate (v0, v2)

Notice that Thread 1 and Thread 2 both load the v0 array from memory, and so on: the same rows are fetched from slow DRAM again and again, once per thread that needs them. The naive kernel sketched below makes this concrete.
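A minimal naive kernel, assuming a Euclidean metric for illustration (`pairwise_naive` is a hypothetical name, not the PR's API), shows where the redundant reads come from:

```python
from numba import cuda
import math

@cuda.jit
def pairwise_naive(vectors, out):
    # One thread per (i, j) pair, laid out on a 2D grid.
    i, j = cuda.grid(2)
    n = vectors.shape[0]
    m = vectors.shape[1]
    if i < n and j < n and i < j:
        acc = 0.0
        for k in range(m):
            # vectors[i, k] is read from global memory, so every thread
            # working on a pair that involves row i re-reads the whole
            # row from DRAM.
            d = vectors[i, k] - vectors[j, k]
            acc += d * d
        out[i, j] = math.sqrt(acc)
```

With an n-by-n grid of threads, row i of `vectors` is re-fetched from DRAM by every one of the roughly n threads whose pair involves i.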
Solution
Numba's CUDA API provides a way to share memory between the threads in a block. This lets the threads in a block load data into shared memory only once and have every thread in the block reuse it from there.
Reference API: https://numba.pydata.org/numba-doc/latest/cuda/memory.html#shared-memory-and-thread-synchronization
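Here is a minimal sketch of the shared-memory variant, again assuming a Euclidean metric, with the vector length fixed as a compile-time constant `M` and illustrative names (`pairwise_shared`, `TPB` are not the PR's API): each block stages its rows into `cuda.shared.array` buffers once, calls `cuda.syncthreads()`, and then every thread in the block computes from the fast on-chip copies instead of DRAM.

```python
from numba import cuda, float32
import math
import numpy as np

TPB = 16  # threads per block per axis (tile size); compile-time constant
M = 64    # vector length; must be known at compile time for shared arrays

@cuda.jit
def pairwise_shared(vectors, out):
    # Per-block buffers in fast on-chip shared memory.
    s_rows = cuda.shared.array(shape=(TPB, M), dtype=float32)
    s_cols = cuda.shared.array(shape=(TPB, M), dtype=float32)

    i, j = cuda.grid(2)
    ti = cuda.threadIdx.x
    tj = cuda.threadIdx.y
    n = vectors.shape[0]

    # Load each needed row from global memory once per block
    # (one designated thread per row), instead of once per thread.
    if i < n and tj == 0:
        for k in range(M):
            s_rows[ti, k] = vectors[i, k]
    if j < n and ti == 0:
        for k in range(M):
            s_cols[tj, k] = vectors[j, k]

    # Make sure all loads are visible before any thread computes.
    cuda.syncthreads()

    if i < n and j < n and i < j:
        acc = 0.0
        for k in range(M):
            d = s_rows[ti, k] - s_cols[tj, k]
            acc += d * d
        out[i, j] = math.sqrt(acc)

# Launch: a 2D grid of blocks covering all (i, j) pairs.
n = 1024
vectors = np.random.rand(n, M).astype(np.float32)
out = np.zeros((n, n), dtype=np.float32)
blocks = ((n + TPB - 1) // TPB, (n + TPB - 1) // TPB)
pairwise_shared[blocks, (TPB, TPB)](vectors, out)
```

With this pattern, row i is fetched from DRAM once per block that needs it rather than once per thread, roughly a factor-of-TPB reduction in global memory reads, which is exactly the reuse the shared-memory API is designed for.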
This exercise could give us a significant speed-up in GPU calculations.