Harness CUDA's shared memory to speed up GPU calculations #517
Description
We'll soon have a GPU implementation of the pairwise distance functionality: https://github.com/pystatgen/sgkit/pull/498
Primer on GPU and the Problem
The architecture of a GPU is divided into grids; each grid contains blocks, and each block contains threads, which is where all the calculation happens.
During a calculation, each thread reads from DRAM unless it finds the item in some cache, and reading from DRAM is slow (read: very slow).
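As a frame of reference, here is a minimal Numba CUDA kernel (a sketch only; `double_all` and the launch parameters are illustrative, not sgkit code) showing the hierarchy: the launch configuration picks the number of blocks and threads per block, each thread derives its own index, and each `x[i]` read goes to global (DRAM-backed) memory unless a cache happens to hold it.

```python
from numba import cuda
import numpy as np

@cuda.jit
def double_all(x, out):
    # Absolute index of this thread across the whole grid:
    # blockIdx.x * blockDim.x + threadIdx.x
    i = cuda.grid(1)
    if i < x.shape[0]:       # guard threads past the end of the array
        out[i] = x[i] * 2.0  # this read of x[i] hits global (DRAM-backed) memory

x = np.arange(1_000_000, dtype=np.float32)
out = np.zeros_like(x)
threads_per_block = 128
blocks_per_grid = (x.shape[0] + threads_per_block - 1) // threads_per_block
double_all[blocks_per_grid, threads_per_block](x, out)
```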
Example:
Imagine a situation where we calculate the pairwise distance between the vectors in a 2D chunk:
[
v0,
v1,
v2,
v3,
]
Each of these pairwise calculations is done by a separate thread:
- Thread 1 will calculate (v0, v1)
- Thread 2 will calculate (v0, v2)

Notice that Thread 1 and Thread 2 both load the v0 array from memory, and so on: the same rows are fetched from slow DRAM again and again, once per thread that needs them. The naive kernel sketched below makes this concrete.
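A minimal naive kernel, assuming a Euclidean metric for illustration (`pairwise_naive` is a hypothetical name, not the PR's API), shows where the redundant reads come from:

```python
from numba import cuda
import math

@cuda.jit
def pairwise_naive(vectors, out):
    # One thread per (i, j) pair, laid out on a 2D grid.
    i, j = cuda.grid(2)
    n = vectors.shape[0]
    m = vectors.shape[1]
    if i < n and j < n and i < j:
        acc = 0.0
        for k in range(m):
            # vectors[i, k] is read from global memory, so every thread
            # working on a pair that involves row i re-reads the whole
            # row from DRAM.
            d = vectors[i, k] - vectors[j, k]
            acc += d * d
        out[i, j] = math.sqrt(acc)
```

With an n-by-n grid of threads, row i of `vectors` is re-fetched from DRAM by every one of the roughly n threads whose pair involves i.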
Solution
Numba's CUDA API provides a way to share memory between the threads in a block. This lets the threads in a block load data into shared memory only once and have every thread in the block reuse it from there.
Reference API: https://numba.pydata.org/numba-doc/latest/cuda/memory.html#shared-memory-and-thread-synchronization
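Here is a minimal sketch of the shared-memory variant, again assuming a Euclidean metric, with the vector length fixed as a compile-time constant `M` and illustrative names (`pairwise_shared`, `TPB` are not the PR's API): each block stages its rows into `cuda.shared.array` buffers once, calls `cuda.syncthreads()`, and then every thread in the block computes from the fast on-chip copies instead of DRAM.

```python
from numba import cuda, float32
import math
import numpy as np

TPB = 16  # threads per block per axis (tile size); compile-time constant
M = 64    # vector length; must be known at compile time for shared arrays

@cuda.jit
def pairwise_shared(vectors, out):
    # Per-block buffers in fast on-chip shared memory.
    s_rows = cuda.shared.array(shape=(TPB, M), dtype=float32)
    s_cols = cuda.shared.array(shape=(TPB, M), dtype=float32)

    i, j = cuda.grid(2)
    ti = cuda.threadIdx.x
    tj = cuda.threadIdx.y
    n = vectors.shape[0]

    # Load each needed row from global memory once per block
    # (one designated thread per row), instead of once per thread.
    if i < n and tj == 0:
        for k in range(M):
            s_rows[ti, k] = vectors[i, k]
    if j < n and ti == 0:
        for k in range(M):
            s_cols[tj, k] = vectors[j, k]

    # Make sure all loads are visible before any thread computes.
    cuda.syncthreads()

    if i < n and j < n and i < j:
        acc = 0.0
        for k in range(M):
            d = s_rows[ti, k] - s_cols[tj, k]
            acc += d * d
        out[i, j] = math.sqrt(acc)

# Launch: a 2D grid of blocks covering all (i, j) pairs.
n = 1024
vectors = np.random.rand(n, M).astype(np.float32)
out = np.zeros((n, n), dtype=np.float32)
blocks = ((n + TPB - 1) // TPB, (n + TPB - 1) // TPB)
pairwise_shared[blocks, (TPB, TPB)](vectors, out)
```

With this pattern, row i is fetched from DRAM once per block that needs it rather than once per thread, roughly a factor-of-TPB reduction in global memory reads, which is exactly the reuse the shared-memory API is designed for.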
This exercise could give us a significant speed-up in GPU calculations.