
Harness CUDA's shared memory to speed up GPU calculations #517

@aktech

Description

We'll soon have a GPU implementation of the pairwise distance functionality: https://github.com/pystatgen/sgkit/pull/498

Primer on GPU and the Problem

The architecture of a GPU is organised into a grid; the grid contains blocks, and each block contains threads, which is where all the calculation happens.

During any calculation, each thread reads from DRAM unless it finds the item in a cache. Reading from DRAM is slow (read: very slow).
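For concreteness, here is a minimal Numba sketch (not sgkit code; the kernel name and sizes are made up) of this hierarchy: `cuda.grid(1)` combines the block and thread indices into one global thread position, and every array access inside the kernel is served from global memory (DRAM) unless it happens to be cached.

```python
from numba import cuda
import numpy as np

@cuda.jit
def scale(x, out):
    i = cuda.grid(1)           # blockIdx.x * blockDim.x + threadIdx.x
    if i < x.size:
        out[i] = 2.0 * x[i]    # this read goes to global memory (or a cache)

x = np.arange(10_000, dtype=np.float32)
out = np.zeros_like(x)
scale[40, 256](x, out)         # a grid of 40 blocks, 256 threads each
```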

Example:

Imagine a situation where we calculate pairwise distances between a bunch of vectors in a 2D chunk:

[
    v0,
    v1,
    v2,
    v3,
]

Each of these calculations is done by a separate thread. Here you can see:

  • Thread 1 will calculate (v0, v1),
  • Thread 2 will calculate (v0, v2),

so Thread 1 and Thread 2 both load the v0 array from memory, and so on for the other pairs (a naive kernel illustrating this redundant loading is sketched below).
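A minimal sketch of what such a naive kernel might look like (hypothetical, not the kernel from #498; Euclidean distance is assumed): every thread reads both of its rows directly from global memory, so v0 is fetched from DRAM once per thread that needs it.

```python
import math
from numba import cuda

@cuda.jit
def pairwise_naive(x, out):
    i, j = cuda.grid(2)                 # one thread per (i, j) pair
    if i < x.shape[0] and j < x.shape[0]:
        acc = 0.0
        for k in range(x.shape[1]):
            d = x[i, k] - x[j, k]       # both reads go to global memory;
            acc += d * d                # row i is re-fetched by every thread
        out[i, j] = math.sqrt(acc)      # whose pair includes it
```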

Solution

Numba's CUDA API provides a way to share memory between the threads in a block. This lets the threads in a block load data into shared memory only once and then have every thread reuse it, instead of each thread fetching the same data from DRAM.

Reference API: https://numba.pydata.org/numba-doc/latest/cuda/memory.html#shared-memory-and-thread-synchronization

This exercise could give us a significant speed-up in GPU calculations.
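As a rough illustration of the technique (again hypothetical, not the #498 kernel): each block stages TILE-wide slabs of the rows it needs in `cuda.shared.array` buffers, synchronises with `cuda.syncthreads()`, and then every thread computes from the on-chip copy, so each row slab is read from DRAM once per block instead of once per thread.

```python
import math
from numba import cuda, float32

TILE = 16   # block is TILE x TILE threads; shared-array shapes
            # must be compile-time constants in Numba

@cuda.jit
def pairwise_shared(x, out):
    i, j = cuda.grid(2)
    ti = cuda.threadIdx.x
    tj = cuda.threadIdx.y

    s_i = cuda.shared.array((TILE, TILE), dtype=float32)  # slabs of the "row i" vectors
    s_j = cuda.shared.array((TILE, TILE), dtype=float32)  # slabs of the "row j" vectors

    n, m = x.shape
    acc = 0.0
    for start in range(0, m, TILE):   # sweep the feature axis one slab at a time
        # Cooperative load: each thread fetches one element per tile, so a
        # row's slab hits DRAM once per block and is then shared by all
        # TILE threads in the block that need it.
        k = start + tj
        if i < n and k < m:
            s_i[ti, tj] = x[i, k]
        k = start + ti
        if j < n and k < m:
            s_j[tj, ti] = x[j, k]
        cuda.syncthreads()            # wait until both tiles are fully loaded
        if i < n and j < n:
            for t in range(min(TILE, m - start)):
                d = s_i[ti, t] - s_j[tj, t]   # served from shared memory
                acc += d * d
        cuda.syncthreads()            # don't overwrite tiles still in use
    if i < n and j < n:
        out[i, j] = math.sqrt(acc)
```

Launched with a 2D grid covering all (i, j) pairs, e.g.:

```python
n = x.shape[0]
blocks = ((n + TILE - 1) // TILE,) * 2
pairwise_shared[blocks, (TILE, TILE)](x, out)
```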
