In all of our GPU calculations we never use "shared memory"; instead, we rely heavily on storing data in the (fast but limited) "on-chip", "thread-local" GPU registers. When the registers fill up, data spills over to the (very slow but more abundant) "off-chip", "thread-local" GPU local memory. GPU "shared memory" is the compromise between the two: it is on-chip and shared by all threads in a block, and in my recent experience using it can make computations 3x faster! Of course, coding with shared memory can increase the complexity of the codebase and thereby decrease its long-term readability/maintainability.
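To make the trade-off concrete, here is a minimal sketch of the usual pattern: each thread stages its value into a block-wide shared-memory buffer, the block synchronizes, and the threads then cooperate on the on-chip data instead of each round-tripping through global or local memory. The kernel and names below are illustrative, not taken from our codebase:

```cuda
// Sketch: per-block sum reduction using shared memory.
// (Hypothetical example; assumes a block size of 256 threads.)
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];              // on-chip buffer shared by the whole block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Stage one element per thread from global memory into shared memory.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // all loads must finish before reuse

    // Tree reduction entirely in fast on-chip memory:
    // no spills to slow off-chip local memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = tile[0]; // one partial sum per block
}
```

The extra complexity is visible even in this small example: the explicit staging step and the `__syncthreads()` barriers are exactly the kind of code that hurts long-term readability while buying the speedup.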
This GPU tutorial provides an EXCELLENT overview (Tutorial 5 in particular was super enlightening)!