Open
Description
Several years ago, we considered (see #266) adding a variant of GPU-STUMP that utilized cooperative groups and that would allow us to push the multiple kernel launches onto the device. Earlier work was concerned about:
- Breaking backwards compatibility
- Adding unnecessary complexity to the code
However, cudatoolkit support is much better now and older GPUs that lack cooperative group support are likely end-of-life (and so the above concerns are likely a thing of the pst now). Additionally, numba
has moved ahead many, many versions since our last attempt. Thus, we should reconsider adding this to STUMPY. PR #266 provides some clear code for how to proceed and had demonstrated a 12% speedup, which is great!
See also the numba
docs on cooperative groups