Description
Efficient GPU kernels for block-sparse matrix multiplication and convolution
I clicked on this thinking it was a general library, maybe OpenCL, scrolled down, and got a bit peeved.
The only code here is written in non-portable CUDA and non-portable GPU assembly; NVidia cards are required unless you run it through HIP conversion, after which the kernels are no longer necessarily the most efficient.
I might be getting one of their workstation cards later this year to get the best of both worlds, but NVidia aren't the only GPU vendor; for general-purpose compute, their current model costs 170% more and still gets beaten by a 7900 XTX. I have no brand preference. In fact, if the drivers don't clash, I plan on having an Arc A770 and an A6000 in this machine alongside the 7900 XTX by the end of the year, to get the best of everything for 3D rendering, or to use the low-power Arc for inference, since it's as fast as the 7900 XTX (with the XTX running under Shark, the fastest way of running anything) and both are faster than the A6000 for that. The NVidia card should still render some scenes faster, though, and will probably remain the easiest way to do local training, given how many libraries assume CUDA and how long it will take them to make the slight changes required to fix that.

Anyway, my point is: tagging this correctly as "Efficient NVidia CUDA / assembly kernels for..." would be the user-friendly thing to do.