gpu-next: using VK_KHR_cooperative_matrix extension #12144
Just to be clear, adding support for this to mpv or libplacebo is the least important blocking issue. It could be as simple as adding a

Utilizing this extension (in a user shader), however, is quite complicated. This is especially true for CNN shaders like FSRCNNX and Anime4K. It basically means writing a new shader from scratch with 10x the complexity (compute shader, subgroups, buffer storage, batch processing, fp16, ...). And even if all of that is done, different vendor implementations expose different subgroup sizes and coopMatMul kernel sizes, and their performance will vary between GPUs. Modern DL frameworks like PyTorch and TensorFlow actually compare and choose between different kernels at runtime for best performance. So, instead of opening a meaningless feature request here, you should probably go to those repos and open an FR there.

Side story: I thought about using this extension in my nnedi3 shader, because it's a much simpler case: a single layer, and the kernel sizes (8x4 and 8x6) happen to be multiples of 16, so no routines are needed for leftovers. But it still requires a lot of effort, and probably too much for a somewhat outdated model like nnedi3.
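For reference, here is a minimal sketch of what the core of such a shader looks like with `GL_KHR_cooperative_matrix` (the GLSL side of `VK_KHR_cooperative_matrix`): a plain fp16 GEMM tile loop, assuming 16x16x16 tiles, a subgroup size of 32, and dimensions that are exact multiples of 16. The buffer names and problem sizes are made up for illustration; the tile sizes actually supported have to be queried with `vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR` and differ per vendor:

```glsl
#version 450
#extension GL_KHR_cooperative_matrix : require
#extension GL_KHR_memory_scope_semantics : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// One subgroup per workgroup; the subgroup collectively owns each 16x16 tile.
// local_size_x must match the device's subgroup size (32 is an assumption).
layout(local_size_x = 32) in;

// Hypothetical layout: A is MxK, B is KxN, C is MxN, all row-major fp16.
const uint M = 64, N = 64, K = 64;

layout(set = 0, binding = 0) readonly  buffer BufA { float16_t a_data[]; };
layout(set = 0, binding = 1) readonly  buffer BufB { float16_t b_data[]; };
layout(set = 0, binding = 2) writeonly buffer BufC { float16_t c_data[]; };

void main() {
    // Each workgroup computes one 16x16 tile of C.
    uint tile_row = gl_WorkGroupID.y * 16;
    uint tile_col = gl_WorkGroupID.x * 16;

    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseAccumulator> acc =
        coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseAccumulator>(float16_t(0.0));

    // Walk the K dimension in 16-wide steps; K is assumed to be a multiple
    // of 16, so there is no leftover handling here.
    for (uint k = 0; k < K; k += 16) {
        coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseA> mat_a;
        coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseB> mat_b;
        coopMatLoad(mat_a, a_data, tile_row * K + k, K, gl_CooperativeMatrixLayoutRowMajor);
        coopMatLoad(mat_b, b_data, k * N + tile_col, N, gl_CooperativeMatrixLayoutRowMajor);
        acc = coopMatMulAdd(mat_a, mat_b, acc);
    }

    coopMatStore(acc, c_data, tile_row * N + tile_col, N, gl_CooperativeMatrixLayoutRowMajor);
}
```

The GEMM itself is the easy part; mapping a multi-layer CNN like FSRCNNX onto it (im2col/batching, leftovers, activation functions, and picking per-GPU tile sizes at runtime) is where the real work is.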
No, those shaders won't. Fast matrix multiplication only benefits massive convolution kernels with a large number of input channels (8, to be precise).
It could benefit dither generation, since you can create whatever noise pattern you'd like in the frequency domain and then do a DCT to get a spatial representation. But, outside of libplacebo, it could benefit some cases like denoising, or any sort of frequency-domain block processing. And, of course, it could be used for its intended purpose in a neural network.
I would definitely be interested in a fast full-image DCT implementation. Especially in combination with a

Could potentially use this e.g. instead of cascading gaussian blurs for very large blur factors, and maybe for extreme downscaling (16x or more), which is prohibitively expensive with conventional convolution (especially when the convolution kernel is unrolled); obviously also denoising, some types of film grain generation, full-frame blue noise dithering, etc...
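As a rough illustration (not anything that exists in libplacebo), a 16x16 block DCT maps nicely onto two cooperative-matrix multiplies per block, `Y = D * X * D^T`, with `D` being a precomputed DCT-II basis matrix uploaded as a buffer. The bindings, the fp16-everywhere configuration, and the assumption that the plane size is a multiple of 16 are all invented for the sketch:

```glsl
#version 450
#extension GL_KHR_cooperative_matrix : require
#extension GL_KHR_memory_scope_semantics : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 32) in;  // assumed subgroup size

// Hypothetical bindings: the image plane (row-major fp16), a precomputed
// 16x16 DCT-II basis matrix D and its transpose, and the output coefficients.
layout(set = 0, binding = 0) readonly  buffer Plane  { float16_t img[]; };
layout(set = 0, binding = 1) readonly  buffer Basis  { float16_t dct_d[]; };   // D
layout(set = 0, binding = 2) readonly  buffer BasisT { float16_t dct_dt[]; };  // D^T
layout(set = 0, binding = 3) writeonly buffer Coeffs { float16_t coefs[]; };

layout(push_constant) uniform PC { uint width; };  // plane width, multiple of 16

// Staging area to convert the intermediate from "accumulator" use to "A" use,
// since KHR cooperative matrices cannot change their use directly.
shared float16_t tmp[16 * 16];

void main() {
    // Each workgroup transforms one 16x16 block of the plane.
    uint bx = gl_WorkGroupID.x * 16;
    uint by = gl_WorkGroupID.y * 16;

    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseA> d_mat;
    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseB> x_mat;
    coopMatLoad(d_mat, dct_d, 0, 16, gl_CooperativeMatrixLayoutRowMajor);
    coopMatLoad(x_mat, img, by * width + bx, width, gl_CooperativeMatrixLayoutRowMajor);

    // Row pass: T = D * X
    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseAccumulator> t_acc =
        coopMatMulAdd(d_mat, x_mat,
            coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseAccumulator>(float16_t(0.0)));

    // Round-trip T through shared memory so it can be reloaded as an "A" matrix.
    coopMatStore(t_acc, tmp, 0, 16, gl_CooperativeMatrixLayoutRowMajor);
    barrier();

    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseA> t_mat;
    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseB> dt_mat;
    coopMatLoad(t_mat, tmp, 0, 16, gl_CooperativeMatrixLayoutRowMajor);
    coopMatLoad(dt_mat, dct_dt, 0, 16, gl_CooperativeMatrixLayoutRowMajor);

    // Column pass: Y = T * D^T, i.e. the full 2D DCT of the block.
    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseAccumulator> y_acc =
        coopMatMulAdd(t_mat, dt_mat,
            coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseAccumulator>(float16_t(0.0)));

    coopMatStore(y_acc, coefs, by * width + bx, width, gl_CooperativeMatrixLayoutRowMajor);
}
```

The inverse transform (for the frequency-domain noise shaping / dither case above) is the same thing with `D` and `D^T` swapped, assuming an orthonormal basis. Anything beyond a single block per workgroup, or a DCT size that isn't one of the supported tile sizes, needs considerably more plumbing.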