In CUDA C you can explicitly request vectorized loads/stores using the built-in vector types (`float2`, `float4`). I have sometimes found those useful to squeeze out the last bit of performance. This definitely isn't high priority, but I was wondering how hard it would be to add something similar to CUDAnative.
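For reference, here is a minimal sketch of the CUDA C pattern I mean. The kernel and helper names (`copy_scalar`, `copy_vec4`, `run_copy_check`) are made up for illustration, and the sketch assumes the buffers are 16-byte aligned, which holds for pointers returned by `cudaMalloc`. Casting to `float4` typically lets the compiler emit one 128-bit load/store per thread instead of four 32-bit ones.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Scalar copy for comparison: one 4-byte load and store per thread.
__global__ void copy_scalar(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Vectorized copy: each thread moves one float4 (16 bytes) per access.
// Assumption: `in` and `out` are 16-byte aligned.
__global__ void copy_vec4(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;  // number of complete float4 chunks
    if (i < n4) {
        const float4* in4 = reinterpret_cast<const float4*>(in);
        float4* out4 = reinterpret_cast<float4*>(out);
        out4[i] = in4[i];
    }
    // The first (n % 4) threads copy the leftover scalar elements.
    if (i < n % 4) {
        int tail = n4 * 4 + i;
        out[tail] = in[tail];
    }
}

// Hypothetical host helper: runs copy_vec4 and returns the mismatch count.
int run_copy_check(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* h_in = static_cast<float*>(malloc(bytes));
    float* h_out = static_cast<float*>(malloc(bytes));
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    int threads = 256;
    int blocks = (n / 4) / threads + 1;  // always at least one block
    copy_vec4<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i]) ++mismatches;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return mismatches;
}

int main() {
    // Length deliberately not a multiple of 4 to exercise the tail path.
    int bad = run_copy_check((1 << 20) + 3);
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

Whether this actually beats the scalar version depends on the kernel and the hardware, so it is worth benchmarking both.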
JuliaGPU/CUDAnative.jl#174 is related, but maybe some of the problems have been solved?