Description
This is mainly to start a conversation around the KA kernel language, as it has been accumulating more functionality / cruft; for example, if I want a high-performance kernel like one written in raw CUDA C++ (but backend- and type-agnostic, with all the Julia niceties), kernels start to look like:
```julia
@kernel unsafe_indices=true cpu=false inbounds=true function somekernel(arg1, @Const(arg2))
    ...
end
```
What I'd expect by default - a GPU kernel with performance comparable to CUDA - is not really where the language guides me, as I need to add `@kernel unsafe_indices=true cpu=false inbounds=true` to get close. Even then, with the recent `@synchronize` lane checks, we see big performance hits in previously well-performing code (e.g. from 540 ms to 1.54 s for a sum - see issue).
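For concreteness, the pattern I have in mind is a plain workgroup reduction like the sketch below (illustrative only, not the exact code from the linked issue); every `@synchronize` inside the loop is a point where the extra checks land:

```julia
using KernelAbstractions

# Illustrative block-level sum: each workgroup reduces its slice of `a` in
# local memory and writes one partial result per group to `out`. Assumes a
# statically sized, power-of-two workgroup and length(a) divisible by the
# workgroup size.
@kernel unsafe_indices=true cpu=false inbounds=true function block_sum!(out, @Const(a))
    bi = @index(Group, Linear)
    li = @index(Local, Linear)
    n = @groupsize()[1]

    shared = @localmem eltype(a) (@groupsize()[1],)
    shared[li] = a[(bi - 1) * n + li]
    @synchronize()

    # Tree reduction: halve the number of active lanes each step.
    s = n ÷ 2
    while s > 0
        if li <= s
            shared[li] += shared[li + s]
        end
        @synchronize()   # one barrier per iteration
        s ÷= 2
    end

    if li == 1
        out[bi] = shared[1]
    end
end
```

Launched as something like `block_sum!(backend, block_size)(partials, a; ndrange = length(a))`, with a second pass over the per-group partials.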
Perhaps this is the point where I should emphasise how much I appreciate KernelAbstractions and the titanic work put into it and the JuliaGPU ecosystem. I hope this post does not come across as sweeping criticism, but as a discussion of possible future improvements (of course, "improvements" here being simply my personal opinion, based on the work I do and how I'm using KA for HPC code).
Having followed KA development for a few years now, I understand the constraints that evolved the current KA interface - implicit boundschecks, separate CPU and GPU compilation pipelines, `ndrange` being, well, a range and not the `blocksize` and `nblocks` seen in CUDA, divergent `synchronize`, etc.
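To spell out the `ndrange` point, here is a rough side-by-side of the two launch models (illustrative only; `somekernel_cuda` is a hypothetical raw CUDA.jl version of the same kernel):

```julia
using KernelAbstractions, CUDA

# KernelAbstractions: the caller gives a global ndrange plus a workgroup size;
# the number of blocks is derived, and an implicit bounds check guards the
# trailing partial workgroup.
kernel = somekernel(backend, 128)               # workgroupsize = 128
kernel(arg1, arg2; ndrange = length(arg1))      # nblocks = cld(length(arg1), 128)

# CUDA.jl: threads and blocks are given explicitly.
@cuda threads=128 blocks=cld(length(arg1), 128) somekernel_cuda(arg1, arg2)
```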
Would there be a possibility for, say, a `@rawkernel`, with more minimal functionality:
```julia
@rawkernel function somekernel(arg1, @const(arg2))
    # Closely mimic the typical GPU API (CUDA, OpenCL), only exposing the local and block indices
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end
```
```julia
# Example syntax to get the point across - I don't care much about it now, just the functionality
block_size = 128
somekernel{backend, block_size}(arg1, arg2, nblocks=18)
```
Or more JuliaGPU-like kernel syntax:
```julia
function somekernel(arg1, @const(arg2))
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end
```
```julia
result = @ka backend block_size=128 blocks=18 somekernel(arg1, arg2)

# Or create callable object
fkernel = @ka backend block_size=128 somekernel
fkernel(arg1, arg2, blocks=18)
```
This would map very closely to the GPU backends' kernel languages; I think it would have a few advantages:
- Simpler to implement and maintain: e.g. no need to inject divergent synchronization checks.
- Simpler to transpile to the right GPU backend (maybe even transpile Julia-to-Julia, then let the backend do the work? - see the sketch after this list).
- Simpler, more concise syntax.
- More consistent usage with the corresponding JuliaGPU `@cuda`, `@metal`, etc. kernels.
- And most importantly, performance as you'd expect from the equivalent CUDA C++ kernel.
What are your thoughts?