Description
This is mainly to start a conversation around the KA kernel language, as it has been accumulating more functionality / cruft; for example, if I want a high-performance kernel like one written in raw CUDA C++ (but backend- and type-agnostic, with all the Julia niceties), kernels start to look like:
```julia
@kernel unsafe_indices=true cpu=false inbounds=true function somekernel(arg1, @Const(arg2))
    ...
end
```
What I'd expect by default - a GPU kernel with performance comparable to CUDA - is not really where the language guides me, as I need to add `@kernel unsafe_indices=true cpu=false inbounds=true` to get close. Even then, with the recent `@synchronize` lane checks, we see big performance hits in previously well-performing code (e.g. from 540 ms to 1.54 s for a sum - see issue).
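For concreteness, the pattern I have in mind is a plain workgroup reduction like the sketch below (illustrative only, not the exact code from the linked issue); every `@synchronize` inside the loop is a point where the extra checks land:

```julia
using KernelAbstractions

# Illustrative block-level sum: each workgroup reduces its slice of `a` in
# local memory and writes one partial result per group to `out`. Assumes a
# statically sized, power-of-two workgroup and length(a) divisible by the
# workgroup size.
@kernel unsafe_indices=true cpu=false inbounds=true function block_sum!(out, @Const(a))
    bi = @index(Group, Linear)
    li = @index(Local, Linear)
    n = @groupsize()[1]

    shared = @localmem eltype(a) (@groupsize()[1],)
    shared[li] = a[(bi - 1) * n + li]
    @synchronize()

    # Tree reduction: halve the number of active lanes each step.
    s = n ÷ 2
    while s > 0
        if li <= s
            shared[li] += shared[li + s]
        end
        @synchronize()   # one barrier per iteration
        s ÷= 2
    end

    if li == 1
        out[bi] = shared[1]
    end
end
```

Launched as something like `block_sum!(backend, block_size)(partials, a; ndrange = length(a))`, with a second pass over the per-group partials.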
Perhaps this is the point where I should emphasise how much I appreciate KernelAbstractions and the titanic work put into it and the JuliaGPU ecosystem. I hope this post does not come across as sweeping criticism, but as a discussion of possible future improvements (of course, "improvements" here being simply my personal opinion, based on the work I do and how I'm using KA for HPC code).
Having followed KA development for a few years now, I understand the constraints that evolved the current KA interface - implicit boundschecks, separate CPU and GPU compilation pipelines, `ndrange` being, well, a range and not the `blocksize` and `nblocks` seen in CUDA, divergent `synchronize`, etc.
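To spell out the `ndrange` point, here is a rough side-by-side of the two launch models (illustrative only; `somekernel_cuda` is a hypothetical raw CUDA.jl version of the same kernel):

```julia
using KernelAbstractions, CUDA

# KernelAbstractions: the caller gives a global ndrange plus a workgroup size;
# the number of blocks is derived, and an implicit bounds check guards the
# trailing partial workgroup.
kernel = somekernel(backend, 128)               # workgroupsize = 128
kernel(arg1, arg2; ndrange = length(arg1))      # nblocks = cld(length(arg1), 128)

# CUDA.jl: threads and blocks are given explicitly.
@cuda threads=128 blocks=cld(length(arg1), 128) somekernel_cuda(arg1, arg2)
```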
Would there be a possibility for, say, a `@rawkernel`, with more minimal functionality:
```julia
@rawkernel function somekernel(arg1, @const(arg2))
    # Closely mimic the typical GPU API (CUDA, OpenCL), only exposing the local and block indices
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end
```
```julia
# Example syntax to get the point across - I don't care much about it now, just the functionality
block_size = 128
somekernel{backend, block_size}(arg1, arg2, nblocks=18)
```
Or more JuliaGPU-like kernel syntax:
```julia
function somekernel(arg1, @const(arg2))
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end
```
```julia
result = @ka backend block_size=128 blocks=18 somekernel(arg1, arg2)

# Or create callable object
fkernel = @ka backend block_size=128 somekernel
fkernel(arg1, arg2, blocks=18)
```
This would map very closely to the GPU backends' kernel languages; I think it would have a few advantages:
- Simpler to implement and maintain: e.g. no need to inject divergent synchronization checks.
- Simpler to transpile to the right GPU backend (maybe even transpile Julia-to-Julia, then let the backend do the work? - see the sketch after this list).
- Simpler, more concise syntax.
- More consistent usage with the corresponding JuliaGPU `@cuda`, `@metal`, etc. kernels.
- And most importantly, performance as you'd expect from the equivalent CUDA C++ kernel.
What are your thoughts?