Lower-level kernel form? #578

Open

@anicusan

Description

This is mainly to start a conversation around the KA kernel language, as it has been accumulating more functionality / cruft; for example, if I want a high-performance kernel as I would write it in raw CUDA C++ (but backend- and type-agnostic and with all the Julia niceties), kernels start to look like:

@kernel unsafe_indices=true cpu=false inbounds=true function somekernel(arg1, @Const(arg2))
    ...
end

What I'd expect by default - a GPU kernel with performance comparable to CUDA - is not really what the language guides me towards, as I need to add @kernel unsafe_indices=true cpu=false inbounds=true to get close. Even then, with the recent @synchronize lane checks, we see big performance hits in previously well-performing code (e.g. from 540 ms to 1.54 s for a sum - see issue).
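
To make the @synchronize cost concrete, here is a minimal sketch of the kind of per-workgroup tree reduction such a sum typically uses (the kernel and its names are illustrative, not the benchmarked code; it assumes a power-of-two workgroup size and that length(x) is a multiple of it) - every @synchronize() inside the loop is a point where the new lane checks get injected:

using KernelAbstractions

@kernel function block_sum!(partial, @Const(x))
    li = @index(Local, Linear)
    gi = @index(Global, Linear)
    bi = @index(Group, Linear)

    N = @uniform @groupsize()[1]
    shared = @localmem eltype(x) (N,)

    @inbounds shared[li] = x[gi]
    @synchronize()

    # Tree reduction within one workgroup; each iteration halves the stride.
    s = N ÷ 2
    while s >= 1
        if li <= s
            @inbounds shared[li] += shared[li + s]
        end
        @synchronize()
        s ÷= 2
    end

    # First lane writes this workgroup's partial sum.
    if li == 1
        @inbounds partial[bi] = shared[1]
    end
end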

Perhaps this is the point where I should emphasise how much I appreciate KernelAbstractions and the titanic work put into it and the JuliaGPU ecosystem. I hope this post does not come across as sweeping criticism, but rather as a discussion of possible future improvements (of course, "improvements" here being simply my personal opinion, based on the work I do and how I'm using KA for HPC code).

Having followed KA development for a few years now, I understand the constraints that shaped the current KA interface - implicit boundschecks, separate CPU and GPU compilation pipelines, ndrange being, well, a range and not the blocksize and nblocks seen in CUDA, divergent synchronize, etc.
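
For reference, this is roughly how a kernel is launched under the current ndrange convention (argument names are illustrative): the caller specifies the total work size and an optional workgroup size, and the number of blocks is derived internally.

using KernelAbstractions

backend = get_backend(arg1)               # e.g. CUDABackend() for CUDA arrays
kern = somekernel(backend, 128)           # instantiate with workgroupsize = 128
kern(arg1, arg2; ndrange = length(arg1))  # total work items, not nblocks
KernelAbstractions.synchronize(backend)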

Would there be a possibility for, say, a @rawkernel, with more minimal functionality:

@rawkernel function somekernel(arg1, @const(arg2))
    # Closely mimic the typical GPU API (CUDA, OpenCL), exposing only the local and block indices
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end

# Example syntax to get the point across - I don't care much about the exact form now, just the functionality
block_size = 128
somekernel{backend, block_size}(arg1, arg2, nblocks=18)

Or a more JuliaGPU-like kernel syntax:

function somekernel(arg1, @const(arg2))
    bi = get_group_id(1)
    li = get_local_id(1)
    @synchronize()
    ...
end

result = @ka backend block_size=128 blocks=18 somekernel(arg1, arg2)

# Or create callable object
fkernel = @ka backend block_size=128 somekernel
fkernel(arg1, arg2, blocks=18)

This would map very closely to the GPU backend's kernel language; I think it would have a few advantages:

  • Simpler to implement and maintain: e.g. no need to inject divergent synchronization checks.
  • Simpler to transpile to the right GPU backend (maybe even transpile Julia-to-Julia, then let the backend do the work?).
  • Simpler, more concise syntax.
  • More consistent usage with the corresponding JuliaGPU @cuda, @metal, etc. kernels (see the CUDA.jl sketch after this list).
  • And most importantly, performance as you'd expect from the equivalent CUDA C++ kernel.
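
For comparison, a sketch of roughly what the corresponding raw CUDA.jl kernel and launch look like today (the kernel body and names are illustrative, not an existing KA API):

using CUDA

function somekernel_cuda(arg1, arg2)
    bi = blockIdx().x
    li = threadIdx().x
    # ... kernel body ...
    sync_threads()
    return nothing
end

@cuda threads=128 blocks=18 somekernel_cuda(arg1, arg2)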

What are your thoughts?
