Memory ordering APIs for atomic operations #1353

Open
@tkf

Description

Is your feature request related to a problem? Please describe.

CUDA.jl doesn't provide an API to specify the memory ordering of atomic operations. In particular, it would be nice to be able to use the monotonic (aka relaxed) ordering for applications like histograms and gradient descent.

Describe the solution you'd like

Provide the full set of Julia 1.7 atomic orderings.

Describe alternatives you've considered

Always using monotonic in CUDA.atomic_* and CUDA.@atomic may actually be a decent solution, given the dominant applications in Julia. Other orderings don't matter unless you want to do very intricate concurrent programming. However, it would be nice to align the syntax with Base to make it easier to write code that is usable on both GPU and CPU.

Additional context

I noticed atomics.jl is using

```julia
# all atomic operations have acquire and/or release semantics,
# depending on whether they load or store values (mimics Base)
const atomic_acquire = LLVM.API.LLVMAtomicOrderingAcquire
const atomic_release = LLVM.API.LLVMAtomicOrderingRelease
const atomic_acquire_release = LLVM.API.LLVMAtomicOrderingAcquireRelease
```

by default but I couldn't find the API to set the ordering.

I'm also a bit confused about the atomics situation in CUDA in general. The CUDA C++ Programming Guide, B.14. Atomic Functions, says:

> Atomic functions do not act as memory fences and do not imply synchronization or ordering constraints for memory operations (see Memory Fence Functions for more details on memory fences). Atomic functions can only be used in device functions.

Indeed, functions like atomicCAS and atomicAdd don't seem to take an ordering argument. So, I wonder if they are all monotonic?
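To make the question concrete in C++ memory-model terms: an atomic operation that "does not act as a memory fence" behaves like a `std::memory_order_relaxed` operation, and any ordering of the surrounding non-atomic accesses has to come from explicit fences. The sketch below is host-side standard C++, not CUDA device code. The function name `publish_with_fences` is made up for illustration, and I have not verified how this maps onto the CUDA intrinsics or PTX.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo (host C++ only): hand a plain int from one thread to
// another using a relaxed atomic flag plus explicit fences.
int publish_with_fences(int value) {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag accessed only with relaxed ordering
    int observed = -1;

    std::thread producer([&] {
        payload = value;
        // Release fence: the payload write is ordered before the flag store.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        // The relaxed load alone gives no ordering for `payload`.
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: pairs with the release fence in the producer.
        std::atomic_thread_fence(std::memory_order_acquire);
        observed = payload;
    });

    producer.join();
    consumer.join();
    // Guaranteed by the fence pairing above, not by the relaxed flag itself.
    assert(observed == value);
    return observed;
}
```

Without the two fences, the read of `payload` would be a data race under the C++ model even though the flag itself is atomic, which is the distinction the quoted paragraph seems to draw.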

On the other hand, https://github.com/NVIDIA/libcudacxx does seem to try to provide C++'s std atomics interface. I can see that different assemblies are generated with different orderings:

So I guess this means the hardware actually supports different orderings? (But they weren't exposed to the programmer before libcu++?)

Anyway, it would be nice to have access to these instructions. I think it's important to at least provide monotonic ordering, which is what you need for a "reduction" into a large object like a histogram.
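For reference, here is a minimal host-side C++ sketch of the histogram pattern, using `std::memory_order_relaxed` (what LLVM and Julia call monotonic). It is a CPU analogue under stated assumptions, not CUDA.jl or libcu++ API: the helper name `relaxed_histogram` is hypothetical, and the input values are assumed to be non-negative.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: count values into bins from several host threads,
// using only relaxed (monotonic) atomic increments.
// Assumes all values in `data` are non-negative.
std::vector<int> relaxed_histogram(const std::vector<int>& data,
                                   std::size_t num_bins,
                                   std::size_t num_threads) {
    assert(num_bins > 0 && num_threads > 0);

    std::vector<std::atomic<int>> bins(num_bins);
    for (auto& b : bins) b.store(0, std::memory_order_relaxed);

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread processes a strided slice of the input.
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                std::size_t bin = static_cast<std::size_t>(data[i]) % num_bins;
                // Only atomicity of the increment matters here, not ordering
                // relative to other memory operations.
                bins[bin].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with each worker, so the loads below see all counts.
    for (auto& w : workers) w.join();

    std::vector<int> result(num_bins);
    for (std::size_t b = 0; b < num_bins; ++b) {
        result[b] = bins[b].load(std::memory_order_relaxed);
    }
    return result;
}
```

The counts are only read after all workers have finished, so no acquire/release semantics are needed on the increments themselves. That is why monotonic is sufficient for this kind of reduction.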

Labels: cuda kernels (Stuff about writing CUDA kernels.), enhancement (New feature or request), help wanted (Extra attention is needed)
