Description
Is your feature request related to a problem? Please describe.
CUDA.jl doesn't provide APIs to specify memory ordering of atomic operations. In particular, it would be nice to be able to use the monotonic (aka relaxed) ordering for applications like histogram and gradient descent.
Describe the solution you'd like
Provide the full set of Julia 1.7 atomic orderings.
Describe alternatives you've considered
Always using monotonic in CUDA.atomic_* and CUDA.@atomic may actually be a decent solution, given the dominant applications in Julia. Other orderings don't matter unless you want to do very intricate concurrent programming. However, it would be nice to align the syntax with Base to facilitate writing code that is usable on both GPU and CPU.
Additional context
I noticed that atomics.jl sets an ordering by default (see CUDA.jl/src/device/intrinsics/atomics.jl, lines 13 to 17 in 55ed099), but I couldn't find an API to set the ordering.
I'm also a bit confused about the atomics situation on CUDA in general. In the CUDA C++ Programming Guide, section B.14. Atomic Functions says:
Atomic functions do not act as memory fences and do not imply synchronization or ordering constraints for memory operations (see Memory Fence Functions for more details on memory fences). Atomic functions can only be used in device functions.
Indeed, functions like atomicCAS and atomicAdd don't seem to take an ordering as an argument. So, I wonder if they are all monotonic?
On the other hand, https://github.com/NVIDIA/libcudacxx does seem to try to provide C++'s std atomics interface. I can see that different assemblies are generated with different orderings:
cuda::std::atomic::compare_exchange_weak: https://godbolt.org/z/nYqKx6WE8
cuda::std::atomic::fetch_add, atomicAdd, atomicAdd_system: https://godbolt.org/z/o6areY84z (I'm a bit puzzled that std::atomic and atomicAdd produce different instructions)
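To make the contrast with atomicCAS and atomicAdd concrete, here is a minimal host-side sketch of the C++ std atomics interface, where the ordering is an explicit argument at each call site. It uses plain std::atomic on the CPU rather than cuda::std::atomic, on the assumption that the libcu++ type is called the same way. The function names (fetch_max_relaxed, add_relaxed, add_seq_cst) are made up for illustration and are not part of any library.

```cpp
#include <atomic>

// Atomically raise `target` to at least `value` and return the value that
// was observed before the update. Both the success and the failure ordering
// are spelled out, which is exactly what atomicCAS has no parameter for.
int fetch_max_relaxed(std::atomic<int>& target, int value) {
    int seen = target.load(std::memory_order_relaxed);
    // When the exchange fails, compare_exchange_weak stores the current
    // value of `target` into `seen`, so the loop condition is re-evaluated
    // against fresh data.
    while (seen < value &&
           !target.compare_exchange_weak(seen, value,
                                         std::memory_order_relaxed,    // ordering on success
                                         std::memory_order_relaxed)) { // ordering on failure
    }
    return seen;
}

// The same read-modify-write with two different orderings. On the host the
// results are identical; only the ordering guarantees (and possibly the
// generated instructions) differ.
int add_relaxed(std::atomic<int>& counter, int n) {
    return counter.fetch_add(n, std::memory_order_relaxed);
}

int add_seq_cst(std::atomic<int>& counter, int n) {
    return counter.fetch_add(n, std::memory_order_seq_cst);
}
```

Calling fetch_max_relaxed(m, 9) on an atomic holding 5 returns 5 and leaves 9 behind; a later call with 3 returns 9 and changes nothing.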
So I guess this means the hardware actually supports different orderings? (But they weren't exposed to the programmer before libcu++?)
Anyway, I think it'd be nice to have access to these instructions. At a minimum, it's important to provide the monotonic ordering, which is what you need for a "reduction" into a large object like a histogram.
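As an illustration of why monotonic (relaxed) ordering is enough for a histogram, here is a minimal CPU sketch in C++ with std::atomic and std::thread. It is a host-side analogue, not CUDA.jl or device code, and relaxed_histogram is a hypothetical name. Each increment only needs to be atomic; nothing else is published through the bins, and the final counts are read only after every worker has been joined.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Count how often each bin index occurs in `data`, using `num_threads`
// workers that all update one shared vector of atomic bins.
// Assumes num_bins > 0; values are reduced modulo num_bins to stay in range.
std::vector<unsigned> relaxed_histogram(const std::vector<unsigned>& data,
                                        std::size_t num_bins,
                                        std::size_t num_threads) {
    std::vector<std::atomic<unsigned>> bins(num_bins);
    for (auto& bin : bins) bin.store(0, std::memory_order_relaxed);

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Strided partition: worker t handles elements t, t + num_threads, ...
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                // Relaxed is sufficient: the increment must not be lost, but
                // no other memory operation depends on its order.
                bins[data[i] % num_bins].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with the end of each worker, so every increment
    // is visible to the loads below.
    for (auto& worker : workers) worker.join();

    std::vector<unsigned> result(num_bins);
    for (std::size_t b = 0; b < num_bins; ++b) {
        result[b] = bins[b].load(std::memory_order_relaxed);
    }
    return result;
}
```

For example, relaxed_histogram(data, 4, 3) on 1000 values cycling through 0, 1, 2, 3 yields 250 in every bin. On some toolchains the program needs to be built with -pthread.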