[SYCL][CUDA] Add initial support for FP atomics #3276

Merged
merged 2 commits into intel:sycl on Mar 8, 2021

Conversation

Pennycook
Contributor

Generates native FP32 and FP64 atomics with the following flags:
-DSYCL_USE_NATIVE_FP_ATOMICS -Xsycl-target-backend --cuda-gpu-arch=sm_60

Several known issues:

  • __spirv_AtomicFAddExt is not inlined, so order and scope do not propagate
  • Generated PTX does not respect order or scope (defaults to relaxed)
  • Fatal error when compiling with --cuda-gpu-arch <= sm_50

A complete implementation of this feature requires libspirv to be made aware
of __nvvm_reflect, so that NVVMReflect can be used to branch on __CUDA_ARCH.

Signed-off-by: John Pennycook <john.pennycook@intel.com>

@Pennycook Pennycook added the cuda CUDA back-end label Mar 1, 2021
@Pennycook Pennycook requested a review from bader as a code owner March 1, 2021 19:21
@Pennycook
Contributor Author

A note to reviewers: this covers a common and important use-case (relaxed floating-point atomics), improving performance significantly for that case. It is related to #2853, but shouldn't be considered a complete implementation and shouldn't close that issue.
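
To make the use case concrete, here is a minimal sketch of the kind of kernel this affects (written against the SYCL 2020 sycl::atomic_ref interface; the exact namespace in DPC++ at the time of this PR may differ, e.g. sycl::ONEAPI::atomic_ref):

#include <sycl/sycl.hpp>  // may be <CL/sycl.hpp> in older DPC++ releases

int main() {
  sycl::queue q;
  float *sum = sycl::malloc_shared<float>(1, q);
  *sum = 0.0f;

  q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1>) {
    // Relaxed FP32 atomic add. With -DSYCL_USE_NATIVE_FP_ATOMICS and
    // --cuda-gpu-arch=sm_60 this can lower to a native atomic add
    // instruction instead of a compare-and-swap loop in the headers.
    sycl::atomic_ref<float, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space>
        ref(*sum);
    ref.fetch_add(1.0f);
  }).wait();

  sycl::free(sum, q);
  return 0;
}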

@bader
Contributor

bader commented Mar 3, 2021

+@Naghasan and @steffenlarsen

@steffenlarsen
Contributor

Good work! To my knowledge, this should get https://github.com/intel/llvm-test-suite/blob/intel/SYCL/AtomicRef/add-native.cpp working with the CUDA BE. Have you tried removing cuda from UNSUPPORTED and checked if it passes?
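
For reference, the guard in question is a LIT directive at the top of that test, roughly of this form (the exact line in the test may list additional backends):

// UNSUPPORTED: cuda

Removing cuda from that list is what lets LIT schedule the test on the CUDA backend.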

@Pennycook
Contributor Author

Good work! To my knowledge, this should get https://github.com/intel/llvm-test-suite/blob/intel/SYCL/AtomicRef/add-native.cpp working with the CUDA BE. Have you tried removing cuda from UNSUPPORTED and checked if it passes?

Thanks! The test passes on my local machine, but only with -Xsycl-target-backend --cuda-gpu-arch=sm_60.

I wasn't sure whether to commit changes to the test or not. I'm not sure how to pass the required flags only when the test is being run on the CUDA backend, and we know that the test will still fail on older NVIDIA GPUs. I convinced myself that we shouldn't claim CUDA support until the test passes for the default GPU arch -- does that make sense?

@steffenlarsen
Contributor

I wasn't sure whether to commit changes to the test or not. I'm not sure how to pass the required flags only when the test is being run on the CUDA backend, and we know that the test will still fail on older NVIDIA GPUs. I convinced myself that we shouldn't claim CUDA support until the test passes for the default GPU arch -- does that make sense?

Makes sense to me, though I'm not sure that is achievable. I vaguely remember the default arch being bumped up to sm_50 due to atomics, so I wouldn't be surprised if floating-point (or at least 64-bit floating-point) atomics aren't supported before sm_60. If that is the case, then as long as the error is clear about the architecture problem, I don't think it's necessarily an issue. In that case, this should only become a problem if people use floating-point atomics on hardware that does not support them; otherwise they would just need to add the flag.

As for the test, what happens if you add the --cuda-gpu-arch=sm_60 flag to the general command? Do other backends just ignore it? If so, it could be an option, though it is arguably hacky.

In the future we could consider being smarter about the default architecture, e.g. by selecting the one that suits a device on the machine building it, defaulting back to sm_50 (or another) if there is no such device. The architecture flag should override this. Just spitballing. 😄

@Pennycook
Contributor Author

Makes sense to me, though I'm not sure that is achievable. I vaguely remember the default arch being bumped up to sm_50 due to atomics, so I wouldn't be surprised if floating-point (or at least 64-bit floating-point) atomics aren't supported before sm_60. If that is the case, then as long as the error is clear about the architecture problem, I don't think it's necessarily an issue. In that case, this should only become a problem if people use floating-point atomics on hardware that does not support them; otherwise they would just need to add the flag.

You're right that architectures prior to sm_60 won't be able to generate native floating-point atomic instructions, but that doesn't mean that they can't support the compilation path currently enabled by -DSYCL_USE_NATIVE_FP_ATOMICS. All that -DSYCL_USE_NATIVE_FP_ATOMICS does is replace a compare-and-swap loop in the DPC++ headers with a call to __spirv_AtomicFAddExt; it moves the decision of how to implement a floating-point atomic add out of the DPC++ headers and into the device compiler.
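
As an editorial sketch (not the literal DPC++ header code), the fallback that the macro removes is conceptually the following, assuming the SYCL 2020 atomic_ref interface:

#include <sycl/sycl.hpp>

// Emulated floating-point fetch_add, as the headers do without
// -DSYCL_USE_NATIVE_FP_ATOMICS: retry a compare-and-swap until no other
// work-item has modified the value in between. With the macro defined, the
// headers instead call __spirv_AtomicFAddExt and leave the choice of
// implementation to the device compiler.
inline float emulated_fetch_add(
    sycl::atomic_ref<float, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space> ref,
    float operand) {
  float expected = ref.load();
  while (!ref.compare_exchange_weak(expected, expected + operand)) {
    // On failure, expected is updated to the freshly observed value.
  }
  return expected;
}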

I may be wrong about this, but I think that with something like an NVVMReflect pass it would be possible to generate a compare-and-swap loop for sm_50 inside of libspirv. Then __spirv_AtomicFAddExt would give the correct result everywhere, without any special flags.
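
A rough sketch of what that could look like inside libspirv (hypothetical function and helper names; only the __nvvm_reflect mechanism is taken from the discussion above):

extern "C" int __nvvm_reflect(const char *);

// Hypothetical helpers; names are placeholders for illustration only.
float native_fadd(float *ptr, float value);    // would map to a hardware atomic add
float cas_loop_fadd(float *ptr, float value);  // compare-and-swap emulation

// Sketch of a libspirv-side dispatch on architecture. The NVVMReflect pass
// folds __nvvm_reflect("__CUDA_ARCH") to an architecture-identifying constant
// (conventionally 600 for sm_60), so the untaken branch is removed at compile
// time and a single libspirv build can serve all architectures.
float fadd_dispatch_sketch(float *ptr, float value) {
  if (__nvvm_reflect("__CUDA_ARCH") >= 600)
    return native_fadd(ptr, value);
  return cas_loop_fadd(ptr, value);
}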

As for the test, what happens if you add the --cuda-gpu-arch=sm_60 flag to the general command? Do other backends just ignore it? If so, it could be an option, though it is arguably hacky.

Good idea -- I hadn't thought to try this. The flag gets a little longer (because the target has to be specified explicitly), but this seems to work: -Xsycl-target-backend=nvptx64-nvidia-cuda-sycldevice --cuda-gpu-arch=sm_60
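
For concreteness, a full compile line along those lines might look like the following (a sketch using the usual DPC++ driver flags; not taken from the actual test):

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice \
  -DSYCL_USE_NATIVE_FP_ATOMICS \
  -Xsycl-target-backend=nvptx64-nvidia-cuda-sycldevice --cuda-gpu-arch=sm_60 \
  add-native.cpp -o add-native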

I'll defer to @bader on this. I don't know which NVIDIA architecture(s) are used for automated tests, and we might have to change the documentation if some tests are known to fail on older NVIDIA GPUs.

In the future we could consider being smarter about the default architecture, e.g. by selecting the one that suits a device on the machine building it, defaulting back to sm_50 (or another) if there is no such device. The architecture flag should override this. Just spitballing. 😄

Sounds like a good idea to me! I've been developing on sm_60, and didn't realize I was compiling for sm_50 until I encountered this issue with atomics.

@bader
Contributor

bader commented Mar 4, 2021

As for the test, what happens if you add the --cuda-gpu-arch=sm_60 flag to the general command? Do other backends just ignore it? If so, it could be an option, though it is arguably hacky.

Good idea -- I hadn't thought to try this. The flag gets a little longer (because the target has to be specified explicitly), but this seems to work: -Xsycl-target-backend=nvptx64-nvidia-cuda-sycldevice --cuda-gpu-arch=sm_60

I'll defer to @bader on this. I don't know which NVIDIA architecture(s) are used for automated tests, and we might have to change the documentation if some tests are known to fail on older NVIDIA GPUs.

According to the CI logs, the automation system uses a Titan RTX, which is the sm_75 architecture - http://ci.llvm.intel.com:8010/#/builders/37/builds/7236/steps/8/logs/stdio.

We probably should build the test for multiple targets. Passing a CUDA back-end-specific option while compiling for any target other than CUDA might lead to a compilation failure - the option will be passed to the device-specific back-end compiler.

@steffenlarsen
Contributor

I may be wrong about this, but I think that with something like an NVVMReflect pass it would be possible to generate a compare-and-swap loop for sm_50 inside of libspirv. Then __spirv_AtomicFAddExt would give the correct result everywhere, without any special flags.

If that is possible, that would be great! I don't know much about that pass, so I will trust your judgement. We could still consider having a warning though, as people might be tricked by this if they have hardware that supports sm_60+.

If it is not possible, then I think having it fail at compile time would be better, as - as far as I remember - it gives a nice error message about the architecture being off, and you can adjust it then and there.

Pennycook added a commit to Pennycook/llvm-test-suite that referenced this pull request Mar 5, 2021
Test for functionality from intel/llvm#3276

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@romanovvlad romanovvlad merged commit 78a0b19 into intel:sycl Mar 8, 2021