[CUDA] Atomic fetch_add call cannot be inlined #5429

Closed
@anton-v-gorshkov

Describe the bug
The DPC++ compiler for the NVIDIA PTX backend does not inline the atomic_ref::fetch_add function, which can lead to ~20% performance degradation vs. nvcc on the same device.

To Reproduce
histograms.zip

Steps to reproduce:

$ unzip histograms.zip
$ cd histograms/cuda
$ export PATH=$PATH:/usr/local/cuda/bin
$ ./build.sh
$ ./hist
Size   23013896 => min: 2.516 ms; avg: 2.637 ms; max: 3.149 ms
$ cd ../dpcpp
$ source <intel/llvm compiler>
$ ./build.sh
$ ./hist
Size   23013896 => min: 2.949 ms; avg: 3.155 ms; max: 3.767 ms

The DPC++ build is about 17-20% slower than the nvcc build.
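For reference, the slowdown range can be recomputed from the timings printed above. This is only a small sketch of that arithmetic; the helper name `slowdown_percent` is made up for illustration and is not part of the reproducer.

```cpp
#include <cassert>

// Hypothetical helper (not from histograms.zip): relative slowdown of
// `candidate_ms` versus `baseline_ms`, expressed in percent.
inline double slowdown_percent(double baseline_ms, double candidate_ms) {
  assert(baseline_ms > 0.0);  // guard against division by zero
  return (candidate_ms / baseline_ms - 1.0) * 100.0;
}

// Timings copied from the two ./hist runs above (nvcc first, DPC++ second).
inline double min_slowdown() { return slowdown_percent(2.516, 2.949); }
inline double avg_slowdown() { return slowdown_percent(2.637, 3.155); }
inline double max_slowdown() { return slowdown_percent(3.149, 3.767); }
```

With these numbers the min time is roughly 17% slower and the avg and max times are a little under 20% slower.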

Environment (please complete the following information):

Additional context
Initial investigation shows that the major difference between the two binaries is that with nvcc the atomic operation is fully inlined, while with DPC++/clang it is not: a call/ret sequence is used instead, which leads to more executed instructions and a higher number of branches.

For DPC++, the __spirv_AtomicFAddEXT function call is the one that is not inlined (in spirv.hpp):

template <typename T, access::address_space AddressSpace>
__attribute__((always_inline)) typename detail::enable_if_t<std::is_floating_point<T>::value, T>
AtomicFAdd(multi_ptr<T, AddressSpace> MPtr, memory_scope Scope,
           memory_order Order, T Value) {
  auto *Ptr = MPtr.get();
  auto SPIRVOrder = getMemorySemanticsMask(Order);
  auto SPIRVScope = getScope(Scope);
  return __spirv_AtomicFAddEXT(Ptr, SPIRVScope, SPIRVOrder, Value); // <--- HERE
}
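One thing the snippet above makes visible is that `always_inline` is attached to the wrapper, not to the function it calls. The host-side sketch below illustrates only that general C++ point under stated assumptions: `backend_fetch_add` is a hypothetical stand-in for a backend-provided builtin, it uses `std::atomic<int>` rather than a floating-point SYCL atomic, and it relies on GCC/Clang attribute syntax. It is not the DPC++ implementation and does not claim to identify the root cause here.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a backend builtin whose body the compiler
// chooses (or is told) not to inline. `noinline` forces that here.
__attribute__((noinline)) int backend_fetch_add(std::atomic<int> &target,
                                                int value) {
  return target.fetch_add(value, std::memory_order_relaxed);
}

// The wrapper is always inlined into its callers, but the attribute says
// nothing about the callee: the call to backend_fetch_add remains a call.
inline __attribute__((always_inline)) int
wrapped_fetch_add(std::atomic<int> &target, int value) {
  return backend_fetch_add(target, value);
}

// Small usage example: fetch_add returns the previous value.
inline void demo() {
  std::atomic<int> bin{5};
  int old = wrapped_fetch_add(bin, 3);
  assert(old == 5 && bin.load() == 8);
}
```

Inspecting the generated assembly for a caller of `wrapped_fetch_add` (for example with `-O2 -S`) should show the wrapper folded away while a call to `backend_fetch_add` remains, which resembles the call/ret pattern described in this report.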

Labels: blocker, bug, cuda, performance