Describe the bug
The DPC++ compiler for the NVIDIA PTX backend does not inline the atomic_ref::fetch_add call, which can lead to a ~20% performance degradation vs. nvcc on the same device.
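For reference, here is a minimal sketch (not the attached reproducer; sizes and names are illustrative) of the kind of code affected, written with the SYCL 2020 spellings (<sycl/sycl.hpp>, sycl::atomic_ref, sycl::gpu_selector_v); older DPC++ releases may need <CL/sycl.hpp> and sycl::ext::oneapi::atomic_ref instead. The float histogram update below goes through fetch_add, which is the call that ends up not being inlined on the PTX backend:

#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

int main() {
  constexpr std::size_t N = 1 << 20;   // illustrative size, not the attached benchmark
  constexpr std::size_t Bins = 256;
  std::vector<unsigned> input(N, 7u);
  std::vector<float> hist(Bins, 0.f);

  sycl::queue q{sycl::gpu_selector_v};
  {
    sycl::buffer<unsigned> inBuf{input};
    sycl::buffer<float> histBuf{hist};
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor in{inBuf, cgh, sycl::read_only};
      sycl::accessor out{histBuf, cgh, sycl::read_write};
      cgh.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
        sycl::atomic_ref<float, sycl::memory_order::relaxed,
                         sycl::memory_scope::device,
                         sycl::access::address_space::global_space>
            bin(out[in[i] % Bins]);
        bin.fetch_add(1.0f);   // lowered to __spirv_AtomicFAddEXT on the PTX backend
      });
    });
  } // buffer destruction copies the histogram back to the host vector
  return 0;
}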
To Reproduce
histograms.zip
Steps to reproduce:
$ unzip histograms.zip
$ cd histograms/cuda
$ export PATH=$PATH:/usr/local/cuda/bin
$ ./build.sh
$ ./hist
Size 23013896 => min: 2.516 ms; avg: 2.637 ms; max: 3.149 ms
$ cd ../dpcpp
$ source <intel/llvm compiler>
$ ./build.sh
$ ./hist
Size 23013896 => min: 2.949 ms; avg: 3.155 ms; max: 3.767 ms
The performance degradation is about 17-20%.
Environment (please complete the following information):
- OS: Ubuntu 20.04
- Target device and vendor: NVIDIA RTX 2070
- DPC++ version: clang version 14.0.0 (https://github.com/intel/llvm e15ac50)
- Dependencies version: CUDA 11.4
Additional context
Based on initial investigation, the major difference between the two binaries is that nvcc fully inlines the atomic operation, while DPC++/clang does not and emits a call/ret sequence instead, which leads to more executed instructions and a higher number of branches.
For DPC++ it was found that the __spirv_AtomicFAddEXT function call is not inlined (in spirv.hpp):
template <typename T, access::address_space AddressSpace>
__attribute__((always_inline))
typename detail::enable_if_t<std::is_floating_point<T>::value, T>
AtomicFAdd(multi_ptr<T, AddressSpace> MPtr, memory_scope Scope,
           memory_order Order, T Value) {
  auto *Ptr = MPtr.get();
  auto SPIRVOrder = getMemorySemanticsMask(Order);
  auto SPIRVScope = getScope(Scope);
  return __spirv_AtomicFAddEXT(Ptr, SPIRVScope, SPIRVOrder, Value); // <--- HERE
}
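Note that always_inline above applies only to the AtomicFAdd wrapper itself; the __spirv_AtomicFAddEXT call inside it targets a builtin whose definition is supplied by a separately linked device library (libclc/libspirv for the PTX backend), so it is only inlined if the optimizer does so after that library is linked in. A simplified illustration of this pattern (names are hypothetical, not the actual SYCL headers):

// Hypothetical names, for illustration only.
extern float external_atomic_add(float *Ptr, float Value); // definition linked in separately

__attribute__((always_inline)) inline float wrapper(float *Ptr, float Value) {
  // The wrapper body is inlined into its callers, but this inner call still
  // compiles to a call/ret sequence unless external_atomic_add itself is
  // inlined once its definition becomes visible to the optimizer.
  return external_atomic_add(Ptr, Value);
}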