Closed
Description
I'm trying to call a function from inside a SYCL kernel. The function is a member of a class that lives in a different namespace. The code is compiled with a clang++ built for the CUDA backend.
After adding the call to that function, the SYCL kernel runs slower than the equivalent native CUDA implementation.
What could be causing this drop in performance, and what optimizations could make the call performant in this scenario?
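
For reference, here is a minimal sketch of the pattern described above. It is not the actual code: the namespace `physics`, the class `Integrator`, the function `step`, and the wrapper `run_kernel` are hypothetical names chosen for illustration, and the arithmetic inside `step` is a placeholder. The sketch assumes a SYCL 2020 compiler that predefines `SYCL_LANGUAGE_VERSION`; without one, it falls back to a plain host loop so the same file still builds.

```cpp
#include <cstddef>
#include <vector>

#ifdef SYCL_LANGUAGE_VERSION
#include <sycl/sycl.hpp>
#endif

namespace physics {  // hypothetical namespace, separate from the kernel code

class Integrator {   // hypothetical class holding the function called from the kernel
public:
    // Defined in the class body, so the definition is visible in the same
    // translation unit as the kernel that calls it.
    static float step(float x, float dt) {
        return x + dt * (1.0f - x);  // placeholder arithmetic
    }
};

}  // namespace physics

// Applies Integrator::step once to each of n elements, all starting at 0.
std::vector<float> run_kernel(std::size_t n, float dt) {
    std::vector<float> data(n, 0.0f);

#ifdef SYCL_LANGUAGE_VERSION
    sycl::queue q;  // default device selection
    {
        sycl::buffer<float, 1> buf(data.data(), sycl::range<1>(n));
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc(buf, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                // The call in question: a function from another
                // namespace and class, invoked inside the kernel.
                acc[i] = physics::Integrator::step(acc[i], dt);
            });
        });
    }  // the buffer destructor waits for the kernel and copies results back
#else
    // Host fallback used when no SYCL compiler is available.
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = physics::Integrator::step(data[i], dt);
    }
#endif

    return data;
}
```

Calling `run_kernel(1 << 20, 0.5f)` and timing it against the CUDA version of the same loop would reproduce the comparison in question. In the real code the function may be defined in a separate source file rather than in the class body, which is one detail worth stating when reporting the timings.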