[SYCL][CUDA] Optimal way of invoking device functions from SYCL kernel

I'm trying to call a function from inside a SYCL kernel. The function is defined in a different namespace and class. The code is compiled with clang++ built for the CUDA backend.

I observe a performance drop when the SYCL kernel invokes that function, compared with the native CUDA implementation.
What could be causing this drop, and what optimizations could make the call performant?
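For concreteness, the call pattern described above might look roughly like the sketch below. Every name in it (`mathlib`, `Scaler`, `apply`, `scale_all`) is hypothetical and not taken from the actual code. The sketch assumes the member function is `static` and defined inline in a header, so its body is visible in the same translation unit as the kernel. The host fallback is there only so the snippet also builds without a SYCL toolchain. It illustrates the setup and is not a diagnosis of the slowdown.

```cpp
#include <cstddef>
#include <vector>

#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>
#define HAVE_SYCL 1
#endif

// Hypothetical helper living in its own namespace and class.
namespace mathlib {
class Scaler {
public:
  // Defined inline so the body is visible wherever the kernel is compiled.
  static inline float apply(float x, float factor) {
    return x * factor + 1.0f;
  }
};
} // namespace mathlib

// Applies Scaler::apply to every element and returns the result.
std::vector<float> scale_all(std::vector<float> data, float factor) {
  if (data.empty()) return data;
#ifdef HAVE_SYCL
  sycl::queue q;
  {
    sycl::buffer<float, 1> buf(data.data(), sycl::range<1>(data.size()));
    q.submit([&](sycl::handler &h) {
      sycl::accessor acc(buf, h, sycl::read_write);
      // The kernel calls the function from the other namespace/class.
      h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
        acc[i] = mathlib::Scaler::apply(acc[i], factor);
      });
    });
  } // buffer destruction waits for the kernel and copies data back
#else
  // Host-only fallback (assumption: no SYCL headers available).
  for (auto &x : data) x = mathlib::Scaler::apply(x, factor);
#endif
  return data;
}
```

A call such as `scale_all({1.0f, 2.0f}, 2.0f)` exercises the pattern. Comparing the PTX generated for this kernel against the native CUDA version may show whether the call is inlined in both cases.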

[SYCL][CUDA] Optimal way of invoking device functions from SYCL kernel #6496

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[SYCL][CUDA] Optimal way of invoking device functions from SYCL kernel #6496

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions