Description
Is your feature request related to a problem? Please describe
In CUDA, a kernel functor can be copied to a CUDA symbol, which can be marked __const__. This enables the compiler to perform various optimisations including LICM & CSE. DPC++ does not currently support this feature; kernel arguments are always passed into global device memory. This means that significant optimisation opportunities are lost.
An example of the use of CUDA Symbols from Kokkos:
// Copy functor asynchronously from there to constant memory on the device
cudaMemcpyToSymbolAsync(kokkos_impl_cuda_constant_memory_buffer, staging,
sizeof(DriverType), 0, cudaMemcpyHostToDevice,
cudaStream_t(cuda_instance->m_stream));
// Invoke the driver function on the device
(base_t::
get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>();
We have observed up to 20% improvement in kernel performance from Kokkos' CUDA backend when using symbols, and expect similar improvements in Kokkos' SYCL backend.
Describe the solution you would like
It would be useful for a SYCL user to be able to mark specific kernel functions to have their arguments loaded into a CUDA symbol in constant memory. This feature could be enabled for the SYCL CUDA backend via changes to the Clang driver & SYCL Runtime, and the addition of a new kernel attribute.
Kernel Class
A new attribute on the kernel class (e.g. sycl::kernel_constant_mem
) would inform the Clang driver that the given kernel should read parameters from a CUDA symbol in the constant memory space as opposed to standard kernel parameters in the global memory space.
Driver
The driver should take kernel functions marked sycl::kernel_constant_mem
and replace e.g. ld.param
instructions with ld.const
instructions which point to offsets in a single constant memory symbol.
LLVM Pass
The LLVM pass which performs this conversion should occur as early as possible. Specifically, it should occur before optimisation passes which depend on the const-ness of the kernel parameters.
The driver is also responsible for allocating this single constant device symbol with a stable name which is known to the SYCL runtime.
An additional flag to the compiler (e.g. -fsycl-use-constant-symbols
) informs the compiler whether to perform this LLVM pass & symbol allocation.
SYCL Runtime
If the -fsycl-use-constant-symbols
flag was enabled, and the SYCL Runtime launches a kernel marked with sycl::kernel_constant_mem
, it should first copy the flattened functor to the constant CUDA symbol allocated by the driver. The runtime need not perform any modifications to the kernel, nor does it need to store duplicate 'constant' & 'non-constant' versions, because this is handled by the compiler.
Notes
The proposed solution involves a single constant memory allocation which consumes the entirety of the device's constant memory. When the runtime handles multiple sycl::kernel_constant_mem
kernels, the constant symbol is reused.