Description
Hello
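There seems to be a performance gap between the CUDA and SYCL versions of the scatter benchmark on an NVIDIA A100 GPU: with the same arguments, the SYCL kernels take roughly 9x to 25x longer than the CUDA kernels (timings below).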
I also tried migrating the CUDA code with SYCLomatic, but the translation was not successful.
https://github.com/zjin-lcf/HeCBench/tree/master/src/scatter-cuda
CUDA (12.5)
./main 10000000 100
INT32 scatter (mul, div, sum, min, max)
Average execution time of kernel: 609.347046 (us)
Average execution time of kernel: 513.615234 (us)
Average execution time of kernel: 224.066589 (us)
Average execution time of kernel: 224.341263 (us)
Average execution time of kernel: 224.259125 (us)
https://github.com/zjin-lcf/HeCBench/tree/master/src/scatter-sycl
SYCL (icpx 2025.0.0)
./main 10000000 100
INT32 scatter (mul, div, sum, min, max)
Average execution time of kernel: 5594.654785 (us)
Average execution time of kernel: 5526.372559 (us)
Average execution time of kernel: 5501.559570 (us)
Average execution time of kernel: 5502.131348 (us)
Average execution time of kernel: 5501.163086 (us)
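
To put a number on the gap, here is a small sketch that divides each SYCL time by the matching CUDA time. It assumes the five timing lines follow the order of the header (mul, div, sum, min, max), since the output does not label them. The names `OPS`, `CUDA_US`, `SYCL_US`, and `slowdowns` are made up for this illustration and are not part of the benchmark.

```python
# Kernel times in microseconds, copied from the two runs above.
# Assumption: the five timing lines follow the order of the header
# "mul, div, sum, min, max"; the output itself does not label them.
OPS = ["mul", "div", "sum", "min", "max"]
CUDA_US = [609.347046, 513.615234, 224.066589, 224.341263, 224.259125]
SYCL_US = [5594.654785, 5526.372559, 5501.559570, 5502.131348, 5501.163086]


def slowdowns(cuda_us, sycl_us):
    """Return the SYCL time divided by the CUDA time for each kernel."""
    return [s / c for c, s in zip(cuda_us, sycl_us)]


RATIOS = slowdowns(CUDA_US, SYCL_US)

if __name__ == "__main__":
    for op, c, s, r in zip(OPS, CUDA_US, SYCL_US, RATIOS):
        print(f"{op}: CUDA {c:.1f} us, SYCL {s:.1f} us, SYCL/CUDA = {r:.1f}x")
```

The SYCL times are nearly flat at about 5.5 ms for all five kernels, while the CUDA times range from about 0.22 ms to 0.61 ms, so the ratio is smallest for the first two kernels and largest for the last three.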