Description
This reproducer was created to investigate performance, with the goal of achieving higher bandwidth for the SYCL implementation of the LRN primitive on Nvidia GPUs.
The reproducer computes the LRN algorithm for forward propagation and measures the memory bandwidth; it then does the same for backward propagation.
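For context, a minimal scalar sketch of across-channel LRN forward propagation is shown below. The parameter names (`alpha`, `beta`, `k`, `size`) follow the usual LRN definition and an odd window size is assumed; the actual kernel in lrn_kernels.hpp may differ in layout and parameterization.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Scalar sketch of across-channel LRN forward for one spatial point:
//   dst[c] = src[c] * (k + (alpha/size) * sum_{i in window} src[i]^2)^(-beta)
// The window is centered on channel c and clamped to valid channels.
std::vector<float> lrn_forward(const std::vector<float>& src, int size,
                               float alpha, float beta, float k) {
    const int C = static_cast<int>(src.size());
    const int half = (size - 1) / 2; // assumes an odd window size
    std::vector<float> dst(C);
    for (int c = 0; c < C; ++c) {
        float sum = 0.f;
        for (int i = std::max(0, c - half); i <= std::min(C - 1, c + half); ++i)
            sum += src[i] * src[i];
        dst[c] = src[c] * std::pow(k + alpha / size * sum, -beta);
    }
    return dst;
}
```

The actual reproducer applies this per (n, d, h, w) point across a 5-D tensor, which is what drives the memory traffic being measured.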
Reproduce
For the reproducer code, refer to the attached setup.sh, lrn.cpp, and lrn_kernels.hpp.
- Go to the directory containing the reproducer files and run the script below to set up the environment:
source setup.sh
- To compile, run:
clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda lrn.cpp
- The above generates the executable. To see the measured bandwidth, run:
./a.out
Observed behaviour
For N = 2, C = 15, D = 10, H = 16, W = 16, size = 5, ndims = 5:
Propagation: Forward
Alg: LRN
Result:
Total time = 0.000588 s
Total bandwidth = 1.044898 Gb/s
For N = 2, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5:
Propagation: Backward
Alg: LRN
Result:
Total time = 0.688375 s
Total bandwidth = 8.925368 Gb/s
Expected behavior
The ideal behavior is to attain the maximum bandwidth (i.e., the CLPeak value for float16 on this device, 251.94 GB/s) for any input size.
For the current reproducer, a substantially higher bandwidth is expected, i.e., a minimum of 100 GB/s.
Environment
- OS: Ubuntu 22.04.1 LTS
- Target device and vendor: Nvidia Tesla T4
- DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e1)
- Dependencies version: Driver Version: 495.29.05, CUDA Version: 11.5
Additional context
Currently N, C, D, H, W, size, and ndims are hard-coded; they can be changed as needed.
The source code of the reproducer is attached below:
LRN_Reproducer.zip