
[SYCL] Poor Performance - Low bandwidth of SYCL kernel for Eltwise Primitive #8050

Closed
@AvijitBag

Describe the bug
This reproducer demonstrates the low memory bandwidth achieved by the SYCL implementation of the Eltwise primitive on Nvidia, with the goal of improving its performance.
For forward propagation, it computes the eltwise abs algorithm with post-ops and then calculates the memory bandwidth. For backward propagation, it computes the relu algorithm and calculates the memory bandwidth in the same way.
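
For readers without the attachments, the two computations can be sketched in plain C++ as below. This is only a reference for the arithmetic, not the SYCL kernels from the reproducer. It assumes the usual definitions: forward is dst = |src| + rhs (abs followed by a binary add post-op), and relu backward passes the incoming gradient through where src > 0 and scales it by alpha elsewhere. The function names and the `rhs` operand name are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward reference: eltwise abs followed by a binary add post-op.
// dst[i] = |src[i]| + rhs[i]
std::vector<float> abs_add_forward(const std::vector<float>& src,
                                   const std::vector<float>& rhs) {
    assert(src.size() == rhs.size());
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = std::fabs(src[i]) + rhs[i];
    }
    return dst;
}

// Backward reference: relu with negative slope alpha.
// diff_src[i] = diff_dst[i]          if src[i] > 0
//             = alpha * diff_dst[i]  otherwise
std::vector<float> relu_backward(const std::vector<float>& src,
                                 const std::vector<float>& diff_dst,
                                 float alpha) {
    assert(src.size() == diff_dst.size());
    std::vector<float> diff_src(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        diff_src[i] = (src[i] > 0.0f) ? diff_dst[i] : alpha * diff_dst[i];
    }
    return diff_src;
}
```

Both loops touch each element exactly once, so the kernels are expected to be limited by memory bandwidth rather than by arithmetic.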

Reproduce

For the reproducer code, refer to the attached files setup.sh, eltwise.cpp, eltwise_kernels.hpp and sycl_post_ops.hpp.

  1. Go to the directory containing the reproducer files and run the script below to set up the environment:
    source setup.sh

  2. To compile, run:
    clang++ -O2 -fsycl -fsycl-targets=nvptx64-nvidia-cuda eltwise.cpp

  3. The step above generates the executable a.out. To see the output bandwidth, run:
    ./a.out

Observed behaviour

For N = 16, C = 64, D = 1, H = 56, W = 56
Propagation : Forward
Alg : Abs
Post-ops: binary : Add

Result:
Total time = 0.004545 s
Total bandwidth = 19.783365 GB/s


For N = 50, C = 10, D = 10, H = 10, W = 10 and Alpha = 0.00, Beta = 4.00
Propagation : Backward
Alg : Relu

Result:
Total time = 0.000398 s
Total bandwidth = 15.075377 GB/s
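
The figures above are consistent with bandwidth computed as bytes moved divided by elapsed time, in units of 10^9 bytes per second. The sketch below shows that calculation. The helper names are hypothetical, and the bytes-per-element values are assumptions inferred from the reported numbers, not read from eltwise.cpp: 12 bytes per element (for example three float tensors) reproduces the backward result, and 28 bytes per element reproduces the forward result.

```cpp
#include <cassert>
#include <cstdint>

// Total number of elements in an N x C x D x H x W tensor.
std::int64_t element_count(std::int64_t N, std::int64_t C, std::int64_t D,
                           std::int64_t H, std::int64_t W) {
    return N * C * D * H * W;
}

// Effective bandwidth in GB/s, where 1 GB = 1e9 bytes.
// bytes_per_element is the total traffic per tensor element
// (all reads plus all writes); the caller has to supply it.
double bandwidth_gbps(std::int64_t elements, double bytes_per_element,
                      double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(elements) * bytes_per_element / seconds / 1e9;
}
```

With the backward shape, `element_count(50, 10, 10, 10, 10)` is 500000 elements, and `bandwidth_gbps(500000, 12.0, 0.000398)` gives about 15.075, matching the reported value. With the forward shape (3211264 elements), 28 bytes per element and 0.004545 s give about 19.783.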


Expected behavior
Ideally, the kernel should reach the maximum bandwidth of the device (i.e., the CLPeak value for Float16 = 251.94 GB/s) for any input size.
For the current reproducer, a bandwidth of at least 100 GB/s is expected.
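
To put the gap in numbers, the short sketch below relates the measured bandwidth to the peak and estimates the kernel time that the 100 GB/s target would imply. It assumes the amount of data moved stays the same, so time scales inversely with bandwidth. The function names are hypothetical.

```cpp
#include <cassert>

// Percentage of the peak bandwidth that a measurement reaches.
double percent_of_peak(double measured_gbps, double peak_gbps) {
    assert(peak_gbps > 0.0);
    return 100.0 * measured_gbps / peak_gbps;
}

// Kernel time needed to reach target_gbps, assuming the same number of
// bytes is moved as in the measured run (time scales as 1 / bandwidth).
double time_for_target(double measured_seconds, double measured_gbps,
                       double target_gbps) {
    assert(target_gbps > 0.0);
    return measured_seconds * measured_gbps / target_gbps;
}
```

With the observed values, the forward run reaches roughly 8% of the 251.94 peak and the backward run roughly 6%. Reaching 100 GB/s would require the forward kernel to finish in about 0.0009 s instead of 0.004545 s, and the backward kernel in about 0.00006 s instead of 0.000398 s.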

Environment

  • OS: Ubuntu 22.04.1 LTS
  • Target device and vendor: Nvidia, Tesla T4
  • DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e1)
  • Dependencies version: Driver Version: 495.29.05, CUDA Version: 11.5

Additional context
Note: The input dimensions for the primitives are hard-coded, and all tensor values are filled automatically.
Currently, the forward path implements the abs algorithm with a binary add post-op, and the backward path implements the relu algorithm.

The reproducer can be tested by switching the forward/backward flag and changing the dimension values (N, C, D, H, W) in eltwise.cpp, as illustrated below.

For forward, only the N, C, D, H and W values need to be changed, on lines 29-33 inside the forward function:
int64_t N = 16;
int64_t C = 64;
int64_t D = 1;
int64_t H = 56;
int64_t W = 56;

Similarly, for backward, only the N, C, D, H and W values need to be changed, on lines 162-166 inside the backward function:
int64_t N = 50;
int64_t C = 10;
int64_t D = 10;
int64_t H = 10;
int64_t W = 10;

A flag on line 273 selects forward or backward propagation; changing it switches from one to the other:

isForward = true;

Source Files

eltwise_reproducer.zip

Labels: bug (Something isn't working), cuda (CUDA back-end)
