Skip to content

[SYCL-MLIR] Add KernelDisjointSpecialization pass in pipeline #9187

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

whitneywhtsang
Copy link
Contributor

@whitneywhtsang whitneywhtsang commented Apr 24, 2023

Verified that with this PR and -D__SYCL_DISABLE_PARALLEL_FOR_RANGE_ROUNDING__, SYCL-MLIR is able to perform scalar replacement on reduction loop on the SYCL-Bench workloads identified before, even with the fix in #9055.

O3 Previously measured gains Only PR9187 PR9187+PR9055
2mm 12% 12% 12%
3mm 13% 12% 11%
covariance 50% 50% 0%
gemm 12% 12% 11%
gramschmidt 14% + 18% (LICM versioning) 33% 33%
syrk 5% 5% 5%

=> No performance regressions with just this PR.
=> Lost covariance 50% gain after specializing the function, because 2 of the 3 accessors actually overlap!

auto data = data_buffer.get_access<access::mode::read>(cgh);
auto symmat = symmat_buffer.get_access<access::mode::discard_write>(cgh);
auto symmat2 = symmat_buffer.get_access<access::mode::discard_write>(cgh);

Not sure why it is written that way, with the simple source code change below (which remove symmat2 and always use symmat), the 50% gain is recovered.

+++ b/polybench/covariance.cpp
@@ -110,7 +110,6 @@ public:
-                       auto symmat2 = symmat_buffer.get_access<access::mode::discard_write>(cgh);
@@ -122,7 +121,7 @@ public:
-                                       symmat2[{j2, j1}] = symmat[{j1, j2}];
+                                       symmat[{j2, j1}] = symmat[{j1, j2}];

If we want to get the gain without source code change, then we need to version by only checking if symmat and data overlap, which we need the context, we may want to perform loop versioning in detect reduction pass.
Notice the inner loop with the opportunity doesn't use symmat2:

for(size_t i = 1; i <= N_; i++)
  symmat[{j1, j2}] += data[{i, j1}] * data[{i, j2}];

Note: A number of KernelFusion test cases are moved to xfail, as accessors cannot be internalize when they are used by a ptrtoint operation. (#9188)

@etiotto
Copy link

etiotto commented Apr 25, 2023

"If we want to get the gain without source code change, then we need to version by only checking if symmat and data overlap, which we need the context, we may want to perform loop versioning in detect reduction pass."

Yes I think we need to version the reduction loop to get the opportunity.

Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com>
@whitneywhtsang whitneywhtsang force-pushed the kernel-disjoint-specialization branch from 52aff1d to 497e738 Compare April 25, 2023 16:52
Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com>
@whitneywhtsang
Copy link
Contributor Author

whitneywhtsang commented Apr 26, 2023

After allowing functions indirectly called by GPU kernel in #9194, there are more KernelFusion tests failed due to unable to internalized. There are a total of 15 KernelFusion test cases moved to xfail.

@whitneywhtsang whitneywhtsang merged commit ee50932 into intel:sycl-mlir Apr 26, 2023
@whitneywhtsang whitneywhtsang deleted the kernel-disjoint-specialization branch April 26, 2023 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sycl-mlir Pull requests or issues for sycl-mlir branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants