Skip to content

[SYCL] Always inline kernel lambda operator in entry point #6977

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 12, 2022

Conversation

npmiller
Copy link
Contributor

@npmiller npmiller commented Oct 6, 2022

This patch marks the operator() of the kernel lambda as always_inline so that it gets inlined into the kernel entry point.

Kernel entry point are functions that take the captured variables as parameters, create a lambda object from that, setup the index structs and then call operator() on the lambda. Inlining the operator into the entry point should be beneficial in most cases as it allows the compiler to optimize out the lambda creation, which can be very important for kernels capturing a lot of variables.

In a lot of cases the inliner will already do it, but when it doesn't it can lead to very confusing performance implications since the kernel entry point isn't directly visible to users.

Because the always inliner runs very early this patch broke a number of lit tests that were checking for the operator function, I believe I've managed to fix most of them while maintaining the spirit of the test, but some reviews and/or suggestions on these would be appreciated.

@npmiller npmiller requested a review from a team as a code owner October 6, 2022 09:31
Copy link
Contributor

@Fznamznon Fznamznon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch overall looks ok, but I'm not an expert in optimizations matter. We actually have a bunch of early optimizations enabled for device code in clang, they should be able to deal with the inline, don't they?

@premanandrao
Copy link
Contributor

I too am okay with the patch. If we want to examine the kernel body function in the future for some tests, is there anything you would suggest as a command-line option to get (close to) the previous behavior?

@npmiller
Copy link
Contributor Author

npmiller commented Oct 6, 2022

We actually have a bunch of early optimizations enabled for device code in clang, they should be able to deal with the inline, don't they?

Yes, and they do in a lot of cases but not always, it just comes down to the regular inlining heuristics.

The specific issue I ran into was a kernel that was capturing a lot of variables and just calling a device function with all these variables as parameters, which is a very common pattern in SYCL code.

And so because of the number of parameters the creation of the lambda object was pretty expensive, and the function call itself was also pretty expensive, and in this scenario inlining everything gave better performance.

However the device function being pretty large the inliner decided not to inline it (I'm also currently looking into tweaking the inlining heuristics for that), but as a workaround I simply added the always_inline attribute on the device function to force its inlining. But then what happened is that the device function got inlined into the operator() which meant that the operator was now as large as the device function, and so the inliner decided to not inline it. And so with or without the always_inline attribute on my device function I would get the exact same performance because it would never get inlined all the way into the kernel entry point.

If we want to examine the kernel body function in the future for some tests, is there anything you would suggest as a command-line option to get (close to) the previous behavior?

As far as I know there's no way to disable the always inliner, but that's a really good point, I could add a flag to disable this, and that way I could also just add it to existing lit tests that need it rather than having to change them.

@premanandrao
Copy link
Contributor

As far as I know there's no way to disable the always inliner, but that's a really good point, I could add a flag to disable this, and that way I could also just add it to existing lit tests that need it rather than having to change them.

Thanks, would appreciate that.

@npmiller npmiller requested review from a team as code owners October 11, 2022 16:53
@npmiller
Copy link
Contributor Author

I've updated the patch as follows:

  • Add -f[no]-sycl-force-inline-kernel-lambda option enabled by default
  • Revert modifications to the lit tests and just fix them by using the [no] variant of the option
  • Added test for the option

In addition I've also investigated the ESIMD failures from the CI and it seems inlining that early causes issues with the ESIMD attributes. Currently the attribute is propagated to the kernel in the IR passes, I've attempted to propagate it in SemaSYCL so that it's handled before the inlining but that causes issues with the ESIMD validator. In addition it seems that the ESIMD IR passes already force inline the entire kernel call tree so I've decided to simply not take into account the new flag for ESIMD which should fix all the issues from the CI.

@@ -107,6 +107,12 @@ and not recommended to use in production environment.
* nd_item class get_global_id()/get_global_linear_id() member functions
Enabled by default.

**`-f[no]sycl-force-inline-kernel-lambda`**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**`-f[no]sycl-force-inline-kernel-lambda`**
**`-f[no-]sycl-force-inline-kernel-lambda`**

This patch marks the `operator()` of the kernel lambda as
`always_inline` so that it gets inlined into the kernel entry point.

Kernel entry point are functions that take the captured variables as
parameters, create a lambda object from that, setup the index structs
and then call `operator()` on the lambda. Inlining the operator into the
entry point should be beneficial in most cases as it allows the compiler
to optimize out the lambda creation, which can be very important for
kernels capturing a lot of variables.

In a lot of cases the inliner will already do it, but when it doesn't it
can lead to very confusing performance implications since the kernel
entry point isn't directly visible to users.
Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYCL docs LGTM!

@bader bader requested review from mdtoguchi and Fznamznon October 12, 2022 09:55
@pvchupin pvchupin merged commit b91b732 into intel:sycl Oct 12, 2022
@pvchupin
Copy link
Contributor

@npmiller, please look into post commit issue on windows: https://github.com/intel/llvm/actions/runs/3236643294/jobs/5302734155

Failed Tests (1):
  Clang :: SemaSYCL/sycl-force-inline-kernel-lambda.cpp

steffenlarsen pushed a commit that referenced this pull request Oct 13, 2022
Without the target flag it was mangling the names differently on
windows, and so breaking the check, simply always generate IR for SPIR
target instead.

This patch fixes the post-commit issue on Windows reported after:
* #6977
@premanandrao
Copy link
Contributor

@npmiller, we have internal reports that this change perceivably affects debugging at -O0 levels. What do you think of disabling this inlining at -O0?

@npmiller
Copy link
Contributor Author

@npmiller, we have internal reports that this change perceivably affects debugging at -O0 levels. What do you think of disabling this inlining at -O0?

@premanandrao That seems reasonable, disabling it at -O0 shouldn't cause any issues

pvchupin pushed a commit that referenced this pull request Dec 2, 2022
PR #6977 enabled always inlining kernel lambda operators.
This PR disables this at -O0 as it was leading to a poor
debugging experience.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants