Skip to content

[SYCL] Renew and sync with SYCL2020 the implementation of queue::para… #4352

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

v-klochkov
Copy link
Contributor

…llel_for()

Signed-off-by: Vyacheslav N Klochkov vyacheslav.n.klochkov@intel.com

@v-klochkov v-klochkov force-pushed the public_vklochkov_reduction_queue_N_reds branch 4 times, most recently from 59d3253 to 654a003 Compare August 18, 2021 19:55
@v-klochkov
Copy link
Contributor Author

This patch is a DRAFT now because it must NOT be merged as is. It is supposed to be combined with XPTI instrumentation changes handling the code-location that is missing (with this patch) in case of call of queue::parallel_for().

@v-klochkov v-klochkov force-pushed the public_vklochkov_reduction_queue_N_reds branch from 654a003 to 9587d2b Compare September 8, 2021 14:08
@github-actions github-actions bot added the Stale label Mar 8, 2022
@v-klochkov v-klochkov force-pushed the public_vklochkov_reduction_queue_N_reds branch from 9587d2b to a83c328 Compare March 22, 2022 19:35
@github-actions github-actions bot closed this Apr 22, 2022
@v-klochkov v-klochkov reopened this Apr 22, 2022
@v-klochkov v-klochkov force-pushed the public_vklochkov_reduction_queue_N_reds branch from a83c328 to 9d764bc Compare April 22, 2022 03:24
…llel_for()

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
@v-klochkov v-klochkov force-pushed the public_vklochkov_reduction_queue_N_reds branch from 9d764bc to 5106950 Compare April 22, 2022 04:13
@v-klochkov v-klochkov marked this pull request as ready for review April 22, 2022 04:19
@v-klochkov v-klochkov requested a review from a team as a code owner April 22, 2022 04:19
My previous commit in this PR removed it and caused 3 LIT test fails.
I decided to restore those shortcuts accepting 'offset' to to not break
backward compatibility.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
@v-klochkov
Copy link
Contributor Author

Testing is passed except one build failed with unrelated linkage error:
"pi_level_zero.cpp:(.text.piextProgramGetNativeHandle+0x54): undefined reference to `pthread_rwlock_rdlock'"

@tovinkere - please review, thank you.

@v-klochkov
Copy link
Contributor Author

@tovinkere - from XPTI + performance point of view, would it worth having additional code-duplication and instead of having Variant1 (as in this PR now) have Variant2 with additional code-duplication:

Current/Previous code:
template <typename KernelName = detail::auto_name, typename KernelType> event parallel_for(range<1> NumWorkItems, _KERNELFUNCPARAM(KernelFunc) _CODELOCPARAM(&CodeLoc));

Variant1 (proposed in the current PR):
template <typename KernelName = detail::auto_name, typename... RestT> event parallel_for(range<1> Range, RestT &&...Rest);

Variant2 (with code-duplication. parallel_for without reduction still has CodeLoc parameter):

 template <typename KernelName = detail::auto_name, typename KernelType>
  event parallel_for(range<1> NumWorkItems,
                     _KERNELFUNCPARAM(KernelFunc) _CODELOCPARAM(&CodeLoc));

  template <typename KernelName = detail::auto_name, typename... RestT>
  std::enable_if_t<(sizeof...(RestT) > 1 && ext::oneapi::detail::AreAllButLastReductions<RestT...>::value), event>
  parallel_for(range<1> Range, RestT &&...Rest);

If having explicit CodeLoc is noticably better/faster, then Variant2 makes sense.
Please comment/answer this question.

steffenlarsen
steffenlarsen previously approved these changes Apr 28, 2022
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tovinkere
Copy link
Contributor

@tovinkere - from XPTI + performance point of view, would it worth having additional code-duplication and instead of having Variant1 (as in this PR now) have Variant2 with additional code-duplication:

Current/Previous code: template <typename KernelName = detail::auto_name, typename KernelType> event parallel_for(range<1> NumWorkItems, _KERNELFUNCPARAM(KernelFunc) _CODELOCPARAM(&CodeLoc));

Variant1 (proposed in the current PR): template <typename KernelName = detail::auto_name, typename... RestT> event parallel_for(range<1> Range, RestT &&...Rest);

Variant2 (with code-duplication. parallel_for without reduction still has CodeLoc parameter):

 template <typename KernelName = detail::auto_name, typename KernelType>
  event parallel_for(range<1> NumWorkItems,
                     _KERNELFUNCPARAM(KernelFunc) _CODELOCPARAM(&CodeLoc));

  template <typename KernelName = detail::auto_name, typename... RestT>
  std::enable_if_t<(sizeof...(RestT) > 1 && ext::oneapi::detail::AreAllButLastReductions<RestT...>::value), event>
  parallel_for(range<1> Range, RestT &&...Rest);

If having explicit CodeLoc is noticably better/faster, then Variant2 makes sense. Please comment/answer this question.

Explicit CodeLoc is always faster as the capture happens at compile time. I haven't measured the cost of retrieving it from KernelInfo object, so I cannot comment on the overall perf difference.

std::enable_if_t<
ext::oneapi::detail::AreAllButLastReductions<RestT...>::value, event>
parallel_for(nd_range<Dims> Range, RestT &&...Rest) {
const detail::code_location CodeLoc = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are sending a dummy code location for now, so can you please add a comment here that says "Actual code location needs to be captured from KernelInfo object" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the comment in this additional commit: fd064e2
In the process of adding the comment, I noticed some redundancy in parallel_for_impl accepting 'offset' (deprecated shortcuts)

The 3 deprecated shortcuts accepting 'offset' actually never get
code-location argument, thus removed code-loc from arguments there.

Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
_KERNELFUNCPARAM(KernelFunc) _CODELOCPARAM(&CodeLoc)) {
_CODELOCARG(&CodeLoc);
__SYCL2020_DEPRECATED("offsets are deprecated in SYCL 2020")
event parallel_for_impl(range<Dims> Range, id<Dims> WorkItemOffset,
Copy link
Contributor

@againull againull Apr 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slava, did you intentionally change parallel_for to parallel_for_impl here?
It looks like you wanted to deprecate this API but still leave it as parallel_for because someone might still be using it.

Copy link
Contributor Author

@v-klochkov v-klochkov Apr 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did that intentionally. The main entry point for this args-configuration is the func at the line 761.
It would be ambiguous call/functions error without renaming this code here to parallel_for_impl.

Because the entry point is at 761 that does not accept code-loc in the argument, I had to remove CODELOCPARAM(&CodeLoc) at L896

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

someone might still be using it

May be someone is using it, maybe not, but for sure we have 3 LIT test using parallel_for with 'offset'.
Fails on those tests made me rename this function to parallel_for_impl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thank you!

Copy link
Contributor

@tovinkere tovinkere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@againull againull merged commit e59fb89 into intel:sycl Apr 29, 2022
@v-klochkov v-klochkov deleted the public_vklochkov_reduction_queue_N_reds branch February 24, 2023 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants