-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] Renew and sync with SYCL2020 the implementation of queue::para… #4352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Renew and sync with SYCL2020 the implementation of queue::para… #4352
Conversation
59d3253
to
654a003
Compare
This patch is a DRAFT now because it must NOT be merged as is. It is supposed to be combined with XPTI instrumentation changes handling the code-location that is missing (with this patch) in case of call of queue::parallel_for(). |
654a003
to
9587d2b
Compare
9587d2b
to
a83c328
Compare
a83c328
to
9d764bc
Compare
…llel_for() Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
9d764bc
to
5106950
Compare
My previous commit in this PR removed it and caused 3 LIT test fails. I decided to restore those shortcuts accepting 'offset' to to not break backward compatibility. Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
…reduction_queue_N_reds
Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
…reduction_queue_N_reds
…reduction_queue_N_reds
Testing is passed except one build failed with unrelated linkage error: @tovinkere - please review, thank you. |
…reduction_queue_N_reds
@tovinkere - from XPTI + performance point of view, would it worth having additional code-duplication and instead of having Variant1 (as in this PR now) have Variant2 with additional code-duplication: Current/Previous code: Variant1 (proposed in the current PR): Variant2 (with code-duplication. parallel_for without reduction still has CodeLoc parameter):
If having explicit CodeLoc is noticably better/faster, then Variant2 makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Explicit CodeLoc is always faster as the capture happens at compile time. I haven't measured the cost of retrieving it from KernelInfo object, so I cannot comment on the overall perf difference. |
std::enable_if_t< | ||
ext::oneapi::detail::AreAllButLastReductions<RestT...>::value, event> | ||
parallel_for(nd_range<Dims> Range, RestT &&...Rest) { | ||
const detail::code_location CodeLoc = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are sending a dummy code location for now, so can you please add a comment here that says "Actual code location needs to be captured from KernelInfo object" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the comment in this additional commit: fd064e2
In the process of adding the comment, I noticed some redundancy in parallel_for_impl accepting 'offset' (deprecated shortcuts)
…reduction_queue_N_reds
The 3 deprecated shortcuts accepting 'offset' actually never get code-location argument, thus removed code-loc from arguments there. Signed-off-by: Vyacheslav N Klochkov <vyacheslav.n.klochkov@intel.com>
_KERNELFUNCPARAM(KernelFunc) _CODELOCPARAM(&CodeLoc)) { | ||
_CODELOCARG(&CodeLoc); | ||
__SYCL2020_DEPRECATED("offsets are deprecated in SYCL 2020") | ||
event parallel_for_impl(range<Dims> Range, id<Dims> WorkItemOffset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slava, did you intentionally change parallel_for to parallel_for_impl here?
It looks like you wanted to deprecate this API but still leave it as parallel_for because someone might still be using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did that intentionally. The main entry point for this args-configuration is the func at the line 761.
It would be ambiguous call/functions error without renaming this code here to parallel_for_impl.
Because the entry point is at 761 that does not accept code-loc in the argument, I had to remove CODELOCPARAM(&CodeLoc) at L896
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
someone might still be using it
May be someone is using it, maybe not, but for sure we have 3 LIT test using parallel_for with 'offset'.
Fails on those tests made me rename this function to parallel_for_impl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…llel_for()
Signed-off-by: Vyacheslav N Klochkov vyacheslav.n.klochkov@intel.com