[SYCL][DOC][Group sort] Let memory_required methods accept memory_scope::device as a parameter #17238
Conversation
```diff
 [=](sycl::nd_item<1> id) {
   auto ptr = acc.get_pointer() + id.get_group(0) * n;

   my_sycl::joint_sort(
     // create group helper using deduction guides
     my_sycl::group_with_scratchpad(
       id.get_group(),
-      sycl::span{scratch.get_pointer(), temp_memory_size}
+      sycl::span{temp, temp_memory_size}
```
For reviewers: we pass only a global pointer to the temporary memory. We don't need to split it between work-groups ourselves (e.g. `temp + id.get_group(0) * n`); the underlying backend has to understand how to split the temporary memory between work-groups.
```diff
 [=](sycl::nd_item<1> id) {
   auto ptr = acc.get_pointer() + id.get_group(0) * n;

   my_sycl::joint_sort(
     // create group helper using deduction guides
     my_sycl::group_with_scratchpad(
       id.get_group(),
-      sycl::span{scratch.get_pointer(), temp_memory_size}
+      sycl::span{temp, temp_memory_size}
```
I have a concern regarding how `joint_sort` can distinguish whether the temporary memory size was computed using `memory_scope::device` or `memory_scope::work_group`.
So, if some implementation-defined functionality can be developed to recover this information, that's fine. If not, we need to add extra parameters to the sorting functions called within the kernel.
Ping @intel/dpcpp-specification-reviewers
> If `scope == sycl::memory_scope::device`,
> `range_size` is the size of the global range for the `sycl::nd_range`
> that was used to run the kernel;
> if `scope == sycl::memory_scope::work_group`,
I think this comment really applies to the whole pull request, but I'm commenting here so it can be threaded.
The way that I thought this worked was that if a developer was using a work-group to sort one array per work-group (i.e., the first argument to `joint_sort` is a `sycl::group`), then they should use `memory_scope::work_group`; and if a developer was using a sub-group to sort one array per sub-group (i.e., the first argument to `joint_sort` is a `sycl::sub_group`), then they should use `memory_scope::sub_group`.
The logical extension of that would be that if a developer wants to use the entire device to sort a single array, then the first argument to `sycl::joint_sort` should be a `root_group` and they should use `memory_scope::device`.
But what you are introducing here seems to be a way for developers to query the total amount of memory required by the device when you run independent sort calls within each work-group or sub-group. I can see the appeal of doing that (because the device might be able to re-use memory across work-groups), but if I'm right about what you're trying to do here, then I don't think it's a good idea to try to retrofit it into your existing interface.
I think the fact that you're changing what the `memory_scope` argument means is what leads to your concern in https://github.com/intel/llvm/pull/17238/files#r1975245222 -- there is no way for `joint_sort` to reason about how the user constructed the scratchpad. I think if you want to expose this functionality, you need to redesign the scratchpad in a way that allows the user to express exactly what they want to use the scratchpad for, and that allows the implementation to track that information until the point where the scratchpad is actually used.
@Pennycook, thanks for your comment! Actually, there is no direct connection between where the algorithm executes and how its temporary memory is allocated. I see the same in SYCL in general, e.g. users can call `joint_reduce` on data allocated in local or global memory: memory allocation and execution are separated. So it was just a coincidence for us that memory and execution scopes intersected.
Regarding revising the scratchpad: I'd also support that, since it's natural for the memory structure to carry the memory scope of the underlying pointer. I'll think about how to organize it. I'm also considering having the memory scope as a template parameter for sorters. It could be used as follows:
```cpp
using my_joint_sorter = default_sorters::joint_sorter<memory_scope::device>;
...
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// device level
joint_sort(group, first, last, my_joint_sorter{temp});
...
```
In this case, `joint_sort` knows the memory scope of `temp`.
Any feedback is welcome.
> In this case `joint_sort` knows the memory scope of `temp`. Any feedback is welcome.
This seems a little confusing to me. I'd read `joint_sorter<memory_scope::device>` as being a sorter that sorts at device scope, not one that allocates at device scope -- basically the same problem as before. This is going to get even more confusing when we introduce user-defined groups, because there might not even be a one-to-one mapping between groups and memory scopes.
I think to make this really explicit, you might need to do something like require developers to specify both the `execution_scope`/`Group` type and the `memory_scope` as separate arguments. The `Group` type might be more future-proof:
```cpp
// Define a sorter that sorts per work-group, using a single allocation at device memory scope
using my_joint_sorter = default_sorters::joint_sorter<sycl::work_group, memory_scope::device>;

// Get the amount of memory required for the single allocation at device memory scope
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// "group" here must be a sycl::work_group because that's what the sorter expects
joint_sort(group, first, last, my_joint_sorter{temp});
```
vs
```cpp
// Define a sorter that sorts per work-group, using separate memory per work-group
using my_joint_sorter = default_sorters::joint_sorter<sycl::work_group, memory_scope::work_group>;

// Get the amount of memory required for each work-group
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// "group" here must be a sycl::work_group because that's what the sorter expects
joint_sort(group, first, last, my_joint_sorter{temp});
```
vs
```cpp
// Define a sorter that sorts for the whole device, using a single allocation
using my_joint_sorter = default_sorters::joint_sorter<sycl::root_group, memory_scope::device>;

// Get the amount of memory required
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// "group" here must be a sycl::root_group because that's what the sorter expects
joint_sort(group, first, last, my_joint_sorter{temp});
```
It was found that some hardware backends can report a required memory size for the whole device that is much smaller than the simple `num_groups * memory_for_each_work_group`. So, after discussion, it was suggested to add the possibility of using `memory_scope::device` for `memory_required` of the default sorters.

Signed-off-by: Fedorov, Andrey <andrey.fedorov@intel.com>