Skip to content

[SYCL][DOC][Group sort] Let memory_required methods accept memory_scope::device as a parameter #17238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: sycl
Choose a base branch
from

Conversation

andreyfe1
Copy link
Contributor

It was found that some hardware backends can return the memory size for the whole device much less than just simple num_groups * memory_for_each_work_group.
So, after discussion it was suggested to add a possibility to have memory_scope::device for memory_required of default sorters.

Signed-off-by: Fedorov, Andrey andrey.fedorov@intel.com

@andreyfe1 andreyfe1 requested a review from a team as a code owner February 28, 2025 10:43
[=](sycl::nd_item<1> id) {
auto ptr = acc.get_pointer() + id.get_group(0) * n;

my_sycl::joint_sort(
// create group helper using deduction guides
my_sycl::group_with_scratchpad(
id.get_group(),
sycl::span{scratch.get_pointer(), temp_memory_size}
sycl::span{temp, temp_memory_size}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reviewers.
We pass only a global pointer to the temporary memory. We don't need to split it into work-groups, e.g. temp + id.get_group(0) * n. Underlying backend has to understand how to split the temporary memory between work-groups

[=](sycl::nd_item<1> id) {
auto ptr = acc.get_pointer() + id.get_group(0) * n;

my_sycl::joint_sort(
// create group helper using deduction guides
my_sycl::group_with_scratchpad(
id.get_group(),
sycl::span{scratch.get_pointer(), temp_memory_size}
sycl::span{temp, temp_memory_size}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a concern regarding how joint_sort can distinguish if the temporary memory size was given using memory_scope::device or memory_scope::work_group.
So, if some implementation defined functionality can be developed to grab this information, it's fine. If not, we need to add some extra parameters for sorting functions calling within the kernel

@andreyfe1
Copy link
Contributor Author

Ping @intel/dpcpp-specification-reviewers

Comment on lines +703 to +706
If `scope == sycl::memory_scope::device`,
`range_size` is the size of the global range for `sycl::nd_range`
that was used to run the kernel;
if `scope == sycl::memory_scope::work_group`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment really applies to the whole pull request, but I'm commenting here so it can be threaded.

The way that I thought this worked was that if a developer was using a work-group to sort one array per work-group (i.e., the first argument to joint_sort is a sycl::group) then they should use memory_scope::work_group, and that if a developer was using a sub-group to sort one array per sub-group (i.e., the first argument to joint_sort is a sycl::sub_group) then they should use memory_scope::sub_group.

The logical extension of that would be that if a developer wants to use the entire device to sort a single array, then the first argument to sycl::joint_sort should be a root_group and they should use memory_scope::device.

But what you are introducing here seems to be a way for developers to query the total amount of memory required by the device when you run independent sort calls within each work-group or sub-group. I can see the appeal of doing that (because the device might be able to re-use memory across work-groups) but if I'm right about what you're trying to do here than I don't think it's a good idea to try and retrofit it into your existing interface.

I think the fact that you're changing what the memory_scope argument means is what leads to your concern in https://github.com/intel/llvm/pull/17238/files#r1975245222 -- there is no way for joint_sort to reason about how the user constructed the scratchpad. I think if you want to expose this functionality, you need to redesign the scratchpad in a way that allows the user to express exactly what they want to use the scratchpad for, and that allows the implementation to track that information until the point where the scratchpad is actually used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pennycook, thanks for your comment! Actually there is no direct connection between where to execute the algo and how temporary memory is allocated. I also see it in SYCL in general, e.g. users can call joint_reduce on data allocated with local or global memory: memory allocation and execution are separated. So, it was just a coincidence for us that memory and execution scopes were intersected.

Regarding revising scratchpad. I'd also support that since it's natural for the memory structure to handle the memory scope of underlined pointer. I'll think how to organize it. I'm also thinking about having the memory scope as a template parameter for sorters. It can be used as following:

using my_joint_sorter = default_sorters::joint_sorter<memory_scope::device>;
...
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// device level
joint_sort(group, first, last, my_joint_sorter{temp});
...

In this case joint_sort knows the memory scope of temp.
Any feedback is welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using my_joint_sorter = default_sorters::joint_sorter<memory_scope::device>;
...
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// device level
joint_sort(group, first, last, my_joint_sorter{temp});
...

In this case joint_sort knows the memory scope of temp. Any feedback is welcome.

This seems a little confusing to me. I'd read joint_sorter<memory_scope::device> as being a sorter that sorts at device-scope, not one that allocates at device-scope -- basically the same problem as before. This is going to get even more confusing when we introduce user-defined groups, because there might not even be a one-to-one mapping between groups and memory scopes.

I think to make this really explicit, you might need to do something like require developers to specify both the execution_scope/Group type and the memory_scope as separate arguments. Group type might be more future-proof:

// Define a sorter that sorts per work-group, using a single allocation at device memory scope
using my_joint_sorter = default_sorters::joint_sorter<sycl::work_group, memory_scope::device>;

// Get the amount of memory required for the single allocation at device memory scope
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// "group" here must be a sycl::work_group because that's what the sorter expects
joint_sort(group, first, last, my_joint_sorter{temp});

vs

// Define a sorter that sorts per work-group, using separate memory per work-group
using my_joint_sorter = default_sorters::joint_sorter<sycl::work_group, memory_scope::work_group>;

// Get the amount of memory required for each work-group
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// "group" here must be a sycl::work_group because that's what the sorter expects
joint_sort(group, first, last, my_joint_sorter{temp});

vs

// Define a sorter that sorts for the whole device, using a single allocation
using my_joint_sorter = default_sorters::joint_sorter<sycl::root_group, memory_scope::device>;

// Get the amount of memory required
auto temp_memory_size = my_joint_sorter::memory_required(d, n_elements);
...
// "group" here must be a sycl::root_group because that's what the sorter expects
joint_sort(group, first, last, my_joint_sorter{temp});

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants