-
Notifications
You must be signed in to change notification settings - Fork 787
[DOC] Extend group sort algorithms with interfaces to scratchpad memory #3989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Fedorov, Andrey <andrey.fedorov@intel.com>
Questions to discuss:
|
Signed-off-by: Fedorov, Andrey <andrey.fedorov@intel.com>
Let's also think about names for memory helper. Can we imagine better names? Inviting @Pennycook, @gmlueck, @akukanov to the discussion of the PR |
I'm still in favor of a policy-like object, because I think it's more aligned with the executors proposal. Looking forward, I'd expect all C++ algorithms (including SYCL algorithms) to accept a single argument encapsulating all information about the resources that an algorithm can use for execution. A policy-like object also scales better in terms of the number of overloads and cognitive burden: if it turns out that we need to pass something else to the algorithm as well (e.g. that the group has a known fixed size and that it can use some scratch space) we can modify the policy rather than adding more arguments to the algorithm.
UB seems reasonable to me. We could add a non-normative note saying that this may change in future revisions of the specification, if you want to keep the discussion going. I don't think we can really have good error behavior unless the core SYCL specification introduces a mechanism to handle run-time errors on the device.
I think something like template</*unspecified*/>
static std::size_t memory_required(std::size_t range); // memory required for joint_sort on range
template</*unspecified*/>
static std::size_t memory_required(Group g, sycl::range<D> local_range); // memory required for sort_over_group on group with specified local range |
@Pennycook, thanks for the opinion. I see your point. Seems, it's better to speak about executor-like object, not policy-like as an alternative for adding arguments to functions. From my point of view policy is often light-weight object that doesn't contain everything that algorithms need. I like the idea with overloading for template</*unspecified*/>
static std::size_t memory_required(std::size_t range); // memory required for joint_sort on range
template</*unspecified*/>
static std::size_t memory_required(sycl::range<dim> local_range); // memory required for sort_over_group on groups with specified local range Just to consider other names for function to have a choice: |
That's a good point. I'll say executor-like going forward.
Ah, I'd forgotten about that. I think we can't just remove the argument completely, unfortunately -- some implementations may need the user to provide some additional memory for work-groups and sub-groups (or maybe even for other group types introduced by other extensions). Saying this now, I realize that we might need the group argument for Would it make sense to add the group type as a template argument? template<typename Group, /*unspecified*/>
static std::size_t memory_required(std::size_t range); // memory required for joint_sort on range
template<typename Group, /*unspecified*/>
static std::size_t memory_required(sycl::range<dim> local_range); // memory required for sort_over_group on groups with specified local range |
It makes sense. Implementation for template<typename Group, /*unspecified*/>
static std::size_t memory_required(Group, std::size_t range); // memory required for joint_sort on range
template<typename Group, /*unspecified*/>
static std::size_t memory_required(Group, sycl::range<dim> local_range); // memory required for sort_over_group on groups with specified local range Thus, we can specify it as following template</*unspecified*/>
static std::size_t memory_required(sycl::group, std::size_t range); // for sycl::group
template</*unspecified*/>
static std::size_t memory_required(sycl::sub_group, std::size_t range); // for sycl::sub_group |
You're right. Sorry, I always forget that. Doesn't what you've written there require I can see two solutions, neither of which is perfect:
|
I see, it's the same trap with |
I think using One way to interpret the The designers of future extensions would then have two options when introducing a new group type:
FWIW, I think 1. is cleaner, but we don't have to solve that problem here. All that said... Are there any other arguments to |
As I checked. no. We have difference by |
Ok, great. Then I think |
Great! I'll try to make a prototype for executor-like objects then we'll see how it looks like. Any suggestions for the name of a type are appreciated |
The only suggestion I have is to use |
Added "executor-like" object to the Spec and modified functions for temporary memory calculation. Please, take a look how it looks like now. I've added group executors only for algorithms' overloads that doesn't have Sorter parameter since we can pass memory directly to Sorter objects. |
Having the exact number and types of parameters for The problem I see with standalone memory helper functions is that those are proposed specifically for sorting, which seemingly prevents using/overloading those to query memory requirements of other algorithms. The approach with executor-like objects is preferred on the ground of extensibility for other algorithms, so the memory helper functions should be designed with at least the same degree of extensibility. |
@akukanov, thanks for a reminder about portability across implementations. However, it's quite difficult to realize how default sorting can be implemented. I'll think about minimal information.
I thought about it. I think names of functions should reflect the algorithm name and should be included into the name, e.g. |
Regarding minimal information we need for memory helpers of default sorter. static std::size_t memory_required(sycl::memory_scope scope, std::size_t range_size);
template<int dimensions = 1>
static std::size_t memory_required(sycl::memory_scope scope, sycl::range<dimensions> local_range);
|
It's better, but I would then go with the names that start exactly the same as the algorithm function names, i.e. Also, do we really need external memory for |
Unfortunately, yes. The thing is the size of local range for Regarding names. Having full algorithm name ( size_t sort_size = sort_over_group_memory_required(10);
size_t reduce_size = reduce_over_group_memory_required(10); The difference is just in 1st word of 5. @Pennycook, I'd like to hear your opinion on function names. |
Adding the name of the algorithm to the function name seems unnecessary to me. I might be confused, but I thought that the idea was these functions were already part of a class like I was imagining that a user would write something like this, for sorting: size_t scratch_size = default_sorter<int>::memory_required(memory_scope::work_group, work_group_size);
auto scratch = malloc_device(scratch_size, q);
...
auto exec = group_executor(wg, span(scratch, scratch_size));
auto s = sort_over_group(exec, x, default_sorter<int>()); ...and for reducing (NB: "reducer" is already used in the spec, so we'd need a different name): size_t scratch_size = default_reducer<int>::memory_required(memory_scope::work_group, work_group_size);
auto scratch = malloc_device(scratch_size, q);
...
auto exec = group_executor(wg, span(scratch, scratch_size));
auto s = reduce_over_group(exec, x, plus<>(), default_reducer<int>()); I like the overload solution for Would it make sense to attach the |
For sorters, yes. But what if users want to get memory size for sorting functions without sorters? There are standalone functions. See examples: https://github.com/intel/llvm/pull/3989/files#diff-5f58a59c39f7c062697cd495199320cc1b8f9d6823834cf61ba37fae5347feb6R484 |
Oh, I see. Sorry, I'd missed that. I assumed that calling the |
Won't it be confusing if we recommend to use |
I think some of this can be solved by documentation. Some of the C++ algorithm overloads are already defined to behave as-if they have default arguments (for example, |
Signed-off-by: Fedorov, Andrey <andrey.fedorov@intel.com>
After the discussion with @rarutyun and @Pennycook we agreed on following:
I'll make corresponding changes soon. |
Signed-off-by: Fedorov, Andrey <andrey.fedorov@intel.com>
Signed-off-by: Fedorov, Andrey <andrey.fedorov@intel.com>
Moved |
sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc
Outdated
Show resolved
Hide resolved
It was found that local memory is not enough to process a lot of elements by one work group. So, it's better to let users choose which memory they can use: local or global. |
Seems, interfaces are not changing to let users provide global memory |
Recently @akukanov during the offline discussion mentioned that it's better to not associate class, which encapsulates group and memory, with C++ executors. Proposal for C++ executors is not finalized and potentially it can be another thing that we think about it now. I'm trying to imagine a name for the concept and a class. Currently it's "Group helpers". I'm not sure it's a good name. So, feel free to suggest a better name. Some suggestions: |
I'm not opposed to calling the class something else, but I think we should still signal in the extension that we're aware of executors (perhaps by linking to https://wg21.link/p0443) and that our class is trying to achieve something similar. Once executors are finalized, it's highly likely that we'd modify the class introduced here to model the concept or at least try to get very close. We've talked about how this class might evolve to support additional information beyond just memory, so I think we should still refer to the template argument as something generic like an Tagging @gmlueck, @rolandschulz, @mkinsner, @jbrodman in case they have any ideas or just fancy some bikeshedding. |
From the ISO C++ SG1+LEWG meeting yesterday, this is more http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2300r1.html which is fashionable nowadays... |
Thanks, @keryell. @capatober, based on this... I think we should describe the first argument as an "execution context" and point to this paper. We can say in the introduction or some other non-normative section that:
What do you think? |
I would like to add my 2 cents. Since I am regularly participating in LEWG and also working on the Executors that we probably should not very aggressively react on any change that is going on in C++ standard committee. As far as I can tell P2300 is intended to be the evolution of P0443 but in fact it is not so far. The discussion is still in progress and we don't know for sure what outcome would be. I am saying that we definitely should follow what is going on with Executors (and basically we are doing that with my and other people efforts) but linking to the paper might be undesirable in our context because things are changing quite quickly including the terminology. I think we should operate our terms for Group Sort to achieve what we need keeping in mind Executors to be able to develop as close model to them as possible in future but don't try to use the terminology and the paradigms from there too much.
Execution context basically represents the some hardware unit (e.g. CPU, GPU) to my understanding. Could you please elaborate what do you mean by your first bullet?
Completely agree from asynchrony standpoint. However, scheduler is the representation of Execution Context so, I would like to see the answer on my previous question to make sure we are on the same page. Anyway, I agree that we should not overcomplicate our initial proposal, especially keeping in mind that Executors are not stable yet.
Dead sure. |
I agree we do not need to track too closely uncommitted ISO C++ discussions. |
Thanks for clarifying the status of this paper. If you think it's too early to link to it, we shouldn't link to it.
I may have misunderstood the authors' intent, or be reading too much between the lines. Section 4.1 says:
In the case of a "specific resource", something like a CPU or GPU has more properties than the proposal discusses (e.g. a fixed amount of hardware parallelism, some finite amount of memory, etc). So, my thinking is that a "more abstract one" could include other properties too (e.g. the current thread of execution with this memory pool). The proposal seems deliberately vague about what an execution context is or isn't -- since a user must create a scheduler in order to interact with the other functions it introduces -- but there seems to be room to interpret things the way I have. |
Basing on that I'd like to clarify which proposal can be for Group Sort extension. I wouldn't suggest to make a strong connection between Executors proposal and class that encapsulates group and memory because (as it's mentioned above) things are changing very quickly. We can duplicate some ideas here that we like and that make sense. Then align with Executors proposal when it's finalized, later. Let's choose a name |
Signed-off-by: Fedorov, Andrey <andrey.fedorov@intel.com>
@capatober, could you confirm that I can PR title and description for the commit message title/description? |
I think title and description are valid despite the multiple changes here. I don't see what can prevent us from having it for the commit message. Let me know if some additional [tags] for title are needed for the commit message |
Previously it was found that we need an additional local memory for more performant implementation. The proposal fixes it.
Previous version of proposal was discussed here: #3514
Design review: #3754
Signed-off-by: Fedorov, Andrey andrey.fedorov@intel.com