Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Add reduction overloads accepting span #6019

Merged
merged 22 commits into from
May 2, 2022
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 52 additions & 0 deletions sycl/include/CL/sycl/reduction.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -100,5 +100,57 @@ reduction(T *Var, const T &Identity, BinaryOperation Combiner,
return {Var, Identity, Combiner, InitializeToIdentity};
}

#if __cplusplus >= 201703L
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is there a better way to define the specialization of reducer to avoid potential ABI breakage in the future?
    This implementation works by privatizing the entire span on each work-item, which will fail for sufficiently large spans. Eventually we'll need to choose a different reducer implementation based on the Extent parameter. Should reducer and reducer_element be SFINAE'd away for certain Extent values, even in this first implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a specific scenario in mind where this would break ABI? Seems to me like this is all in headers and since the extent is a template parameter I am not sure if user-code incompatibility should be much of a concern.
That said, if we know what Extent we need to stop at, I don't mind an enable_if masking the definition out or alternatively a static_assert telling the user that the given Extent isn't yet supported. Is the Extent limit resource-constrained? If so, I think we may as well let the user do as wild reductions as their hardware allows until we have a good alternative path to take based on some heuristic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a specific scenario in mind where this would break ABI? Seems to me like this is all in headers and since the extent is a template parameter I am not sure if user-code incompatibility should be much of a concern.

No specific scenario in mind -- I'm just wary of committing a reducer implementation and somehow being stuck with it.

Is the Extent limit resource-constrained? If so, I think we may as well let the user do as wild reductions as their hardware allows until we have a good alternative path to take based on some heuristic.

The current implementation assumes that there are sufficient resources to give each work-item its own copy of the range being reduced (in private memory). So I think you're right that the limit will be different on different hardware.

I can foresee a few different implementations of the reducer class:

  1. The current implementation (work-item privatization)
  2. An implementation that uses atomics to a partially privatized copy (e.g. work-group privatization)
  3. An implementation that uses atomics directly to the reduction variable (no privatization)

I was thinking of doing something like creating an enum for different reduction implementations, and then basing the SFINAE on that. So the current implementation would only be available if reduction_algorithm == work_item_privatization or something like that, and every reduction we create today would have that trait set. If we change the reduction() function later to create reductions with different algorithms for different extents, any reducer already compiled to use work_item_privatization would still work.

Do you think something like that would make sense?

Copy link
Contributor

@steffenlarsen steffenlarsen Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me. Quite frankly I expected we would eventually reach a point where we would need to expose different reduction strategies to the user and what you suggest is pretty much how I expected we would do it.

How exactly we would want to approach that though is a good question. It sounds like this choice would be user-facing and as such we would probably need an extension. That said, it seems like it would be a great excuse to bring compile-time properties to reduction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've restructured things to avoid code duplication in 111ea2f. While I was doing it I've added a few new template arguments that we can use for specialization.

I'm using an algorithm class to store all the implementation-specific stuff because it gives us a bit more flexibility than an enum, but I think once we've got a few more of these we should definitely think about exposing a few compile-time properties. We've already had a request for a few different properties (see #1621). A good test of this new class structure would be whether we can move the existing SYCL_REDUCTION_DETERMINISTIC into something like a deterministic_reduction_algorithm class and just plug it in.

I'm hoping that the combination of Dims, Extent and Algorithm will be enough to do the filtering for the parameter pack. 🤞

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. How should this be extended to the reduction form accepting a parameter pack?
    The current parameter pack implementation performs each step of a reduction algorithm for all reductions (i.e. initializes all temporary storage, computes all partial sums, updates all global storage) which prevents looping over the elements. I can't decide whether it's better to try and squeeze the new reducer functionality into the existing framework (i.e. by using more memory than required) or if we need to revisit the design of the parameter pack solution knowing that SYCL 2020 has added different reducer types that may require different algorithms. (I'm leaning towards the second option).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to have it as part of a separate PR, but I think in the end it would definitely make sense to have the variadic reductions do with span as well. If you think retrofitting the current implementation to the extent-based iteration strategy is more hassle than it is worth, I don't think there's a reason not to consider a refactor of the implementation. I suspect @v-klochkov has a better mental model of the magical variadic version, so maybe he has some ideas of feasible ways to make it work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, great! I do think it would be a lot of work to retrofit things -- the resulting code wouldn't be very efficient for spans, and I suspect we'd just be putting off refactoring work that we'll need to do later.

If there was some way to filter out different kinds of reductions from the pack, I think we could implement something quite clean... Scalar reductions could use the existing code, the current span reductions could be forwarded somewhere else, and if we introduce a new kind of reduction/algorithm later we could handle them with a similar filter. I have no idea how to go about implementing such a thing, though -- @v-klochkov, does that sound feasible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the question is to add sycl::span to existing variadic-pack implementation, then it should be doable.
If we also want to use different algorithm per each of reduction in the pack depending on the type and binary-operation, then it is more complex work. I thought it would be variadic pack filters, similar to those (or may be exactly those):

/// For each index 'I' from the given indices pack 'Is' this function returns
/// an index sequence consisting of only those 'I's for which the 'FunctorT'
/// applied to 'T[I]' returns true.
template <typename... T, typename FunctorT, size_t... Is>
constexpr auto filterSequence(FunctorT F, std::index_sequence<Is...> Indices) {
  return filterSequenceHelper<T...>(F, Indices);
}

That work also requires some caution about the device code-quality and checking how those tricks with sequences are translated to the final code. On the C++ higher level I hoped FE can do/eliminate most of work/code because all values would be compile time known, and generate something simple at the end, but that is not for sure...

/// Constructs a reduction object using the reduction variable referenced by
/// the given sycl::span \p Span, reduction operation \p Combiner, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: It is not problem of your patch, I see this comment is copy-pasted. I am just adding this comment to here for future reference/reminder.

The reference "\p Combiner" in the comment is invalid because the arguments list does not have it, it has only the type 'BinaryOperation'

/// optional reduction properties.
/// The reduction algorithm may be less efficient for this variant as the
/// reduction identity is not known statically and it is not provided by user.
Pennycook marked this conversation as resolved.
Show resolved Hide resolved
template <typename T, size_t Extent, typename BinaryOperation>
std::enable_if_t<Extent != dynamic_extent &&
has_known_identity<BinaryOperation, T>::value,
ext::oneapi::detail::reduction_impl<span<T, Extent>,
BinaryOperation, 1, true>>
reduction(span<T, Extent> Span, BinaryOperation,
const property_list &PropList = {}) {
bool InitializeToIdentity =
PropList.has_property<property::reduction::initialize_to_identity>();
return {Span, InitializeToIdentity};
}

/// Constructs a reduction object using the reduction variable referenced by
/// the given sycl::span \p Span, reduction operation \p Combiner, and
/// optional reduction properties.
/// The reduction algorithm may be less efficient for this variant as the
/// reduction identity is not known statically and it is not provided by user.
template <typename T, size_t Extent, typename BinaryOperation>
std::enable_if_t<Extent != dynamic_extent &&
!has_known_identity<BinaryOperation, T>::value,
ext::oneapi::detail::reduction_impl<span<T, Extent>,
BinaryOperation, 1, true>>
reduction(span<T, Extent> Span, BinaryOperation,
const property_list &PropList = {}) {
// TODO: implement reduction that works even when identity is not known.
(void)PropList;
throw runtime_error("Identity-less reductions with unknown identity are not "
"supported yet.",
PI_INVALID_VALUE);
}

/// Constructs a reduction object using the reduction variable referenced by
/// the given sycl::span \p Span, reduction identity value \p Identity,
/// reduction operation \p Combiner, and optional reduction properties.
template <typename T, size_t Extent, typename BinaryOperation>
std::enable_if_t<Extent != dynamic_extent,
ext::oneapi::detail::reduction_impl<span<T, Extent>,
BinaryOperation, 1, true>>
reduction(span<T, Extent> Span, const T &Identity, BinaryOperation Combiner,
const property_list &PropList = {}) {
bool InitializeToIdentity =
PropList.has_property<property::reduction::initialize_to_identity>();
return {Span, Identity, Combiner, InitializeToIdentity};
}
#endif

} // namespace sycl
} // __SYCL_INLINE_NAMESPACE(cl)
Loading