-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] Improve range reduction performance on CPU #6164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Improve range reduction performance on CPU #6164
Conversation
The performance improvement is the result of two complementary changes: 1) Using an alternative heuristic to select work-group size on the CPU. Keeping work-groups small simplifies combination of partial results and reduces the number of temporary variables. 2) Adjusting the mapping of the range to an ND-range. Breaking the range into contiguous chunks that are assigned to each results in streaming patterns that are better-suited to prefetching hardware. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Added TODO explaining why it might be necessary to restore it. Signed-off-by: John Pennycook <john.pennycook@intel.com>
sycl/include/CL/sycl/handler.hpp
Outdated
// an environment variable is provided to allow users to override this | ||
// behavior. | ||
if (detail::getDeviceFromHandler(*this).is_cpu()) { | ||
if (const char *MaxWGSizeEnv = getenv("SYCL_CPU_REDUCTION_MAX_WG_SIZE")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to update a file documenting other env variables with this one's description?
Also, I'm not fluent with the codebase, but so far all the settings I've seen were done using static variables so that the value is read exactly one. How common is the approach taken here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, such env-vars are handled as here: https://github.com/intel/llvm/blob/sycl/sycl/source/detail/config.hpp#L186
Wouldn't it be useful to have the environment variable redefining that value for GPU or ACC as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, absolutely. I have changed the name of the environment variable slightly, added a GPU and ACC variant, defined them as SYCLConfig
, and documented them in sycl/doc/EnvironmentVariables.md.
I have also added some additional restrictions on the environment variable, such as it throwing an error if it is invalid and if it exceeds the maximum on the device that will be used instead. The checks have also been moved to reduGetMaxWGSize
, partly because configs needs to be in source and partly because it seems to be the right place to overwrite the maximum work groups side (also reducing unneeded work.)
XPTI makes assumptions about the number of arguments in reduction kernels. Since this patch adds the An alternative is to move the calculation of |
/verify with intel/llvm-test-suite#1040 |
…move overwrite to source Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Co-authored-by: Vyacheslav Klochkov <vyacheslav.n.klochkov@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
It seems that with the generalization of max work-group size causes issues with user-defined work-group sizes. I think maybe we have this backwards and instead of limiting the max work-group size it should set a "preferred" work-group size that will be picked. This solution would move the early exits to |
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
I have made changes to the approach so instead of dictating the maximum work-group size, the configs and the 16 wg size for CPU are now a "preferred" work-group size instead, meaning that when the user haven't defined a work-group (i.e. range reduction rather than nd_range) the preferred work-group size is used instead. As @v-klochkov & @aelovikov-intel - Let me know what you think. If we agree that this is the way to go, I will change the description of the PR. |
Your explanation comment sounds very reasonable, but I'm afraid I'm not familiar enough with the code to make any decisions yet. Letting @v-klochkov to answer... |
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Sorry for the delay. I have added a test to verify that the configuration correctly parses valid configuration values. |
/verify with intel/llvm-test-suite#1040 |
return Value.Accelerator; | ||
default: | ||
throw INVALID_CONFIG_EXCEPTION( | ||
BaseT, "Device types must be \"cpu\", \"gpu\", \"acc\", or \"*\"."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't handling *
here, are we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not, but the message is most relevant for users rather than internal. I.e. if a user gives a wrong device type we want to also include * in the message to make sure they know it's an option. If get uses an invalid value for this then this doesn't help the user much anyway.
I could try and move it out of this function, but it will need a bit of refactoring.
if (ext::oneapi::detail::reduCGFuncForRange<KernelName>( | ||
*this, KernelFunc, Range, MaxWGSize, NumConcurrentWorkGroups, | ||
*this, KernelFunc, Range, PrefWGSize, NumConcurrentWorkGroups, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final decision is based on both this new env config and NumConcurrentWorkGroups
above, as we'll do
bool reduCGFuncForRange(handler &CGH, KernelType KernelFunc,
const range<Dims> &Range, size_t MaxWGSize,
uint32_t NumConcurrentWorkGroups, Reduction &Redu) {
size_t NWorkItems = Range.size();
size_t WGSize = std::min(NWorkItems, MaxWGSize);
size_t NWorkGroups = NWorkItems / WGSize;
if (NWorkItems % WGSize)
NWorkGroups++;
size_t MaxNWorkGroups = NumConcurrentWorkGroups;
NWorkGroups = std::min(NWorkGroups, MaxNWorkGroups);
Does it match your intent? Or maybe we should uplift that piece of code from reduction.hpp
to here and make the env. config have ultimate control?
Up to you and not a blocker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that is in line with the expected behavior. The configuration is only intended as a preference, so if the implementation knows better it can make a more educated decision, like here where it will go for a smaller work-group size if there are fewer items than the setting.
// from overwriting the program-wide configurations. | ||
TEST(ConfigTests, CheckPreferredWGSizeConfigProcessing) { | ||
SetAndCheck("cpu:32", 32, 0, 0); | ||
SetAndCheck("gpu:32", 0, 32, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can't set different values to different GPUs. I think this is fine for this PR, but maybe we'd need to change it later.
Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could extend it in the future indeed, but I would also like to see it as a property in the future instead.
Co-authored-by: aelovikov-intel <andrei.elovikov@intel.com>
/verify with intel/llvm-test-suite#1040 |
intel/llvm#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.
* ext::oneapi::reduction removed in #6634 * sycl::item in kernel supported since #7478 * sycl::range + many reductions implemented in #7456 * CPU reduction performance implemented in #6164 * span support implemented in #6019 There might be other things that have been implemented already, but I cannot immediately identify them, if any.
intel#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.
…m-test-suite#1040) intel#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.
The performance improvement is the result of two complementary changes:
Using an alternative heuristic to select work-group size on the CPU.
Keeping work-groups small simplifies combination of partial results
and reduces the number of temporary variables.
Adjusting the mapping of the range to an ND-range.
Breaking the range into contiguous chunks that are assigned to each
results in streaming patterns that are better-suited to prefetching
hardware.
Signed-off-by: John Pennycook john.pennycook@intel.com