[SYCL] Improve range reduction performance on CPU #6164


Merged · 17 commits merged into intel:sycl on Aug 16, 2022

Conversation

Pennycook
Contributor

The performance improvement is the result of two complementary changes:

  1. Using an alternative heuristic to select work-group size on the CPU.
    Keeping work-groups small simplifies combination of partial results
    and reduces the number of temporary variables.

  2. Adjusting the mapping of the range to an ND-range.
    Breaking the range into contiguous chunks that are assigned to each
    work-group results in streaming patterns that are better suited to
    prefetching hardware (see the sketch below).

Signed-off-by: John Pennycook <john.pennycook@intel.com>
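To make the two changes concrete, here is a minimal, self-contained sketch of the kind of mapping described above. It is illustrative only: the work-group count, chunk size, and kernel below are assumptions for the example, not the PR's actual heuristic or code. Each work-group sweeps one contiguous chunk of the range, and the work-group size is kept small for the CPU.

```c++
#include <sycl/sycl.hpp>
#include <cstddef>
#include <iostream>
#include <numeric>

int main() {
  constexpr size_t N = 1 << 20;
  sycl::queue Q;
  int *Data = sycl::malloc_shared<int>(N, Q);
  long long *Sum = sycl::malloc_shared<long long>(1, Q);
  std::iota(Data, Data + N, 0);
  *Sum = 0;

  // Illustrative values only; the PR derives these from device queries and
  // its own heuristics.
  constexpr size_t WGSize = 16;    // change (1): keep CPU work-groups small
  constexpr size_t NumGroups = 64; // bounded number of work-groups
  constexpr size_t PerGroup = (N + NumGroups - 1) / NumGroups;

  Q.parallel_for(sycl::nd_range<1>{NumGroups * WGSize, WGSize},
                 sycl::reduction(Sum, sycl::plus<long long>()),
                 [=](sycl::nd_item<1> It, auto &Acc) {
                   // Change (2): each work-group sweeps one contiguous chunk,
                   // giving a streaming access pattern.
                   const size_t Begin = It.get_group(0) * PerGroup;
                   size_t End = Begin + PerGroup;
                   if (End > N)
                     End = N;
                   for (size_t I = Begin + It.get_local_id(0); I < End;
                        I += WGSize)
                     Acc += Data[I];
                 })
      .wait();

  std::cout << "Sum = " << *Sum << "\n"; // expect N*(N-1)/2
  sycl::free(Data, Q);
  sycl::free(Sum, Q);
  return 0;
}
```

On typical CPU backends, consecutive work-items in a group map to vector lanes, so striding through the group's chunk by the work-group size keeps the vector loads contiguous and prefetch-friendly.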

@Pennycook Pennycook added the performance label May 17, 2022
@Pennycook Pennycook requested a review from a team as a code owner May 17, 2022 20:16
@Pennycook Pennycook requested a review from aelovikov-intel May 17, 2022 20:16
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Added TODO explaining why it might be necessary to restore it.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
// an environment variable is provided to allow users to override this
// behavior.
if (detail::getDeviceFromHandler(*this).is_cpu()) {
  if (const char *MaxWGSizeEnv = getenv("SYCL_CPU_REDUCTION_MAX_WG_SIZE")) {
Contributor

Do we need to update a file documenting other env variables with this one's description?

Also, I'm not fluent with the codebase, but so far all the settings I've seen were done using static variables so that the value is read exactly once. How common is the approach taken here?
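For context, a minimal sketch of the read-once pattern being referred to, assuming a plain function-local static (the patch ultimately uses the SYCLConfig machinery from config.hpp instead; the helper name and the fallback value of 16 below are only illustrative):

```c++
#include <cstddef>
#include <cstdlib>
#include <string>

// Reads SYCL_CPU_REDUCTION_MAX_WG_SIZE exactly once; later calls reuse the
// cached value. Hypothetical helper, not the code added by this PR.
static std::size_t getCPUReductionMaxWGSize() {
  static const std::size_t Cached = [] {
    if (const char *Env = std::getenv("SYCL_CPU_REDUCTION_MAX_WG_SIZE"))
      return static_cast<std::size_t>(std::stoul(Env)); // throws on bad input
    return std::size_t{16}; // small CPU work-group, as in the heuristic above
  }();
  return Cached;
}
```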

Contributor
@v-klochkov v-klochkov May 18, 2022

Usually, such env-vars are handled as here: https://github.com/intel/llvm/blob/sycl/sycl/source/detail/config.hpp#L186

Wouldn't it be useful to have the environment variable redefining that value for GPU or ACC as well?

Contributor

Yeah, absolutely. I have changed the name of the environment variable slightly, added a GPU and ACC variant, defined them as SYCLConfig, and documented them in sycl/doc/EnvironmentVariables.md.

I have also added some additional restrictions on the environment variables: an error is thrown if a value is invalid, and if a value exceeds the maximum on the device, the device maximum is used instead. The checks have also been moved into reduGetMaxWGSize, partly because configs need to be in source and partly because it seems to be the right place to overwrite the maximum work-group size (also reducing unneeded work).

@steffenlarsen
Contributor

XPTI makes assumptions about the number of arguments in reduction kernels. Since this patch adds PerGroup as an implicit argument, these assumptions fail. intel/llvm-test-suite#1040 updates these values.

An alternative is to move the calculation of PerGroup into the kernel, which means fewer arguments to the kernel but repeated work for each work-item. @v-klochkov - Thoughts?
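To illustrate the trade-off being weighed, a small sketch under stated assumptions: PerGroup is taken to mean the number of range elements each work-group covers, and all names and values below are hypothetical, not the PR's actual code.

```c++
#include <sycl/sycl.hpp>
#include <cstddef>
#include <iostream>

int main() {
  const std::size_t NWorkItems = 1000; // total range size
  const std::size_t WGSize = 16;
  const std::size_t NWorkGroups = 4;

  sycl::queue Q;
  std::size_t *Out = sycl::malloc_shared<std::size_t>(2, Q);

  // Option A (as in this patch): compute PerGroup on the host and let the
  // kernel capture it, which XPTI observes as one extra kernel argument.
  const std::size_t PerGroup = (NWorkItems + NWorkGroups - 1) / NWorkGroups;
  Q.parallel_for(sycl::nd_range<1>{NWorkGroups * WGSize, WGSize},
                 [=](sycl::nd_item<1> It) {
                   if (It.get_global_linear_id() == 0)
                     Out[0] = PerGroup;
                 })
      .wait();

  // Option B (the alternative): derive PerGroup inside the kernel from the
  // group range, saving the argument but repeating the division in every
  // work-item. NWorkItems is still captured here; the element count is
  // assumed to be available to the real kernel anyway.
  Q.parallel_for(sycl::nd_range<1>{NWorkGroups * WGSize, WGSize},
                 [=](sycl::nd_item<1> It) {
                   const std::size_t Groups = It.get_group_range(0);
                   const std::size_t PerGroupInKernel =
                       (NWorkItems + Groups - 1) / Groups;
                   if (It.get_global_linear_id() == 0)
                     Out[1] = PerGroupInKernel;
                 })
      .wait();

  std::cout << Out[0] << " == " << Out[1] << "\n"; // both print 250
  sycl::free(Out, Q);
  return 0;
}
```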

@steffenlarsen
Contributor

/verify with intel/llvm-test-suite#1040

…move overwrite to source

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@Pennycook Pennycook requested a review from a team as a code owner May 26, 2022 15:59
steffenlarsen and others added 3 commits May 27, 2022 09:05
Co-authored-by: Vyacheslav Klochkov <vyacheslav.n.klochkov@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen
Contributor

It seems that the generalization of max work-group size causes issues with user-defined work-group sizes. I think maybe we have this backwards: instead of limiting the max work-group size, it should set a "preferred" work-group size that will be picked. This solution would move the early exits to reduComputeWGSize rather than reduGetMaxWGSize.

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen
Contributor

It seems that the generalization of max work-group size causes issues with user-defined work-group sizes. I think maybe we have this backwards: instead of limiting the max work-group size, it should set a "preferred" work-group size that will be picked. This solution would move the early exits to reduComputeWGSize rather than reduGetMaxWGSize.

I have made changes to the approach: instead of dictating the maximum work-group size, the configs and the 16 work-group size for CPU are now a "preferred" work-group size, meaning that when the user hasn't defined a work-group size (i.e. a range reduction rather than an nd_range one) the preferred work-group size is used.

As reduComputeWGSize would be too late to do this, the implementation lives in a new function (reduGetPreferredWGSize), which returns reduGetMaxWGSize if no other configuration is selected, avoiding the overhead when it is not needed.

@v-klochkov & @aelovikov-intel - Let me know what you think. If we agree that this is the way to go, I will change the description of the PR.
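A rough sketch of the selection order being described, under stated assumptions: the "Sketch" helpers and stub bodies below are placeholders for the real internals, not the actual implementation.

```c++
#include <sycl/sycl.hpp>
#include <cstddef>
#include <iostream>

// Placeholder for the existing maximum query (the real one is more involved).
static std::size_t reduGetMaxWGSizeSketch(sycl::queue &Q) {
  return Q.get_device().get_info<sycl::info::device::max_work_group_size>();
}

// Placeholder for the env-based config lookup; returns 0 when the user has
// not set a preference for this device type.
static std::size_t getConfiguredPreferredWGSize(const sycl::device &) {
  return 0;
}

// Selection order described above: an explicit user preference wins, the
// small-work-group CPU heuristic (16) applies otherwise, and any other device
// falls back to the existing maximum so no extra overhead is introduced.
static std::size_t reduGetPreferredWGSizeSketch(sycl::queue &Q) {
  sycl::device Dev = Q.get_device();
  if (std::size_t Pref = getConfiguredPreferredWGSize(Dev))
    return Pref;
  if (Dev.is_cpu())
    return 16;
  return reduGetMaxWGSizeSketch(Q);
}

int main() {
  sycl::queue Q;
  std::cout << "Preferred WG size: " << reduGetPreferredWGSizeSketch(Q) << "\n";
  return 0;
}
```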

@aelovikov-intel
Contributor

@v-klochkov & @aelovikov-intel - Let me know what you think.

Your explanatory comment sounds very reasonable, but I'm afraid I'm not familiar enough with the code to make any decisions yet. Letting @v-klochkov answer...

v-klochkov
v-klochkov previously approved these changes May 31, 2022
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen
Contributor

Please add a test for the new env variable specifying the preferred group size.

Sorry for the delay. I have added a test to verify that the configuration correctly parses valid configuration values.

@steffenlarsen
Contributor

/verify with intel/llvm-test-suite#1040

return Value.Accelerator;
default:
throw INVALID_CONFIG_EXCEPTION(
BaseT, "Device types must be \"cpu\", \"gpu\", \"acc\", or \"*\".");
Contributor

We aren't handling * here, are we?

Contributor

We are not, but the message is most relevant for users rather than internal use, i.e. if a user gives a wrong device type we want to also include * in the message to make sure they know it's an option. If get is called with an invalid value for this, the message doesn't help the user much anyway.

I could try and move it out of this function, but it will need a bit of refactoring.

if (ext::oneapi::detail::reduCGFuncForRange<KernelName>(
*this, KernelFunc, Range, MaxWGSize, NumConcurrentWorkGroups,
*this, KernelFunc, Range, PrefWGSize, NumConcurrentWorkGroups,
Contributor

Final decision is based on both this new env config and NumConcurrentWorkGroups above, as we'll do

bool reduCGFuncForRange(handler &CGH, KernelType KernelFunc,
                        const range<Dims> &Range, size_t MaxWGSize,
                        uint32_t NumConcurrentWorkGroups, Reduction &Redu) {
  size_t NWorkItems = Range.size();
  size_t WGSize = std::min(NWorkItems, MaxWGSize);
  size_t NWorkGroups = NWorkItems / WGSize;
  if (NWorkItems % WGSize)
    NWorkGroups++;
  size_t MaxNWorkGroups = NumConcurrentWorkGroups;
  NWorkGroups = std::min(NWorkGroups, MaxNWorkGroups);

Does it match your intent? Or maybe we should uplift that piece of code from reduction.hpp to here and make the env. config have ultimate control?

Up to you and not a blocker.

Contributor

I believe that is in line with the expected behavior. The configuration is only intended as a preference, so if the implementation knows better it can make a more educated decision, like here where it will go for a smaller work-group size if there are fewer items than the setting.

// from overwriting the program-wide configurations.
TEST(ConfigTests, CheckPreferredWGSizeConfigProcessing) {
  SetAndCheck("cpu:32", 32, 0, 0);
  SetAndCheck("gpu:32", 0, 32, 0);
Contributor

So we can't set different values to different GPUs. I think this is fine for this PR, but maybe we'd need to change it later.

Any thoughts?

Contributor

We could indeed extend it in the future, but I would also like to see it become a property instead.

Co-authored-by: aelovikov-intel <andrei.elovikov@intel.com>
@steffenlarsen
Contributor

/verify with intel/llvm-test-suite#1040

@steffenlarsen steffenlarsen requested a review from pvchupin August 16, 2022 12:14
@pvchupin pvchupin merged commit 3323da6 into intel:sycl Aug 16, 2022
pvchupin pushed a commit to intel/llvm-test-suite that referenced this pull request Aug 16, 2022
intel/llvm#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.
bader pushed a commit that referenced this pull request Dec 13, 2022
* ext::oneapi::reduction removed in
#6634
* sycl::item in kernel supported since
#7478
* sycl::range + many reductions implemented in
#7456
* CPU reduction performance implemented in
#6164
* span support implemented in #6019

There might be other things that have been implemented already, but I
cannot immediately identify them, if any.
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Feb 23, 2023
intel#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…m-test-suite#1040)

intel#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.