[SYCL] Improve range reduction performance on CPU #6164


Merged · 17 commits merged into intel:sycl on Aug 16, 2022

Conversation

Pennycook
Contributor

The performance improvement is the result of two complementary changes:

  1. Using an alternative heuristic to select work-group size on the CPU.
    Keeping work-groups small simplifies combination of partial results
    and reduces the number of temporary variables.

  2. Adjusting the mapping of the range to an ND-range.
    Breaking the range into contiguous chunks that are assigned to each
    work-group results in streaming patterns that are better suited to
    prefetching hardware (see the sketch below).

Signed-off-by: John Pennycook <john.pennycook@intel.com>
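To make the two changes concrete, here is a minimal, self-contained sketch of the kind of mapping described above. It is illustrative only: the work-group count, chunk size, and kernel below are assumptions for the example, not the PR's actual heuristic or code. Each work-group sweeps one contiguous chunk of the range, and the work-group size is kept small for the CPU.

```c++
#include <sycl/sycl.hpp>
#include <cstddef>
#include <iostream>
#include <numeric>

int main() {
  constexpr size_t N = 1 << 20;
  sycl::queue Q;
  int *Data = sycl::malloc_shared<int>(N, Q);
  long long *Sum = sycl::malloc_shared<long long>(1, Q);
  std::iota(Data, Data + N, 0);
  *Sum = 0;

  // Illustrative values only; the PR derives these from device queries and
  // its own heuristics.
  constexpr size_t WGSize = 16;    // change (1): keep CPU work-groups small
  constexpr size_t NumGroups = 64; // bounded number of work-groups
  constexpr size_t PerGroup = (N + NumGroups - 1) / NumGroups;

  Q.parallel_for(sycl::nd_range<1>{NumGroups * WGSize, WGSize},
                 sycl::reduction(Sum, sycl::plus<long long>()),
                 [=](sycl::nd_item<1> It, auto &Acc) {
                   // Change (2): each work-group sweeps one contiguous chunk,
                   // giving a streaming access pattern.
                   const size_t Begin = It.get_group(0) * PerGroup;
                   size_t End = Begin + PerGroup;
                   if (End > N)
                     End = N;
                   for (size_t I = Begin + It.get_local_id(0); I < End;
                        I += WGSize)
                     Acc += Data[I];
                 })
      .wait();

  std::cout << "Sum = " << *Sum << "\n"; // expect N*(N-1)/2
  sycl::free(Data, Q);
  sycl::free(Sum, Q);
  return 0;
}
```

On typical CPU backends, consecutive work-items in a group map to vector lanes, so striding through the group's chunk by the work-group size keeps the vector loads contiguous and prefetch-friendly.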

@Pennycook Pennycook added the performance label May 17, 2022
@Pennycook Pennycook requested a review from a team as a code owner May 17, 2022 20:16
@Pennycook Pennycook requested a review from aelovikov-intel May 17, 2022 20:16
Signed-off-by: John Pennycook <john.pennycook@intel.com>
Added TODO explaining why it might be necessary to restore it.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
// an environment variable is provided to allow users to override this
// behavior.
if (detail::getDeviceFromHandler(*this).is_cpu()) {
  if (const char *MaxWGSizeEnv = getenv("SYCL_CPU_REDUCTION_MAX_WG_SIZE")) {
Contributor

Do we need to update a file documenting other env variables with this one's description?

Also, I'm not fluent with the codebase, but so far all the settings I've seen were done using static variables so that the value is read exactly once. How common is the approach taken here?
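For context, a minimal sketch of the read-once pattern being referred to, assuming a plain function-local static (the patch ultimately uses the SYCLConfig machinery from config.hpp instead; the helper name and the fallback value of 16 below are only illustrative):

```c++
#include <cstddef>
#include <cstdlib>
#include <string>

// Reads SYCL_CPU_REDUCTION_MAX_WG_SIZE exactly once; later calls reuse the
// cached value. Hypothetical helper, not the code added by this PR.
static std::size_t getCPUReductionMaxWGSize() {
  static const std::size_t Cached = [] {
    if (const char *Env = std::getenv("SYCL_CPU_REDUCTION_MAX_WG_SIZE"))
      return static_cast<std::size_t>(std::stoul(Env)); // throws on bad input
    return std::size_t{16}; // small CPU work-group, as in the heuristic above
  }();
  return Cached;
}
```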

Contributor
@v-klochkov v-klochkov May 18, 2022

Usually, such env-vars are handled as here: https://github.com/intel/llvm/blob/sycl/sycl/source/detail/config.hpp#L186

Wouldn't it be useful to have the environment variable redefining that value for GPU or ACC as well?

Contributor

Yeah, absolutely. I have changed the name of the environment variable slightly, added a GPU and ACC variant, defined them as SYCLConfig, and documented them in sycl/doc/EnvironmentVariables.md.

I have also added some additional restrictions on the environment variables: an error is thrown if a value is invalid, and if a value exceeds the maximum on the device, the device maximum is used instead. The checks have also been moved into reduGetMaxWGSize, partly because configs need to be in source and partly because it seems to be the right place to overwrite the maximum work-group size (also reducing unneeded work).

@steffenlarsen
Contributor

XPTI makes assumptions about the number of arguments in reduction kernels. Since this patch adds PerGroup as an implicit argument, these assumptions fail. intel/llvm-test-suite#1040 updates these values.

An alternative is to move the calculation of PerGroup into the kernel, which means fewer arguments to the kernel but repeated work for each work-item. @v-klochkov - Thoughts?
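To illustrate the trade-off being weighed, a small sketch under stated assumptions: PerGroup is taken to mean the number of range elements each work-group covers, and all names and values below are hypothetical, not the PR's actual code.

```c++
#include <sycl/sycl.hpp>
#include <cstddef>
#include <iostream>

int main() {
  const std::size_t NWorkItems = 1000; // total range size
  const std::size_t WGSize = 16;
  const std::size_t NWorkGroups = 4;

  sycl::queue Q;
  std::size_t *Out = sycl::malloc_shared<std::size_t>(2, Q);

  // Option A (as in this patch): compute PerGroup on the host and let the
  // kernel capture it, which XPTI observes as one extra kernel argument.
  const std::size_t PerGroup = (NWorkItems + NWorkGroups - 1) / NWorkGroups;
  Q.parallel_for(sycl::nd_range<1>{NWorkGroups * WGSize, WGSize},
                 [=](sycl::nd_item<1> It) {
                   if (It.get_global_linear_id() == 0)
                     Out[0] = PerGroup;
                 })
      .wait();

  // Option B (the alternative): derive PerGroup inside the kernel from the
  // group range, saving the argument but repeating the division in every
  // work-item. NWorkItems is still captured here; the element count is
  // assumed to be available to the real kernel anyway.
  Q.parallel_for(sycl::nd_range<1>{NWorkGroups * WGSize, WGSize},
                 [=](sycl::nd_item<1> It) {
                   const std::size_t Groups = It.get_group_range(0);
                   const std::size_t PerGroupInKernel =
                       (NWorkItems + Groups - 1) / Groups;
                   if (It.get_global_linear_id() == 0)
                     Out[1] = PerGroupInKernel;
                 })
      .wait();

  std::cout << Out[0] << " == " << Out[1] << "\n"; // both print 250
  sycl::free(Out, Q);
  return 0;
}
```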

@steffenlarsen
Contributor

/verify with intel/llvm-test-suite#1040

…move overwrite to source

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@Pennycook Pennycook requested a review from a team as a code owner May 26, 2022 15:59
steffenlarsen and others added 3 commits May 27, 2022 09:05
Co-authored-by: Vyacheslav Klochkov <vyacheslav.n.klochkov@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen
Contributor

It seems that the generalization of max work-group size causes issues with user-defined work-group sizes. I think maybe we have this backwards: instead of limiting the max work-group size, it should set a "preferred" work-group size that will be picked. This solution would move the early exits to reduComputeWGSize rather than reduGetMaxWGSize.

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen
Contributor

It seems that the generalization of max work-group size causes issues with user-defined work-group sizes. I think maybe we have this backwards: instead of limiting the max work-group size, it should set a "preferred" work-group size that will be picked. This solution would move the early exits to reduComputeWGSize rather than reduGetMaxWGSize.

I have made changes to the approach: instead of dictating the maximum work-group size, the configs and the 16 work-group size for CPU are now a "preferred" work-group size, meaning that when the user hasn't defined a work-group size (i.e. a range reduction rather than an nd_range one) the preferred work-group size is used.

As reduComputeWGSize would be too late to do this, the implementation lives in a new function (reduGetPreferredWGSize), which returns reduGetMaxWGSize if no other configuration is selected, avoiding the overhead when it is not needed.

@v-klochkov & @aelovikov-intel - Let me know what you think. If we agree that this is the way to go, I will change the description of the PR.
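A rough sketch of the selection order being described, under stated assumptions: the "Sketch" helpers and stub bodies below are placeholders for the real internals, not the actual implementation.

```c++
#include <sycl/sycl.hpp>
#include <cstddef>
#include <iostream>

// Placeholder for the existing maximum query (the real one is more involved).
static std::size_t reduGetMaxWGSizeSketch(sycl::queue &Q) {
  return Q.get_device().get_info<sycl::info::device::max_work_group_size>();
}

// Placeholder for the env-based config lookup; returns 0 when the user has
// not set a preference for this device type.
static std::size_t getConfiguredPreferredWGSize(const sycl::device &) {
  return 0;
}

// Selection order described above: an explicit user preference wins, the
// small-work-group CPU heuristic (16) applies otherwise, and any other device
// falls back to the existing maximum so no extra overhead is introduced.
static std::size_t reduGetPreferredWGSizeSketch(sycl::queue &Q) {
  sycl::device Dev = Q.get_device();
  if (std::size_t Pref = getConfiguredPreferredWGSize(Dev))
    return Pref;
  if (Dev.is_cpu())
    return 16;
  return reduGetMaxWGSizeSketch(Q);
}

int main() {
  sycl::queue Q;
  std::cout << "Preferred WG size: " << reduGetPreferredWGSizeSketch(Q) << "\n";
  return 0;
}
```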

@aelovikov-intel
Contributor

@v-klochkov & @aelovikov-intel - Let me know what you think.

Your explanatory comment sounds very reasonable, but I'm afraid I'm not familiar enough with the code to make any decisions yet. Letting @v-klochkov answer...

v-klochkov
v-klochkov previously approved these changes May 31, 2022
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen
Contributor

Please add a test for the new env variable specifying the preferred group size.

Sorry for the delay. I have added a test to verify that the configuration correctly parses valid configuration values.

@steffenlarsen
Contributor

/verify with intel/llvm-test-suite#1040

return Value.Accelerator;
default:
throw INVALID_CONFIG_EXCEPTION(
BaseT, "Device types must be \"cpu\", \"gpu\", \"acc\", or \"*\".");
Contributor

We aren't handling * here, are we?

Contributor

We are not, but the message is most relevant for users rather than internal use, i.e. if a user gives a wrong device type we want to also include * in the message to make sure they know it's an option. If get is called with an invalid value for this, the message doesn't help the user much anyway.

I could try and move it out of this function, but it will need a bit of refactoring.

if (ext::oneapi::detail::reduCGFuncForRange<KernelName>(
*this, KernelFunc, Range, MaxWGSize, NumConcurrentWorkGroups,
*this, KernelFunc, Range, PrefWGSize, NumConcurrentWorkGroups,
Contributor

Final decision is based on both this new env config and NumConcurrentWorkGroups above, as we'll do

bool reduCGFuncForRange(handler &CGH, KernelType KernelFunc,
                        const range<Dims> &Range, size_t MaxWGSize,
                        uint32_t NumConcurrentWorkGroups, Reduction &Redu) {
  size_t NWorkItems = Range.size();
  size_t WGSize = std::min(NWorkItems, MaxWGSize);
  size_t NWorkGroups = NWorkItems / WGSize;
  if (NWorkItems % WGSize)
    NWorkGroups++;
  size_t MaxNWorkGroups = NumConcurrentWorkGroups;
  NWorkGroups = std::min(NWorkGroups, MaxNWorkGroups);

Does it match your intent? Or maybe we should uplift that piece of code from reduction.hpp to here and make the env. config have ultimate control?

Up to you and not a blocker.

Contributor

I believe that is in line with the expected behavior. The configuration is only intended as a preference, so if the implementation knows better it can make a more educated decision, like here where it will go for a smaller work-group size if there are fewer items than the setting.

// from overwriting the program-wide configurations.
TEST(ConfigTests, CheckPreferredWGSizeConfigProcessing) {
  SetAndCheck("cpu:32", 32, 0, 0);
  SetAndCheck("gpu:32", 0, 32, 0);
Contributor

So we can't set different values to different GPUs. I think this is fine for this PR, but maybe we'd need to change it later.

Any thoughts?

Contributor

We could indeed extend it in the future, but I would also like to see it become a property instead.

Co-authored-by: aelovikov-intel <andrei.elovikov@intel.com>
@steffenlarsen
Contributor

/verify with intel/llvm-test-suite#1040

@steffenlarsen steffenlarsen requested a review from pvchupin August 16, 2022 12:14
@pvchupin pvchupin merged commit 3323da6 into intel:sycl Aug 16, 2022
pvchupin pushed a commit to intel/llvm-test-suite that referenced this pull request Aug 16, 2022
intel/llvm#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.
bader pushed a commit that referenced this pull request Dec 13, 2022
* ext::oneapi::reduction removed in
#6634
* sycl::item in kernel supported since
#7478
* sycl::range + many reductions implemented in
#7456
* CPU reduction performance implemented in
#6164
* span support implemented in #6019

There might be other things that have been implemented already, but I
cannot immediately identify them, if any.
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Feb 23, 2023
intel#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…m-test-suite#1040)

intel#6164 adds an additional implicit argument to reduction kernels, so the XPTI test making assumptions about the number of arguments in a reduction kernel must have the assumptions updated.