
[SYCL] Improve range reduction performance on CPU #6164


Merged 17 commits on Aug 16, 2022
27 changes: 27 additions & 0 deletions sycl/doc/EnvironmentVariables.md
@@ -24,6 +24,7 @@ compiler and runtime.
| `SYCL_RT_WARNING_LEVEL` | Positive integer | The higher the warning level, the more warnings and performance hints the runtime library may print. The default value is '0', which means no warning/hint messages from the runtime library are allowed. The value '1' enables performance warnings from device runtime/codegen. Values greater than 1 are reserved for future use. |
| `SYCL_USM_HOSTPTR_IMPORT` | Integer | Enable by specifying a non-zero value. Buffers created with a host pointer will result in host data promotion to USM, improving data transfer performance. To use this feature, also set SYCL_HOST_UNIFIED_MEMORY=1. |
| `SYCL_EAGER_INIT` | Integer | Enable by specifying a non-zero value. Tells the SYCL runtime to perform as much initialization as possible at object construction, as opposed to lazy initialization on the fly. This may mean doing some redundant work at warmup, but it ensures the fastest possible execution on the subsequent hot and reportable paths. It also instructs PI plugins to do the same. Default is "0". |
| `SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE` | See [below](#sycl_reduction_preferred_workgroup_size) | Controls the preferred work-group size of reductions. |

`(*) Note: Any means this environment variable is effective when set to any non-null value.`

@@ -60,6 +61,32 @@ Assuming a filter has all three elements of the triple, it selects only those de

Note that all device selectors will throw an exception if the filtered list of devices does not include a device that satisfies the selector. For instance, `SYCL_DEVICE_FILTER=cpu,level_zero` will cause `host_selector()` to throw an exception. `SYCL_DEVICE_FILTER` also restricts the SYCL RT to loading only the specified plugins. In particular, `SYCL_DEVICE_FILTER=level_zero` will cause the `cpu_selector` to throw an exception, since the SYCL RT will only load the `level_zero` backend, which does not support any CPU devices at this time. When multiple devices satisfy the filter (e.g., `SYCL_DEVICE_FILTER=gpu`), only one of them will be selected.

## `SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE`

This environment variable controls the preferred work-group size for reductions on the specified device types. Setting it affects all reductions that do not specify an explicit work-group size, on devices whose type appears in the variable's value.

The value of this environment variable is a comma-separated list of one or more configurations, where each configuration is a pair of the form "`device_type`:`size`" (without the quotes). Possible values of `device_type` are:
- `cpu`
- `gpu`
- `acc`
- `*`

`size` is a positive integer.

For a configuration `device_type`:`size`, the `device_type` element specifies the type of device the configuration applies to: `cpu` is for CPU devices, `gpu` for GPU devices, and `acc` for accelerator devices. If `device_type` is `*`, the configuration applies to all applicable device types. `size` denotes the preferred work-group size to be used for devices of the types specified by `device_type`.
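For example, the variable could be set as follows (`./my_app` is a hypothetical application binary; later entries override earlier ones, as described in the precedence note below):

```shell
# Prefer 16 work-items per reduction work-group on all device types,
# except CPUs, where the later cpu:32 entry takes precedence.
export SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE="*:16,cpu:32"
echo "$SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE"   # prints *:16,cpu:32
# ./my_app   (hypothetical SYCL application run with this preference)
```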

If `info::device::max_work_group_size` on a device on which a reduction is being enqueued is less than the value specified by a configuration in this environment variable, the value of `info::device::max_work_group_size` on that device is used instead.

A `sycl::exception` with `sycl::errc::invalid` is thrown during submission of a reduction kernel in the following cases:
- If the specified device type in any configuration is not one of the valid values.
- If the specified preferred work-group size in any configuration is not a valid integer.
- If the specified preferred work-group size in any configuration is not an integer value larger than 0.
- If any configuration does not have the `:` delimiter.

If this environment variable is not set, the preferred work-group size for reductions is implementation defined.

Note that conflicting configuration tuples in the same list will favor the last entry. For example, a list `cpu:32,gpu:32,cpu:16` will set the preferred work-group size of reductions to 32 for GPUs and 16 for CPUs. This also applies to `*`, for example `cpu:32,*:16` sets the preferred work-group size of reductions on all devices to 16, while `*:16,cpu:32` sets the preferred work-group size of reductions to 32 on CPUs and to 16 on all other devices.

## Controlling DPC++ Level Zero Plugin

| Environment variable | Values | Description |
36 changes: 27 additions & 9 deletions sycl/include/sycl/ext/oneapi/reduction.hpp
@@ -149,6 +149,8 @@ __SYCL_EXPORT size_t reduGetMaxWGSize(std::shared_ptr<queue_impl> Queue,
size_t LocalMemBytesPerWorkItem);
__SYCL_EXPORT size_t reduComputeWGSize(size_t NWorkItems, size_t MaxWGSize,
size_t &NWorkGroups);
__SYCL_EXPORT size_t reduGetPreferredWGSize(std::shared_ptr<queue_impl> &Queue,
size_t LocalMemBytesPerWorkItem);

/// Class that is used to represent objects that are passed to user's lambda
/// functions and representing users' reduction variable.
@@ -890,16 +892,28 @@ using __sycl_reduction_kernel =
sycl::detail::auto_name, Namer<KernelName, Ts...>>;

/// Called in device code. This function iterates through the index space
/// \p Range by assigning contiguous chunks to each work-group, then iterating
/// through each chunk using a stride equal to the work-group's local range,
/// which gives much better performance than using a stride equal to 1.
/// For each index the given \p F function/functor is called and
/// the reduction value held in \p Reducer is accumulated in those calls.
template <typename KernelFunc, int Dims, typename ReducerT>
void reductionLoop(const range<Dims> &Range, ReducerT &Reducer,
const nd_item<1> &NdId, KernelFunc &F) {
size_t Start = NdId.get_global_id(0);
size_t End = Range.size();
size_t Stride = NdId.get_global_range(0);
void reductionLoop(const range<Dims> &Range, const size_t PerGroup,
ReducerT &Reducer, const nd_item<1> &NdId, KernelFunc &F) {
// Divide into contiguous chunks and assign each chunk to a Group
// Rely on precomputed division to avoid repeating expensive operations
// TODO: Some devices may prefer alternative remainder handling
auto Group = NdId.get_group();
size_t GroupId = Group.get_group_linear_id();
size_t NumGroups = Group.get_group_linear_range();
bool LastGroup = (GroupId == NumGroups - 1);
size_t GroupStart = GroupId * PerGroup;
size_t GroupEnd = LastGroup ? Range.size() : (GroupStart + PerGroup);

// Loop over the contiguous chunk
size_t Start = GroupStart + NdId.get_local_id(0);
size_t End = GroupEnd;
size_t Stride = NdId.get_local_range(0);
for (size_t I = Start; I < End; I += Stride)
F(sycl::detail::getDelinearizedId(Range, I), Reducer);
}
@@ -919,10 +933,12 @@ bool reduCGFuncForRangeFastAtomics(handler &CGH, KernelType KernelFunc,
auto GroupSum = Reduction::getReadWriteLocalAcc(NElements, CGH);
using Name = __sycl_reduction_kernel<reduction::main_krn::RangeFastAtomics,
KernelName>;
size_t NWorkGroups = NDRange.get_group_range().size();
size_t PerGroup = Range.size() / NWorkGroups;
CGH.parallel_for<Name>(NDRange, [=](nd_item<1> NDId) {
// Call user's functions. Reducer.MValue gets initialized there.
typename Reduction::reducer_type Reducer;
reductionLoop(Range, Reducer, NDId, KernelFunc);
reductionLoop(Range, PerGroup, Reducer, NDId, KernelFunc);

// Work-group cooperates to initialize multiple reduction variables
auto LID = NDId.get_local_id(0);
@@ -987,10 +1003,11 @@ bool reduCGFuncForRangeFastReduce(handler &CGH, KernelType KernelFunc,

using Name =
__sycl_reduction_kernel<reduction::main_krn::RangeFastReduce, KernelName>;
size_t PerGroup = Range.size() / NWorkGroups;
CGH.parallel_for<Name>(NDRange, [=](nd_item<1> NDId) {
// Call user's functions. Reducer.MValue gets initialized there.
typename Reduction::reducer_type Reducer;
reductionLoop(Range, Reducer, NDId, KernelFunc);
reductionLoop(Range, PerGroup, Reducer, NDId, KernelFunc);

typename Reduction::binary_operation BOp;
auto Group = NDId.get_group();
@@ -1081,10 +1098,11 @@ bool reduCGFuncForRangeBasic(handler &CGH, KernelType KernelFunc,
auto BOp = Redu.getBinaryOperation();
using Name =
__sycl_reduction_kernel<reduction::main_krn::RangeBasic, KernelName>;
size_t PerGroup = Range.size() / NWorkGroups;
CGH.parallel_for<Name>(NDRange, [=](nd_item<1> NDId) {
// Call user's functions. Reducer.MValue gets initialized there.
typename Reduction::reducer_type Reducer(Identity, BOp);
reductionLoop(Range, Reducer, NDId, KernelFunc);
reductionLoop(Range, PerGroup, Reducer, NDId, KernelFunc);

// If there are multiple values, reduce each separately
// This prevents local memory from scaling with elements
11 changes: 7 additions & 4 deletions sycl/include/sycl/handler.hpp
@@ -303,6 +303,9 @@ reduGetMaxNumConcurrentWorkGroups(std::shared_ptr<queue_impl> Queue);
__SYCL_EXPORT size_t reduGetMaxWGSize(std::shared_ptr<queue_impl> Queue,
size_t LocalMemBytesPerWorkItem);

__SYCL_EXPORT size_t reduGetPreferredWGSize(std::shared_ptr<queue_impl> &Queue,
size_t LocalMemBytesPerWorkItem);

template <typename... ReductionT, size_t... Is>
size_t reduGetMemPerWorkItem(std::tuple<ReductionT...> &ReduTuple,
std::index_sequence<Is...>);
@@ -1618,13 +1621,13 @@ class __SYCL_EXPORT handler {
#else
ext::oneapi::detail::reduGetMaxNumConcurrentWorkGroups(MQueue);
#endif
// TODO: currently the preferred work group size is determined for the given
// queue/device, while it is safer to use queries to the kernel pre-compiled
// for the device.
size_t PrefWGSize =
ext::oneapi::detail::reduGetPreferredWGSize(MQueue, OneElemSize);
if (ext::oneapi::detail::reduCGFuncForRange<KernelName>(
*this, KernelFunc, Range, PrefWGSize, NumConcurrentWorkGroups,
Contributor:

Final decision is based on both this new env config and NumConcurrentWorkGroups above, as we'll do

bool reduCGFuncForRange(handler &CGH, KernelType KernelFunc,
                        const range<Dims> &Range, size_t MaxWGSize,
                        uint32_t NumConcurrentWorkGroups, Reduction &Redu) {
  size_t NWorkItems = Range.size();
  size_t WGSize = std::min(NWorkItems, MaxWGSize);
  size_t NWorkGroups = NWorkItems / WGSize;
  if (NWorkItems % WGSize)
    NWorkGroups++;
  size_t MaxNWorkGroups = NumConcurrentWorkGroups;
  NWorkGroups = std::min(NWorkGroups, MaxNWorkGroups);

Does it match your intent? Or maybe we should uplift that piece of code from reduction.hpp to here and make the env. config have ultimate control?

Up to you and not a blocker.

Contributor:

I believe that is in line with the expected behavior. The configuration is only intended as a preference, so if the implementation knows better it can make a more educated decision, like here where it will go for a smaller work-group size if there are fewer items than the setting.

Redu)) {
this->finalize();
MLastEvent = withAuxHandler(QueueCopy, [&](handler &CopyHandler) {
1 change: 1 addition & 0 deletions sycl/source/detail/config.def
@@ -38,3 +38,4 @@ CONFIG(INTEL_ENABLE_OFFLOAD_ANNOTATIONS, 1, __SYCL_INTEL_ENABLE_OFFLOAD_ANNOTATI
CONFIG(SYCL_ENABLE_DEFAULT_CONTEXTS, 1, __SYCL_ENABLE_DEFAULT_CONTEXTS)
CONFIG(SYCL_QUEUE_THREAD_POOL_SIZE, 4, __SYCL_QUEUE_THREAD_POOL_SIZE)
CONFIG(SYCL_RT_WARNING_LEVEL, 4, __SYCL_RT_WARNING_LEVEL)
CONFIG(SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE, 16, __SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE)
126 changes: 126 additions & 0 deletions sycl/source/detail/config.hpp
@@ -102,6 +102,11 @@ template <ConfigID Config> class SYCLConfigBase;
#include "config.def"
#undef CONFIG

#define INVALID_CONFIG_EXCEPTION(BASE, MSG) \
sycl::exception(sycl::make_error_code(sycl::errc::invalid), \
"Invalid value for " + std::string{BASE::MConfigName} + \
" environment variable: " + MSG)

template <ConfigID Config> class SYCLConfig {
using BaseT = SYCLConfigBase<Config>;

@@ -467,6 +472,127 @@ template <> class SYCLConfig<SYCL_CACHE_DIR> {
}
};

template <> class SYCLConfig<SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE> {
using BaseT = SYCLConfigBase<SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE>;

struct ParsedValue {
size_t CPU = 0;
size_t GPU = 0;
size_t Accelerator = 0;
};

public:
static size_t get(info::device_type DeviceType) {
ParsedValue Value = getCachedValue();
return getRefByDeviceType(Value, DeviceType);
}

static void reset() { (void)getCachedValue(/*ResetCache=*/true); }

static const char *getName() { return BaseT::MConfigName; }

private:
static size_t &getRefByDeviceType(ParsedValue &Value,
info::device_type DeviceType) {
switch (DeviceType) {
case info::device_type::cpu:
return Value.CPU;
case info::device_type::gpu:
return Value.GPU;
case info::device_type::accelerator:
return Value.Accelerator;
default:
// Expect to get here if user used wrong device type. Include wildcard
// in the message even though it's handled in the caller.
throw INVALID_CONFIG_EXCEPTION(
BaseT, "Device types must be \"cpu\", \"gpu\", \"acc\", or \"*\".");
Contributor:

We aren't handling * here, are we?

Contributor:

We are not, but the message is most relevant for users rather than internal. I.e. if a user gives a wrong device type we want to also include * in the message to make sure they know it's an option. If get uses an invalid value for this then this doesn't help the user much anyway.

I could try and move it out of this function, but it will need a bit of refactoring.

}
}

static ParsedValue parseValue() {
const char *ValueRaw = BaseT::getRawValue();
ParsedValue Result{};

// Default to 0 to signify an unset value.
if (!ValueRaw)
return Result;

std::string ValueStr{ValueRaw};
auto DeviceTypeMap = getSyclDeviceTypeMap();

// Iterate over all configurations.
size_t Start = 0, End = 0;
do {
End = ValueStr.find(',', Start);
if (End == std::string::npos)
End = ValueStr.size();

// Get a substring of the current configuration pair.
std::string DeviceConfigStr = ValueStr.substr(Start, End - Start);

// Find the delimiter in the configuration pair.
size_t ConfigDelimLoc = DeviceConfigStr.find(':');
if (ConfigDelimLoc == std::string::npos)
throw INVALID_CONFIG_EXCEPTION(
BaseT, "Device-value pair \"" + DeviceConfigStr +
"\" does not contain the ':' delimiter.");

// Split configuration pair into its constituents.
std::string DeviceConfigTypeStr =
DeviceConfigStr.substr(0, ConfigDelimLoc);
std::string DeviceConfigValueStr = DeviceConfigStr.substr(
ConfigDelimLoc + 1, DeviceConfigStr.size() - ConfigDelimLoc - 1);

// Find the device type in the "device type map".
auto DeviceTypeIter = std::find_if(
std::begin(DeviceTypeMap), std::end(DeviceTypeMap),
[&](auto Element) { return DeviceConfigTypeStr == Element.first; });
if (DeviceTypeIter == DeviceTypeMap.end())
throw INVALID_CONFIG_EXCEPTION(
BaseT,
"\"" + DeviceConfigTypeStr + "\" is not a recognized device type.");

// Parse the configuration value.
int DeviceConfigValue = 1;
try {
DeviceConfigValue = std::stoi(DeviceConfigValueStr);
} catch (...) {
throw INVALID_CONFIG_EXCEPTION(
BaseT, "Value \"" + DeviceConfigValueStr + "\" must be a number");
}

if (DeviceConfigValue < 1)
throw INVALID_CONFIG_EXCEPTION(BaseT,
"Value \"" + DeviceConfigValueStr +
"\" must be larger than zero");

if (DeviceTypeIter->second == info::device_type::all) {
// Set all configuration values if we got the device-type wildcard.
Result.GPU = DeviceConfigValue;
Result.CPU = DeviceConfigValue;
Result.Accelerator = DeviceConfigValue;
} else {
// Try setting the corresponding configuration.
getRefByDeviceType(Result, DeviceTypeIter->second) = DeviceConfigValue;
}

// Move to the start of the next configuration. If the start is outside
// the full value string we are done.
Start = End + 1;
} while (Start < ValueStr.size());
return Result;
}

static ParsedValue getCachedValue(bool ResetCache = false) {
static ParsedValue Val = parseValue();
if (ResetCache)
Val = parseValue();
return Val;
}
};

#undef INVALID_CONFIG_EXCEPTION

} // namespace detail
} // __SYCL_INLINE_VER_NAMESPACE(_V1)
} // namespace sycl
53 changes: 44 additions & 9 deletions sycl/source/detail/reduction.cpp
@@ -6,6 +6,7 @@
//
//===----------------------------------------------------------------------===//

#include <detail/config.hpp>
#include <detail/queue_impl.hpp>
#include <sycl/ext/oneapi/reduction.hpp>

@@ -67,6 +68,7 @@ reduGetMaxWGSize(std::shared_ptr<sycl::detail::queue_impl> Queue,
size_t LocalMemBytesPerWorkItem) {
device Dev = Queue->get_device();
size_t MaxWGSize = Dev.get_info<info::device::max_work_group_size>();

size_t WGSizePerMem = MaxWGSize * 2;
size_t WGSize = MaxWGSize;
if (LocalMemBytesPerWorkItem != 0) {
@@ -93,21 +95,54 @@ reduGetMaxWGSize(
// the local memory assigned to one work-group by code in another work-group.
// It seems the only good solution for this work-group detection problem is
// kernel precompilation and querying the kernel properties.
if (WGSize >= 4 && WGSizePerMem < MaxWGSize * 2) {
// Let's return a twice smaller number, but... do that only if the kernel
// is limited by memory.
WGSize /= 2;
}

return WGSize;
}

__SYCL_EXPORT size_t reduGetPreferredWGSize(std::shared_ptr<queue_impl> &Queue,
size_t LocalMemBytesPerWorkItem) {
device Dev = Queue->get_device();

// The maximum WGSize returned by CPU devices is very large and does not
// help the reduction implementation: since all work associated with a
// work-group is typically assigned to one CPU thread, selecting a large
// work-group size unnecessarily increases the number of accumulators.
// The default of 16 was chosen based on empirical benchmarking results;
// an environment variable is provided to allow users to override this
// behavior.
using PrefWGConfig = sycl::detail::SYCLConfig<
sycl::detail::SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE>;
if (Dev.is_cpu()) {
size_t CPUMaxWGSize = PrefWGConfig::get(info::device_type::cpu);
if (CPUMaxWGSize == 0)
return 16;
size_t DevMaxWGSize = Dev.get_info<info::device::max_work_group_size>();
return std::min(CPUMaxWGSize, DevMaxWGSize);
}

// If the user has specified an explicit preferred work-group size we use
// that.
if (Dev.is_gpu() && PrefWGConfig::get(info::device_type::gpu)) {
size_t DevMaxWGSize = Dev.get_info<info::device::max_work_group_size>();
return std::min(PrefWGConfig::get(info::device_type::gpu), DevMaxWGSize);
}

if (Dev.is_accelerator() &&
PrefWGConfig::get(info::device_type::accelerator)) {
size_t DevMaxWGSize = Dev.get_info<info::device::max_work_group_size>();
return std::min(PrefWGConfig::get(info::device_type::accelerator),
DevMaxWGSize);
}

// Use the maximum work-group size otherwise.
return reduGetMaxWGSize(Queue, LocalMemBytesPerWorkItem);
}

} // namespace detail
} // namespace oneapi
} // namespace ext
1 change: 1 addition & 0 deletions sycl/test/abi/sycl_symbols_linux.dump
@@ -3662,6 +3662,7 @@ _ZN4sycl3_V13ext6oneapi15filter_selectorC1ERKNSt7__cxx1112basic_stringIcSt11char
_ZN4sycl3_V13ext6oneapi15filter_selectorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
_ZN4sycl3_V13ext6oneapi6detail16reduGetMaxWGSizeESt10shared_ptrINS0_6detail10queue_implEEm
_ZN4sycl3_V13ext6oneapi6detail17reduComputeWGSizeEmmRm
_ZN4sycl3_V13ext6oneapi6detail22reduGetPreferredWGSizeERSt10shared_ptrINS0_6detail10queue_implEEm
_ZN4sycl3_V13ext6oneapi6detail33reduGetMaxNumConcurrentWorkGroupsESt10shared_ptrINS0_6detail10queue_implEE
_ZN4sycl3_V14freeEPvRKNS0_5queueERKNS0_6detail13code_locationE
_ZN4sycl3_V14freeEPvRKNS0_7contextERKNS0_6detail13code_locationE
1 change: 1 addition & 0 deletions sycl/unittests/config/CMakeLists.txt
@@ -2,4 +2,5 @@ set(CMAKE_CXX_EXTENSIONS OFF)

add_sycl_unittest(ConfigTests OBJECT
ConfigTests.cpp
PreferredWGSizeConfigTests.cpp
)