This section shows the use of some SYCL features in order to perform a vector reduce operation.
reduce_naive.cpp
shows a reduction performed without using atomics or
any other SYCL feature. It uses two separate kernels to achieve device
synchronization between reductions step 1 and 2.
reduce_atomic.cpp
shows a reduction operation using local memory and
global memory atomics. Workgroups perform a local reduction using local atomics
and the result is added together using a single global memory
sycl::atomic_ref
.
reduce_group_algorithm.cpp
replaces the local memory atomic add with a call
to sycl::reduce_over_group
, which can be expected to perform a similar operation
to the device function workgroup_reduce_local_mem
as seen in
reduce_naive.cpp
.
reduce_sycl_reduction.cpp
replaces the entire logic of the reduction with a
SYCL reducer object. The reduction is a parameter of the parallel_for
invocation, and there is another reducer object used as a parameter to the
kernel lambda.