In this exercise you will learn how to use the USM API to write a SYCL application which performs a vector add.
Allocate memory on the host for the input and output data just as you did when using the buffer/accessor model.
Create a queue using the USM device selector from exercise 7, and remember to handle errors.
When using the USM model the first thing you need to do is allocate the USM device memory.
To do this, call malloc_device to allocate memory for the two inputs and the output.
Before you can perform any computation on the data you must copy it to the device.
To do this, call the queue member function memcpy for each of the two inputs, and remember to call wait on the event that is returned.
Now you can define the kernel function itself, which is largely the same as in exercise 7.
This can be done differently from the buffer/accessor model, by calling the shortcut member function parallel_for on the queue rather than creating a command group.
Note that as you are accessing a pointer rather than an accessor, you must retrieve an integral index, which can be done by calling the subscript operator on the id passed into the kernel function, with index 0.
Remember to name your kernel function, and to call wait on the event that is returned.
Once the kernel function has completed you can copy the result back to the host.
As you did when copying to the device, you can do this by calling the queue member function memcpy, and again remember to call wait on the event that is returned.
Finally, once you have copied the data back from the device you can free that memory.
To do this, call free on each of the USM device allocations. Note that this is the SYCL API free and not the standard C free.
For DPC++, use CMake to configure and then build the exercise:
mkdir build
cd build
cmake .. "-GUnix Makefiles" -DSYCL_ACADEMY_USE_DPCPP=ON -DSYCL_ACADEMY_ENABLE_SOLUTIONS=OFF -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
make exercise_8
Alternatively, from a terminal at the command line:
icpx -fsycl -o sycl-ex-8 -I../External/Catch2/single_include ../Code_Exercises/Exercise_08_USM_Vector_Add/source.cpp
./sycl-ex-8
On Intel DevCloud, you run computational applications by submitting jobs to a queue for execution on compute nodes. Some features, such as longer walltime and multi-node computation, are only available through the job queue. Please refer to the guide.
Wrap the binary in a script named job_submission and run:
qsub job_submission
For AdaptiveCpp:
# <target specification> is a list of backends and devices to target, for example
# "omp;generic" compiles for CPUs with the OpenMP backend and GPUs using the generic single-pass compiler.
# The simplest target specification is "omp" which compiles for CPUs using the OpenMP backend.
cmake -DSYCL_ACADEMY_USE_ADAPTIVECPP=ON -DSYCL_ACADEMY_INSTALL_ROOT=/insert/path/to/adaptivecpp -DACPP_TARGETS="<target specification>" ..
make exercise_8
Alternatively, without CMake:
cd Code_Exercises/Exercise_08_USM_Vector_Add
/path/to/adaptivecpp/bin/acpp -o sycl-ex-8 -I../../External/Catch2/single_include --acpp-targets="<target specification>" source.cpp
./sycl-ex-8