
Commit b375b04

[SYCL][NFC] Clean formatting in Markdown documents (#1635)

Authored by Alexey Bader

- Limit string length to 80 characters
- Remove trailing spaces

Signed-off-by: Alexey Bader <alexey.bader@intel.com>

1 parent: 291e59f

File tree

8 files changed: +136 / -97 lines

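The two cleanups this commit applies (wrap prose at 80 columns, remove trailing spaces) are easy to check mechanically. The sketch below is hypothetical tooling for illustration, not a script from the repository:

```python
def strip_trailing_spaces(text: str) -> str:
    """Remove trailing whitespace from every line, preserving content."""
    return "\n".join(line.rstrip() for line in text.splitlines()) + "\n"

def overlong_lines(text: str, limit: int = 80):
    """Return (line_number, length) pairs for lines longer than the limit."""
    return [(n, len(line))
            for n, line in enumerate(text.splitlines(), start=1)
            if len(line) > limit]
```

A CI check could reject a Markdown file whenever `overlong_lines` returns a non-empty list or `strip_trailing_spaces` changes the file.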

CONTRIBUTING.md

Lines changed: 6 additions & 5 deletions
````diff
@@ -79,14 +79,16 @@ for more information.
   changes. See [Get Started Guide](sycl/doc/GetStartedGuide.md).
 - Prepare your patch
   - follow [LLVM coding standards](https://llvm.org/docs/CodingStandards.html)
-  - [clang-format](https://clang.llvm.org/docs/ClangFormat.html) and
-    [clang-tidy](https://clang.llvm.org/extra/clang-tidy/) tools can be integrated into your
-    workflow to ensure formatting and stylistic compliance of your changes.
+  - [clang-format](https://clang.llvm.org/docs/ClangFormat.html) and
+    [clang-tidy](https://clang.llvm.org/extra/clang-tidy/) tools can be
+    integrated into your workflow to ensure formatting and stylistic
+    compliance of your changes.
   - use
     ```
     ./clang/tools/clang-format/git-clang-format `git merge-base origin/sycl HEAD`
     ```
-    to check the format of your current changes against the `origin/sycl` branch.
+    to check the format of your current changes against the `origin/sycl`
+    branch.
     - `-f` to also correct unstaged changes
     - `--diff` to only print the diff without applying
 - Build the project and run all tests.
````
```diff
@@ -125,5 +127,4 @@ Project maintainers merge pull requests using one of the following options:
 - [Create a merge commit] Used for LLVM pull-down PRs to preserve hashes of the
   commits pulled from the LLVM community repository
 
-
 *Other names and brands may be claimed as the property of others.
```

sycl/ReleaseNotes.md

Lines changed: 10 additions & 9 deletions
```diff
@@ -929,8 +929,9 @@ Release notes for commit c557eb740d55e828fcf74b28d2b686c928e45318.
 - The problem with calling inlined kernel from multiple TUs is fixed.
 - Fixed compiler warnings for Intel FPGA attributes on host compilation.
 - Fixed bug with passing values of `vec<#, half>` type to the kernel.
-- Fixed buffer constructor which takes host data as shared_ptr. Now it increments
-  shared_ptr reference counter and reuses provided memory if possible.
+- Fixed buffer constructor which takes host data as shared_ptr. Now it
+  increments shared_ptr reference counter and reuses provided memory if
+  possible.
 - Fixed a bug with nd_item.barrier not respecting fence_space flag
 
 ## Prerequisites
```
```diff
@@ -1001,9 +1002,9 @@ Release notes for commit 64c0262c0f0b9e1b7b2e2dcef57542a3fe3bdb97.
 - Fixed code generation for 3-element boolean vectors.
 
 ## Prerequisites
-- Experimental Intel(R) CPU Runtime for OpenCL(TM) Applications with SYCL support is
-  available now and recommended OpenCL CPU RT prerequisite for the SYCL
-  compiler.
+- Experimental Intel(R) CPU Runtime for OpenCL(TM) Applications with SYCL
+  support is available now and recommended OpenCL CPU RT prerequisite for the
+  SYCL compiler.
 - The Intel(R) Graphics Compute Runtime for OpenCL(TM) version 19.25.13237 is
   recommended OpenCL GPU RT prerequisite for the SYCL compiler.
 
```
```diff
@@ -1039,7 +1040,8 @@ d404d1c6767524c21b9c5d05f11b89510abc0ab9.
 - Memory attribute `intelfpga::max_concurrency` was renamed to
   `intelfpga::max_private_copies` to avoid name conflict with fresh added loop
   attribute
-- Added support for const values and local accessors in `handler::set_arg` method.
+- Added support for const values and local accessors in `handler::set_arg`
+  method.
 
 ## Bug Fixes
 - The new scheduler is implemented with the following bug fixes:
```
```diff
@@ -1056,8 +1058,8 @@ d404d1c6767524c21b9c5d05f11b89510abc0ab9.
   specification.
 - Compiling multiple objects when using `-fsycl-link-targets` now creates proper
   final .spv binary.
-- Fixed bug with crash in sampler destructor when sampler object is created using
-  enumerations.
+- Fixed bug with crash in sampler destructor when sampler object is created
+  using enumerations.
 - Fixed `handler::set_arg`, so now it works correctly with kernels created using
   program constructor which takes `cl_program` or `program::build_with_source`.
 - Now `lgamma_r` builtin works correctly when application is built without
```
```diff
@@ -1077,7 +1079,6 @@ d404d1c6767524c21b9c5d05f11b89510abc0ab9.
   OpenCL handles allocated inside SYCL(e.g. `cl_command_queue`) are not
   released.
 
-
 # May'19 release notes
 
 ## New Features
```

sycl/doc/CompilerAndRuntimeDesign.md

Lines changed: 71 additions & 40 deletions
```diff
@@ -102,17 +102,17 @@ pointers to the device memory. As there is no way in OpenCL to pass structures
 with pointers inside as kernel arguments all memory objects shared between host
 and device must be passed to the kernel as raw pointers.
 SYCL also has a special mechanism for passing kernel arguments from host to
-the device. In OpenCL kernel arguments are set by calling `clSetKernelArg` function
-for each kernel argument, meanwhile in SYCL all the kernel arguments are fields of
-"SYCL kernel function" which can be defined as a lambda function or a named function
-object and passed as an argument to SYCL function for invoking kernels (such as
-`parallel_for` or `single_task`). For example, in the previous code snippet above
-`accessor` `A` is one such captured kernel argument.
+the device. In OpenCL kernel arguments are set by calling `clSetKernelArg`
+function for each kernel argument, meanwhile in SYCL all the kernel arguments
+are fields of "SYCL kernel function" which can be defined as a lambda function
+or a named function object and passed as an argument to SYCL function for
+invoking kernels (such as `parallel_for` or `single_task`). For example, in the
+previous code snippet above `accessor` `A` is one such captured kernel argument.
 
 To facilitate the mapping of SYCL kernel data members to OpenCL
-kernel arguments and overcome OpenCL limitations we added the generation of an OpenCL
-kernel function inside the compiler. An OpenCL kernel function contains the
-body of the SYCL kernel function, receives OpenCL-like parameters and
+kernel arguments and overcome OpenCL limitations we added the generation of an
+OpenCL kernel function inside the compiler. An OpenCL kernel function contains
+the body of the SYCL kernel function, receives OpenCL-like parameters and
 additionally does some manipulation to initialize SYCL kernel data members
 with these parameters. In some pseudo code the OpenCL kernel function for the
 previous code snippet above looks like this:
```
````diff
@@ -141,7 +141,8 @@ __kernel KernelName(global int* a) {
 }
 ```
 
-OpenCL kernel function is generated by the compiler inside the Sema using AST nodes.
+OpenCL kernel function is generated by the compiler inside the Sema using AST
+nodes.
 
 ### SYCL support in the driver
 
````
```diff
@@ -215,12 +216,13 @@ option mechanism, similar to OpenMP.
 
 `-Xsycl-target-backend=<triple> "arg1 arg2 ..."`
 
-For example, to support offload to Gen9/vISA3.3, the following options would be used:
+For example, to support offload to Gen9/vISA3.3, the following options would be
+used:
 
 `-fsycl -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xsycl-target-backend "-device skl"`
 
-The driver passes the `-device skl` parameter directly to the Gen device backend compiler
-without parsing it.
+The driver passes the `-device skl` parameter directly to the Gen device backend
+compiler without parsing it.
 
 **TBD:** Having multiple code forms for the same target in the fat binary might
 mean invoking device compiler multiple times. Multiple invocations are not
```
```diff
@@ -361,27 +363,28 @@ is to allow users to save re-compile time when making changes that only affect
 their host code. In the case where device image generation takes a long time
 (e.g. FPGA), this savings can be significant.
 
-For example, if the user separated source code into four files: dev_a.cpp, dev_b.cpp,
-host_a.cpp and host_b.cpp where only dev_a.cpp and dev_b.cpp contain device code,
-they can divide the compilation process into three steps:
+For example, if the user separated source code into four files: dev_a.cpp,
+dev_b.cpp, host_a.cpp and host_b.cpp where only dev_a.cpp and dev_b.cpp contain
+device code, they can divide the compilation process into three steps:
 1. Device link: dev_a.cpp dev_b.cpp -> dev_image.o (contain device image)
 2. Host Compile (c): host_a.cpp -> host_a.o; host_b.cpp -> host_b.o
 3. Linking: dev_image.o host_a.o host_b.o -> executable
 
 Step 1 can take hours for some targets. But if the user wish to recompile after
-modifying only host_a.cpp and host_b.cpp, they can simply run steps 2 and 3 without
-rerunning the expensive step 1.
+modifying only host_a.cpp and host_b.cpp, they can simply run steps 2 and 3
+without rerunning the expensive step 1.
 
-The compiler is responsible for verifying that the user provided all the relevant
-files to the device link step. There are 2 cases that have to be checked:
+The compiler is responsible for verifying that the user provided all the
+relevant files to the device link step. There are 2 cases that have to be
+checked:
 
 1. Missing symbols referenced by the kernels present in the device link step
    (e.g. functions called by or global variables used by the known kernels).
 2. Missing kernels.
 
-Case 1 can be identified in the device binary generation stage (step 1) by scanning
-the known kernels. Case 2 must be verified by the driver by checking for newly
-introduced kernels in the final link stage (step 3).
+Case 1 can be identified in the device binary generation stage (step 1) by
+scanning the known kernels. Case 2 must be verified by the driver by checking
+for newly introduced kernels in the final link stage (step 3).
 
 The llvm-no-spir-kernel tool was introduced to facilitate checking for case 2 in
 the driver. It detects if a module includes kernels and is invoked as follows:
```
````diff
@@ -438,24 +441,40 @@ unit)
 
 #### CUDA support
 
-The driver supports compilation to NVPTX when the `nvptx64-nvidia-cuda-sycldevice` is passed to `-fsycl-targets`.
+The driver supports compilation to NVPTX when the
+`nvptx64-nvidia-cuda-sycldevice` is passed to `-fsycl-targets`.
 
-Unlike other AOT targets, the bitcode module linked from intermediate compiled objects never goes through SPIR-V. Instead it is passed directly in bitcode form down to the NVPTX Back End. All produced bitcode depends on two libraries, `libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc` (built by the libclc project).
+Unlike other AOT targets, the bitcode module linked from intermediate compiled
+objects never goes through SPIR-V. Instead it is passed directly in bitcode form
+down to the NVPTX Back End. All produced bitcode depends on two libraries,
+`libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc`
+(built by the libclc project).
 
-During the device linking step (device linker box in the [Separate Compilation and Linking](#separate-compilation-and-linking) illustration), llvm bitcode objects for the CUDA target are linked together alongside `libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX using the NVPTX backend, and assembled into a cubin using the `ptxas` tool (part of the CUDA SDK). The PTX file and cubin are assembled together using `fatbinary` to produce a CUDA fatbin. The CUDA fatbin is then passed to the offload wrapper tool.
+During the device linking step (device linker box in the
+[Separate Compilation and Linking](#separate-compilation-and-linking)
+illustration), llvm bitcode objects for the CUDA target are linked together
+alongside `libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX
+using the NVPTX backend, and assembled into a cubin using the `ptxas` tool (part
+of the CUDA SDK). The PTX file and cubin are assembled together using
+`fatbinary` to produce a CUDA fatbin. The CUDA fatbin is then passed to the
+offload wrapper tool.
 
 ##### Checking if the compiler is targeting NVPTX
 
-When the SYCL compiler is in device mode and targeting the NVPTX backend, compiler defines the macro `__SYCL_NVPTX__`.
-This macro can safely be used to enable NVPTX specific code path in SYCL kernels.
+When the SYCL compiler is in device mode and targeting the NVPTX backend,
+compiler defines the macro `__SYCL_NVPTX__`.
+This macro can safely be used to enable NVPTX specific code path in SYCL
+kernels.
 
 *Note: this macro is only define during the device compilation phase.*
 
 ##### NVPTX Builtins
 
-When the SYCL compiler is in device mode and targeting the NVPTX backend, the compiler exposes NVPTX builtins supported by clang.
+When the SYCL compiler is in device mode and targeting the NVPTX backend, the
+compiler exposes NVPTX builtins supported by clang.
 
-*Note: this enable NVPTX specific features which cannot be supported by other targets or the host.*
+*Note: this enable NVPTX specific features which cannot be supported by other
+targets or the host.*
 
 Example:
 ```cpp
````
````diff
@@ -472,16 +491,24 @@ double my_min(double x, double y) {
 
 ##### Local memory support
 
-In CUDA, users can only allocate one chunk of host allocated shared memory (which maps to SYCL's local accessors).
-This chunk of memory is allocated as an array `extern __shared__ <type> <name>[];` which LLVM represents as an external global symbol to the CUDA shared memory address space.
-The NVPTX backend then lowers this into a `.extern .shared .align 4 .b8` PTX instruction.
+In CUDA, users can only allocate one chunk of host allocated shared memory
+(which maps to SYCL's local accessors). This chunk of memory is allocated as an
+array `extern __shared__ <type> <name>[];` which LLVM represents as an external
+global symbol to the CUDA shared memory address space. The NVPTX backend then
+lowers this into a `.extern .shared .align 4 .b8` PTX instruction.
 
-In SYCL, users can allocate multiple local accessors and pass them as kernel parameters. When the SYCL frontend lowers the SYCL kernel invocation into an OpenCL compliant kernel entry, it lowers local accessors into a pointer to OpenCL local memory (CUDA shared memory) but this is not legal for CUDA kernels.
+In SYCL, users can allocate multiple local accessors and pass them as kernel
+parameters. When the SYCL frontend lowers the SYCL kernel invocation into an
+OpenCL compliant kernel entry, it lowers local accessors into a pointer to
+OpenCL local memory (CUDA shared memory) but this is not legal for CUDA kernels.
 
-To legalize the SYCL lowering for CUDA, a SYCL for CUDA specific pass will do the following:
+To legalize the SYCL lowering for CUDA, a SYCL for CUDA specific pass will do
+the following:
 - Create a global symbol to the CUDA shared memory address space
-- Transform all pointers to CUDA shared memory into a 32 bit integer representing the offset in bytes to use with the global symbol
-- Replace all uses of the transformed pointers by the address to global symbol offset by the value of the integer passed as parameter
+- Transform all pointers to CUDA shared memory into a 32 bit integer
+  representing the offset in bytes to use with the global symbol
+- Replace all uses of the transformed pointers by the address to global symbol
+  offset by the value of the integer passed as parameter
 
 As an example, the following kernel:
 ```
````
````diff
@@ -490,6 +517,7 @@ define void @SYCL_generated_kernel(i64 addrspace(3)* nocapture %local_ptr, i32 %
   %1 = load i64, i64 addrspace(3)* %local_ptr2
 }
 ```
+
 Is transformed into this kernel when targeting CUDA:
 ```
 @SYCL_generated_kernel.shared_mem = external dso_local local_unnamed_addr addrspace(3) global [0 x i8], align 4
````
````diff
@@ -502,7 +530,10 @@ define void @SYCL_generated_kernel(i32 %local_ptr_offset, i32 %arg, i32 %local_p
 }
 ```
 
-On the runtime side, when setting local memory arguments, the CUDA PI implementation will internally set the argument as the offset with respect to the accumulated size of used local memory. This approach preserves the exisiting PI interface.
+On the runtime side, when setting local memory arguments, the CUDA PI
+implementation will internally set the argument as the offset with respect to
+the accumulated size of used local memory. This approach preserves the exisiting
+PI interface.
 
 ### Integration with SPIR-V format
 
````
````diff
@@ -537,8 +568,8 @@ Translation from LLVM IR to SPIR-V for special types is also supported, but
 such LLVM IR must comply to some special requirements. Unfortunately there is
 no canonical form of special built-in types and operations in LLVM IR, moreover
 we can't re-use existing representation generated by OpenCL C front-end
-compiler. For instance here is how `OpGroupAsyncCopy` operation looks in LLVM IR
-produced by OpenCL C front-end compiler.
+compiler. For instance here is how `OpGroupAsyncCopy` operation looks in LLVM
+IR produced by OpenCL C front-end compiler.
 
 ```LLVM
 @_Z21async_work_group_copyPU3AS3fPU3AS1Kfjj(float addrspace(3)*, float addrspace(1)*, i32, i32)
````
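The local-memory legalization described in the hunks above replaces each local-accessor pointer with a byte offset into one shared-memory symbol, and the runtime "sets the argument as the offset with respect to the accumulated size of used local memory". That bookkeeping can be modeled in a few lines. This is an illustrative sketch only; the function name and the alignment policy are assumptions, not the actual CUDA PI plugin code:

```python
def pack_local_args(sizes, align=4):
    """Assign each local-memory argument a byte offset into a single shared
    region, accumulating sizes and rounding each offset up to `align`."""
    offsets, total = [], 0
    for size in sizes:
        # round the running total up to the required alignment
        total = (total + align - 1) // align * align
        offsets.append(total)
        total += size
    return offsets, total
```

Each returned offset stands in for the pointer argument the kernel would otherwise receive, and the final total models the size of the single dynamic shared-memory allocation for the launch.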

sycl/doc/GetStartedGuide.md

Lines changed: 9 additions & 9 deletions
```diff
@@ -130,8 +130,8 @@ To enable support for CUDA devices, follow the instructions for the Linux
 DPC++ toolchain, but add the `--cuda` flag to `configure.py`
 
 Enabling this flag requires an installation of
-[CUDA 10.1](https://developer.nvidia.com/cuda-10.1-download-archive-update2) on the system,
-refer to
+[CUDA 10.1](https://developer.nvidia.com/cuda-10.1-download-archive-update2) on
+the system, refer to
 [NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html).
 
 Currently, the only combination tested is Ubuntu 18.04 with CUDA 10.2 using
```
```diff
@@ -145,17 +145,18 @@ above.
 The DPC++ toolchain support on CUDA platforms is still in an experimental phase.
 Currently, the DPC++ toolchain relies on having a recent OpenCL implementation
 on the system in order to link applications to the DPC++ runtime.
-The OpenCL implementation is not used at runtime if only the CUDA backend is 
+The OpenCL implementation is not used at runtime if only the CUDA backend is
 used in the application, but must be installed.
 
 The OpenCL implementation provided by the CUDA SDK is OpenCL 1.2, which is
 too old to link with the DPC++ runtime and lacks some symbols.
 
-We recommend installing the low level CPU runtime, following the instructions 
+We recommend installing the low level CPU runtime, following the instructions
 in the next section.
 
-Instead of installing the low level CPU runtime, it is possible to build and 
-install the [Khronos ICD loader](https://github.com/KhronosGroup/OpenCL-ICD-Loader),
+Instead of installing the low level CPU runtime, it is possible to build and
+install the
+[Khronos ICD loader](https://github.com/KhronosGroup/OpenCL-ICD-Loader),
 which contains all the symbols required.
 
 ### Install low level runtime
```
```diff
@@ -276,7 +277,7 @@ python %DPCPP_HOME%\llvm\buildbot\check.py
 If no OpenCL GPU/CPU runtimes are available, the corresponding tests are
 skipped.
 
-If CUDA support has been built, it is tested only if there are CUDA devices 
+If CUDA support has been built, it is tested only if there are CUDA devices
 available.
 
 #### Run Khronos\* SYCL\* conformance test suite (optional)
```
```diff
@@ -411,7 +412,7 @@ clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice \
 This `simple-sycl-app.exe` application doesn't specify SYCL device for
 execution, so SYCL runtime will use `default_selector` logic to select one
 of accelerators available in the system or SYCL host device.
-In this case, the behaviour of the `default_selector` can be altered 
+In this case, the behaviour of the `default_selector` can be altered
 using the `SYCL_BE` environment variable, setting `PI_CUDA` forces
 the usage of the CUDA backend (if available), `PI_OPENCL` will
 force the usage of the OpenCL backend.
```
```diff
@@ -543,5 +544,4 @@ class CUDASelector : public cl::sycl::device_selector {
 - SYCL\* 1.2.1 specification:
   [www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf](https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf)
 
-
 \*Other names and brands may be claimed as the property of others.
```
