[SYCL][NFC] Clean formatting in Markdown documents (#1635)
- Limit string length to 80 characters
- Remove trailing spaces

Signed-off-by: Alexey Bader <alexey.bader@intel.com>
bader authored May 5, 2020
1 parent 291e59f commit b375b04
Showing 8 changed files with 136 additions and 97 deletions.
11 changes: 6 additions & 5 deletions CONTRIBUTING.md
@@ -79,14 +79,16 @@ for more information.
changes. See [Get Started Guide](sycl/doc/GetStartedGuide.md).
- Prepare your patch
- follow [LLVM coding standards](https://llvm.org/docs/CodingStandards.html)
- [clang-format](https://clang.llvm.org/docs/ClangFormat.html) and
[clang-tidy](https://clang.llvm.org/extra/clang-tidy/) tools can be integrated into your
workflow to ensure formatting and stylistic compliance of your changes.
- [clang-format](https://clang.llvm.org/docs/ClangFormat.html) and
[clang-tidy](https://clang.llvm.org/extra/clang-tidy/) tools can be
integrated into your workflow to ensure formatting and stylistic
compliance of your changes.
- use
```
./clang/tools/clang-format/git-clang-format `git merge-base origin/sycl HEAD`
```
to check the format of your current changes against the `origin/sycl` branch.
to check the format of your current changes against the `origin/sycl`
branch.
- `-f` to also correct unstaged changes
- `--diff` to only print the diff without applying
- Build the project and run all tests.
@@ -125,5 +127,4 @@ Project maintainers merge pull requests using one of the following options:
- [Create a merge commit] Used for LLVM pull-down PRs to preserve hashes of the
commits pulled from the LLVM community repository
*Other names and brands may be claimed as the property of others.
19 changes: 10 additions & 9 deletions sycl/ReleaseNotes.md
@@ -929,8 +929,9 @@ Release notes for commit c557eb740d55e828fcf74b28d2b686c928e45318.
- The problem with calling an inlined kernel from multiple TUs is fixed.
- Fixed compiler warnings for Intel FPGA attributes on host compilation.
- Fixed bug with passing values of `vec<#, half>` type to the kernel.
- Fixed buffer constructor which takes host data as shared_ptr. Now it increments
shared_ptr reference counter and reuses provided memory if possible.
- Fixed the buffer constructor which takes host data as `shared_ptr`. Now it
  increments the `shared_ptr` reference counter and reuses the provided memory
  if possible.
- Fixed a bug with `nd_item.barrier` not respecting the `fence_space` flag.

## Prerequisites
@@ -1001,9 +1002,9 @@ Release notes for commit 64c0262c0f0b9e1b7b2e2dcef57542a3fe3bdb97.
- Fixed code generation for 3-element boolean vectors.

## Prerequisites
- Experimental Intel(R) CPU Runtime for OpenCL(TM) Applications with SYCL support is
available now and recommended OpenCL CPU RT prerequisite for the SYCL
compiler.
- Experimental Intel(R) CPU Runtime for OpenCL(TM) Applications with SYCL
  support is now available and is the recommended OpenCL CPU RT prerequisite
  for the SYCL compiler.
- The Intel(R) Graphics Compute Runtime for OpenCL(TM) version 19.25.13237 is
  the recommended OpenCL GPU RT prerequisite for the SYCL compiler.

@@ -1039,7 +1040,8 @@ d404d1c6767524c21b9c5d05f11b89510abc0ab9.
- Memory attribute `intelfpga::max_concurrency` was renamed to
  `intelfpga::max_private_copies` to avoid a name conflict with the freshly
  added loop attribute
- Added support for const values and local accessors in `handler::set_arg` method.
- Added support for const values and local accessors in `handler::set_arg`
method.

## Bug Fixes
- The new scheduler is implemented with the following bug fixes:
@@ -1056,8 +1058,8 @@ d404d1c6767524c21b9c5d05f11b89510abc0ab9.
specification.
- Compiling multiple objects when using `-fsycl-link-targets` now creates a
  proper final .spv binary.
- Fixed bug with crash in sampler destructor when sampler object is created using
enumerations.
- Fixed a bug with a crash in the sampler destructor when the sampler object
  is created using enumerations.
- Fixed `handler::set_arg`, so now it works correctly with kernels created
  using the program constructor which takes `cl_program` or
  `program::build_with_source`.
- Now the `lgamma_r` builtin works correctly when the application is built without
@@ -1077,7 +1079,6 @@ d404d1c6767524c21b9c5d05f11b89510abc0ab9.
OpenCL handles allocated inside SYCL (e.g. `cl_command_queue`) are not
released.


# May'19 release notes

## New Features
111 changes: 71 additions & 40 deletions sycl/doc/CompilerAndRuntimeDesign.md
@@ -102,17 +102,17 @@ pointers to the device memory. As there is no way in OpenCL to pass structures
with pointers inside as kernel arguments, all memory objects shared between host
and device must be passed to the kernel as raw pointers.
SYCL also has a special mechanism for passing kernel arguments from host to
the device. In OpenCL kernel arguments are set by calling `clSetKernelArg` function
for each kernel argument, meanwhile in SYCL all the kernel arguments are fields of
"SYCL kernel function" which can be defined as a lambda function or a named function
object and passed as an argument to SYCL function for invoking kernels (such as
`parallel_for` or `single_task`). For example, in the previous code snippet above
`accessor` `A` is one such captured kernel argument.
the device. In OpenCL, kernel arguments are set by calling the
`clSetKernelArg` function for each kernel argument, whereas in SYCL all the
kernel arguments are fields of the "SYCL kernel function", which can be defined
as a lambda function or a named function object and passed as an argument to a
SYCL function for invoking kernels (such as `parallel_for` or `single_task`).
For example, in the code snippet above, `accessor` `A` is one such captured
kernel argument.
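
For illustration, here is a minimal sketch of such a lambda capture. The names
(`BufA`, `FillKernel`) are hypothetical and this is not the snippet the text
refers to:

```cpp
#include <CL/sycl.hpp>

int main() {
  constexpr size_t N = 16;
  cl::sycl::queue Queue;
  cl::sycl::buffer<int, 1> BufA{cl::sycl::range<1>{N}};
  Queue.submit([&](cl::sycl::handler &CGH) {
    // The lambda below is the "SYCL kernel function"; the captured accessor A
    // becomes a kernel argument handled by the runtime, with no explicit
    // clSetKernelArg call in user code.
    auto A = BufA.get_access<cl::sycl::access::mode::write>(CGH);
    CGH.parallel_for<class FillKernel>(
        cl::sycl::range<1>{N},
        [=](cl::sycl::id<1> I) { A[I] = static_cast<int>(I[0]); });
  });
  return 0;
}
```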
To facilitate the mapping of SYCL kernel data members to OpenCL
kernel arguments and overcome OpenCL limitations we added the generation of an OpenCL
kernel function inside the compiler. An OpenCL kernel function contains the
body of the SYCL kernel function, receives OpenCL-like parameters and
kernel arguments and overcome OpenCL limitations we added the generation of an
OpenCL kernel function inside the compiler. An OpenCL kernel function contains
the body of the SYCL kernel function, receives OpenCL-like parameters and
additionally does some manipulation to initialize SYCL kernel data members
with these parameters. In some pseudo code the OpenCL kernel function for the
previous code snippet above looks like this:
@@ -141,7 +141,8 @@ __kernel KernelName(global int* a) {
```

OpenCL kernel function is generated by the compiler inside the Sema using AST nodes.
The OpenCL kernel function is generated by the compiler inside Sema using AST
nodes.

### SYCL support in the driver

@@ -215,12 +216,13 @@ option mechanism, similar to OpenMP.

`-Xsycl-target-backend=<triple> "arg1 arg2 ..."`

For example, to support offload to Gen9/vISA3.3, the following options would be used:
For example, to support offload to Gen9/vISA3.3, the following options would be
used:

`-fsycl -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xsycl-target-backend "-device skl"`

The driver passes the `-device skl` parameter directly to the Gen device backend compiler
without parsing it.
The driver passes the `-device skl` parameter directly to the Gen device backend
compiler without parsing it.

**TBD:** Having multiple code forms for the same target in the fat binary might
mean invoking device compiler multiple times. Multiple invocations are not
@@ -361,27 +363,28 @@ is to allow users to save re-compile time when making changes that only affect
their host code. In the case where device image generation takes a long time
(e.g. FPGA), these savings can be significant.
For example, if the user separated source code into four files: dev_a.cpp, dev_b.cpp,
host_a.cpp and host_b.cpp where only dev_a.cpp and dev_b.cpp contain device code,
they can divide the compilation process into three steps:
For example, if the user separated source code into four files: dev_a.cpp,
dev_b.cpp, host_a.cpp and host_b.cpp where only dev_a.cpp and dev_b.cpp contain
device code, they can divide the compilation process into three steps:
1. Device link: dev_a.cpp dev_b.cpp -> dev_image.o (contains the device image)
2. Host Compile (c): host_a.cpp -> host_a.o; host_b.cpp -> host_b.o
3. Linking: dev_image.o host_a.o host_b.o -> executable
Step 1 can take hours for some targets. But if the user wishes to recompile after
modifying only host_a.cpp and host_b.cpp, they can simply run steps 2 and 3 without
rerunning the expensive step 1.
modifying only host_a.cpp and host_b.cpp, they can simply run steps 2 and 3
without rerunning the expensive step 1.
The compiler is responsible for verifying that the user provided all the relevant
files to the device link step. There are 2 cases that have to be checked:
The compiler is responsible for verifying that the user provided all the
relevant files to the device link step. There are 2 cases that have to be
checked:
1. Missing symbols referenced by the kernels present in the device link step
(e.g. functions called by or global variables used by the known kernels).
2. Missing kernels.
Case 1 can be identified in the device binary generation stage (step 1) by scanning
the known kernels. Case 2 must be verified by the driver by checking for newly
introduced kernels in the final link stage (step 3).
Case 1 can be identified in the device binary generation stage (step 1) by
scanning the known kernels. Case 2 must be verified by the driver by checking
for newly introduced kernels in the final link stage (step 3).
The llvm-no-spir-kernel tool was introduced to facilitate checking for case 2 in
the driver. It detects if a module includes kernels and is invoked as follows:
@@ -438,24 +441,40 @@ unit)
#### CUDA support
The driver supports compilation to NVPTX when the `nvptx64-nvidia-cuda-sycldevice` is passed to `-fsycl-targets`.
The driver supports compilation to NVPTX when the
`nvptx64-nvidia-cuda-sycldevice` is passed to `-fsycl-targets`.
Unlike other AOT targets, the bitcode module linked from intermediate compiled objects never goes through SPIR-V. Instead it is passed directly in bitcode form down to the NVPTX Back End. All produced bitcode depends on two libraries, `libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc` (built by the libclc project).
Unlike other AOT targets, the bitcode module linked from intermediate compiled
objects never goes through SPIR-V. Instead it is passed directly in bitcode form
down to the NVPTX Back End. All produced bitcode depends on two libraries,
`libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc`
(built by the libclc project).
During the device linking step (device linker box in the [Separate Compilation and Linking](#separate-compilation-and-linking) illustration), llvm bitcode objects for the CUDA target are linked together alongside `libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX using the NVPTX backend, and assembled into a cubin using the `ptxas` tool (part of the CUDA SDK). The PTX file and cubin are assembled together using `fatbinary` to produce a CUDA fatbin. The CUDA fatbin is then passed to the offload wrapper tool.
During the device linking step (device linker box in the
[Separate Compilation and Linking](#separate-compilation-and-linking)
illustration), llvm bitcode objects for the CUDA target are linked together
alongside `libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX
using the NVPTX backend, and assembled into a cubin using the `ptxas` tool (part
of the CUDA SDK). The PTX file and cubin are assembled together using
`fatbinary` to produce a CUDA fatbin. The CUDA fatbin is then passed to the
offload wrapper tool.
##### Checking if the compiler is targeting NVPTX
When the SYCL compiler is in device mode and targeting the NVPTX backend, compiler defines the macro `__SYCL_NVPTX__`.
This macro can safely be used to enable NVPTX specific code path in SYCL kernels.
When the SYCL compiler is in device mode and targeting the NVPTX backend, the
compiler defines the macro `__SYCL_NVPTX__`.
This macro can safely be used to enable NVPTX-specific code paths in SYCL
kernels.
*Note: this macro is only defined during the device compilation phase.*
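
As a minimal sketch (the function name and values are illustrative, not taken
from the sources), the macro can guard an NVPTX-only code path like this:

```cpp
// Select an NVPTX-only code path at compile time. The macro is defined only
// while compiling device code for the NVPTX backend.
inline int warp_size() {
#ifdef __SYCL_NVPTX__
  return 32; // native warp width on NVIDIA hardware
#else
  return 1;  // conservative fallback for other targets and the host
#endif
}
```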
##### NVPTX Builtins
When the SYCL compiler is in device mode and targeting the NVPTX backend, the compiler exposes NVPTX builtins supported by clang.
When the SYCL compiler is in device mode and targeting the NVPTX backend, the
compiler exposes NVPTX builtins supported by clang.
*Note: this enable NVPTX specific features which cannot be supported by other targets or the host.*
*Note: this enables NVPTX-specific features which cannot be supported by other
targets or the host.*
Example:
```cpp
@@ -472,16 +491,24 @@ double my_min(double x, double y) {
```

##### Local memory support

In CUDA, users can only allocate one chunk of host allocated shared memory (which maps to SYCL's local accessors).
This chunk of memory is allocated as an array `extern __shared__ <type> <name>[];` which LLVM represents as an external global symbol to the CUDA shared memory address space.
The NVPTX backend then lowers this into a `.extern .shared .align 4 .b8` PTX instruction.
In CUDA, users can only allocate one chunk of host allocated shared memory
(which maps to SYCL's local accessors). This chunk of memory is allocated as an
array `extern __shared__ <type> <name>[];` which LLVM represents as an external
global symbol to the CUDA shared memory address space. The NVPTX backend then
lowers this into a `.extern .shared .align 4 .b8` PTX instruction.

In SYCL, users can allocate multiple local accessors and pass them as kernel parameters. When the SYCL frontend lowers the SYCL kernel invocation into an OpenCL compliant kernel entry, it lowers local accessors into a pointer to OpenCL local memory (CUDA shared memory) but this is not legal for CUDA kernels.
In SYCL, users can allocate multiple local accessors and pass them as kernel
parameters. When the SYCL frontend lowers the SYCL kernel invocation into an
OpenCL compliant kernel entry, it lowers local accessors into a pointer to
OpenCL local memory (CUDA shared memory) but this is not legal for CUDA kernels.
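
For illustration, a minimal SYCL 1.2.1 sketch (names are hypothetical) of a
kernel taking two local accessors:

```cpp
#include <CL/sycl.hpp>

void run(cl::sycl::queue &Queue) {
  Queue.submit([&](cl::sycl::handler &CGH) {
    // Two distinct local-memory allocations for one kernel -- legal in SYCL,
    // but with no direct equivalent to CUDA's single extern __shared__ array,
    // hence the legalization pass described below.
    cl::sycl::accessor<float, 1, cl::sycl::access::mode::read_write,
                       cl::sycl::access::target::local>
        ScratchA{cl::sycl::range<1>{64}, CGH};
    cl::sycl::accessor<float, 1, cl::sycl::access::mode::read_write,
                       cl::sycl::access::target::local>
        ScratchB{cl::sycl::range<1>{32}, CGH};
    CGH.parallel_for<class TwoLocals>(
        cl::sycl::nd_range<1>{cl::sycl::range<1>{64}, cl::sycl::range<1>{64}},
        [=](cl::sycl::nd_item<1> Item) {
          size_t L = Item.get_local_id(0);
          ScratchA[L] = static_cast<float>(L);
          Item.barrier(cl::sycl::access::fence_space::local_space);
          if (L < 32)
            ScratchB[L] = ScratchA[2 * L];
        });
  });
}
```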

To legalize the SYCL lowering for CUDA, a SYCL for CUDA specific pass will do the following:
To legalize the SYCL lowering for CUDA, a SYCL for CUDA specific pass will do
the following:
- Create a global symbol to the CUDA shared memory address space
- Transform all pointers to CUDA shared memory into a 32 bit integer representing the offset in bytes to use with the global symbol
- Replace all uses of the transformed pointers by the address to global symbol offset by the value of the integer passed as parameter
- Transform all pointers to CUDA shared memory into a 32-bit integer
  representing the offset in bytes to use with the global symbol
- Replace all uses of the transformed pointers with the address of the global
  symbol offset by the value of the integer passed as a parameter

As an example, the following kernel:
```
@@ -490,6 +517,7 @@ define void @SYCL_generated_kernel(i64 addrspace(3)* nocapture %local_ptr, i32 %
%1 = load i64, i64 addrspace(3)* %local_ptr2
}
```

Is transformed into this kernel when targeting CUDA:
```
@SYCL_generated_kernel.shared_mem = external dso_local local_unnamed_addr addrspace(3) global [0 x i8], align 4
@@ -502,7 +530,10 @@ define void @SYCL_generated_kernel(i32 %local_ptr_offset, i32 %arg, i32 %local_p
}
```

On the runtime side, when setting local memory arguments, the CUDA PI implementation will internally set the argument as the offset with respect to the accumulated size of used local memory. This approach preserves the exisiting PI interface.
On the runtime side, when setting local memory arguments, the CUDA PI
implementation will internally set the argument as the offset with respect to
the accumulated size of used local memory. This approach preserves the existing
PI interface.
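
A hedged sketch of that bookkeeping (not the actual PI code; the names are
invented): each local argument is assigned the size accumulated so far, and
that value is what the kernel receives as its i32 offset.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative offset tracker: models passing byte offsets into one
// shared-memory block instead of separate local pointers.
struct LocalArgOffsets {
  size_t TotalBytes = 0;         // accumulated local memory so far
  std::vector<uint32_t> Offsets; // one entry per local argument

  uint32_t add(size_t Bytes, size_t Align = 4) {
    TotalBytes = (TotalBytes + Align - 1) & ~(Align - 1); // align start
    Offsets.push_back(static_cast<uint32_t>(TotalBytes));
    TotalBytes += Bytes;
    return Offsets.back(); // value set as the kernel's i32 argument
  }
};
```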

### Integration with SPIR-V format

@@ -537,8 +568,8 @@ Translation from LLVM IR to SPIR-V for special types is also supported, but
such LLVM IR must comply with some special requirements. Unfortunately there is
no canonical form of special built-in types and operations in LLVM IR; moreover,
we can't re-use the existing representation generated by the OpenCL C front-end
compiler. For instance here is how `OpGroupAsyncCopy` operation looks in LLVM IR
produced by OpenCL C front-end compiler.
compiler. For instance, here is how the `OpGroupAsyncCopy` operation looks in
LLVM IR produced by the OpenCL C front-end compiler.

```LLVM
@_Z21async_work_group_copyPU3AS3fPU3AS1Kfjj(float addrspace(3)*, float addrspace(1)*, i32, i32)
```
18 changes: 9 additions & 9 deletions sycl/doc/GetStartedGuide.md
@@ -130,8 +130,8 @@ To enable support for CUDA devices, follow the instructions for the Linux
DPC++ toolchain, but add the `--cuda` flag to `configure.py`

Enabling this flag requires an installation of
[CUDA 10.1](https://developer.nvidia.com/cuda-10.1-download-archive-update2) on the system,
refer to
[CUDA 10.1](https://developer.nvidia.com/cuda-10.1-download-archive-update2) on
the system, refer to
[NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html).

Currently, the only combination tested is Ubuntu 18.04 with CUDA 10.2 using
@@ -145,17 +145,18 @@ above.
The DPC++ toolchain support on CUDA platforms is still in an experimental phase.
Currently, the DPC++ toolchain relies on having a recent OpenCL implementation
on the system in order to link applications to the DPC++ runtime.
The OpenCL implementation is not used at runtime if only the CUDA backend is
used in the application, but must be installed.

The OpenCL implementation provided by the CUDA SDK is OpenCL 1.2, which is
too old to link with the DPC++ runtime and lacks some symbols.

We recommend installing the low level CPU runtime, following the instructions
in the next section.

Instead of installing the low level CPU runtime, it is possible to build and
install the [Khronos ICD loader](https://github.com/KhronosGroup/OpenCL-ICD-Loader),
Instead of installing the low level CPU runtime, it is possible to build and
install the
[Khronos ICD loader](https://github.com/KhronosGroup/OpenCL-ICD-Loader),
which contains all the symbols required.

### Install low level runtime
@@ -276,7 +277,7 @@ python %DPCPP_HOME%\llvm\buildbot\check.py
If no OpenCL GPU/CPU runtimes are available, the corresponding tests are
skipped.

If CUDA support has been built, it is tested only if there are CUDA devices
available.

#### Run Khronos\* SYCL\* conformance test suite (optional)
@@ -411,7 +412,7 @@ clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice \
This `simple-sycl-app.exe` application doesn't specify a SYCL device for
execution, so the SYCL runtime will use `default_selector` logic to select one
of the accelerators available in the system or the SYCL host device.
In this case, the behaviour of the `default_selector` can be altered
using the `SYCL_BE` environment variable: setting `PI_CUDA` forces
the usage of the CUDA backend (if available), while `PI_OPENCL` will
force the usage of the OpenCL backend.
@@ -543,5 +544,4 @@ class CUDASelector : public cl::sycl::device_selector {
- SYCL\* 1.2.1 specification:
[www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf](https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf)


\*Other names and brands may be claimed as the property of others.
