[SYCL][DOC][CUDA][HIP] Update getStartedGuide.md #10669

Merged: 12 commits, Aug 28, 2023
Updated nvidia and hip docs in getStartedGuide.md
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
JackAKirk committed Aug 2, 2023
commit 17633f48094bd1254f77540c845685e320492a84
134 changes: 58 additions & 76 deletions sycl/doc/GetStartedGuide.md
@@ -5,36 +5,38 @@ and a wide range of compute accelerators such as GPU and FPGA.

## Table of contents

* [Prerequisites](#prerequisites)
* [Create DPC++ workspace](#create-dpc-workspace)
* [Build DPC++ toolchain](#build-dpc-toolchain)
* [Build DPC++ toolchain with libc++ library](#build-dpc-toolchain-with-libc-library)
* [Build DPC++ toolchain with support for NVIDIA CUDA](#build-dpc-toolchain-with-support-for-nvidia-cuda)
* [Build DPC++ toolchain with support for HIP AMD](#build-dpc-toolchain-with-support-for-hip-amd)
* [Build DPC++ toolchain with support for HIP NVIDIA](#build-dpc-toolchain-with-support-for-hip-nvidia)
* [Build DPC++ toolchain with support for ESIMD CPU Emulation](#build-dpc-toolchain-with-support-for-esimd-cpu-emulation)
* [Build DPC++ toolchain with support for runtime kernel fusion](#build-dpc-toolchain-with-support-for-runtime-kernel-fusion)
* [Build Doxygen documentation](#build-doxygen-documentation)
* [Deployment](#deployment)
* [Use DPC++ toolchain](#use-dpc-toolchain)
* [Install low level runtime](#install-low-level-runtime)
* [Obtain prerequisites for ahead of time (AOT) compilation](#obtain-prerequisites-for-ahead-of-time-aot-compilation)
* [GPU](#gpu)
* [CPU](#cpu)
* [Accelerator](#accelerator)
* [Test DPC++ toolchain](#test-dpc-toolchain)
* [Run in-tree LIT tests](#run-in-tree-lit-tests)
* [Run DPC++ E2E test suite](#run-dpc-e2e-test-suite)
* [Run Khronos\* SYCL\* conformance test suite (optional)](#run-khronos-sycl-conformance-test-suite-optional)
* [Run simple DPC++ application](#run-simple-dpc-application)
* [Build DPC++ application with CMake](#build-dpc-application-with-cmake)
* [Code the program for a specific GPU](#code-the-program-for-a-specific-gpu)
* [Using the DPC++ toolchain on CUDA platforms](#using-the-dpc-toolchain-on-cuda-platforms)
* [C++ standard](#c-standard)
* [Known Issues and Limitations](#known-issues-and-limitations)
* [CUDA back-end limitations](#cuda-back-end-limitations)
* [HIP back-end limitations](#hip-back-end-limitations)
* [Find More](#find-more)
- [Getting Started with oneAPI DPC++](#getting-started-with-oneapi-dpc)
- [Table of contents](#table-of-contents)
Contributor

I'm not sure if these two items are useful. I guess we intentionally skipped them.

What is the reason for replacing * with -?

Contributor Author

Makes sense, I've removed them.

I didn't notice the * -> - change; I think the linter did this. I've changed it back.

- [Prerequisites](#prerequisites)
- [Create DPC++ workspace](#create-dpc-workspace)
- [Build DPC++ toolchain](#build-dpc-toolchain)
- [Build DPC++ toolchain with libc++ library](#build-dpc-toolchain-with-libc-library)
- [Build DPC++ toolchain with support for NVIDIA CUDA](#build-dpc-toolchain-with-support-for-nvidia-cuda)
- [Build DPC++ toolchain with support for HIP AMD](#build-dpc-toolchain-with-support-for-hip-amd)
- [Build DPC++ toolchain with support for HIP NVIDIA](#build-dpc-toolchain-with-support-for-hip-nvidia)
- [Build DPC++ toolchain with support for ESIMD CPU Emulation](#build-dpc-toolchain-with-support-for-esimd-cpu-emulation)
- [Build DPC++ toolchain with support for runtime kernel fusion](#build-dpc-toolchain-with-support-for-runtime-kernel-fusion)
- [Build Doxygen documentation](#build-doxygen-documentation)
- [Deployment](#deployment)
- [Use DPC++ toolchain](#use-dpc-toolchain)
- [Install low level runtime](#install-low-level-runtime)
- [Obtain prerequisites for ahead of time (AOT) compilation](#obtain-prerequisites-for-ahead-of-time-aot-compilation)
- [GPU](#gpu)
- [CPU](#cpu)
- [Accelerator](#accelerator)
- [Test DPC++ toolchain](#test-dpc-toolchain)
- [Run in-tree LIT tests](#run-in-tree-lit-tests)
- [Run DPC++ E2E tests](#run-dpc-e2e-tests)
- [Run Khronos\* SYCL\* conformance test suite (optional)](#run-khronos-sycl-conformance-test-suite-optional)
- [Run simple DPC++ application](#run-simple-dpc-application)
- [AOT Target architectures](#aot-target-architectures)
- [Build DPC++ application with CMake](#build-dpc-application-with-cmake)
- [Code the program for a specific GPU](#code-the-program-for-a-specific-gpu)
- [C++ standard](#c-standard)
- [Known Issues and Limitations](#known-issues-and-limitations)
- [CUDA back-end limitations](#cuda-back-end-limitations)
- [HIP back-end limitations](#hip-back-end-limitations)
- [Find More](#find-more)

## Prerequisites

@@ -173,7 +175,7 @@ the CUDA backend has Windows support; Windows Subsystem for
Linux (WSL) is not needed to build and run the CUDA backend.

Enabling this flag requires an installation of at least
[CUDA 10.2](https://developer.nvidia.com/cuda-10.2-download-archive) on
[CUDA 11.5](https://developer.nvidia.com/cuda-11-5-0-download-archive) on
Contributor

Can we change this to 11.0? Our plugins are aimed to target 11.0 so would be good if we had the same support in open source

Contributor Author

Yeah is it definitely working now for 11.0?

Contributor Author

Will it work for e.g. 10.2 or even earlier? Ideally we can support as many CUDA versions as possible. It's been pointed out many times that clusters don't update their versions for long periods, e.g. #10858 (comment).

Contributor

I think as long as we keep intermittently checking that CUDA 10.2 is working we can keep this support. But the problem is that we don't have any CI for CUDA 10.2 whereas we do have CI for CUDA 11.0 (to test the releases internally)

Contributor Author

There are version checks for above/below 10.2 here:

// CU_POINTER_ATTRIBUTE_RANGE_START_ADDR was introduced in CUDA 10.2

Maybe with the right checks, versions < 10.2 could work fine. Or maybe it already works.

Contributor @hdelan, Aug 22, 2023

I would say that it already works. But we don't have CI to check every commit

Contributor

intel/llvm CI started with version 11.4.
We switched to 11.7 due to build issues with SYCL-CTS (#6545).
Today we validate each commit with version 12.1, but unfortunately I didn't record the reason for the uplift from 11.7.

> I would say that it already works. But we don't have CI to check every commit

Do we need to use 10.2 in CI instead of 12.1?
IIRC, all previous uplifts were agreed with @AerialMantis.

Contributor Author @JackAKirk, Aug 22, 2023

I believe that from the next release we are meant to be supporting 12.1, so it makes sense for the CI to test that as the priority. It would be nice to keep track of whether older releases still work, so that systems that don't regularly update their toolkit can stay supported; but I think that is more on Codeplay to test from time to time, and when introducing new features that might break older versions.

Contributor Author

I've dug through some older commits and I think that versions older than 10.2 should work for appropriate devices.
We can't easily test it since we don't have a system where 10.x or older versions are installed.
I've updated the doc accordingly.

the system, refer to
[NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)
or
@@ -191,14 +193,17 @@ Optimization options for CUDA toolkits prior to 11.6 (This is due to a bug in
earlier versions of the CUDA toolkit: see
[this issue](https://forums.developer.nvidia.com/t/libdevice-functions-causing-ptxas-segfault/193352)).

An installation of at least
[CUDA 11.0](https://developer.nvidia.com/cuda-11.0-download-archive)
is required to fully utilize Turing (SM 75) devices and to enable Ampere (SM 80)
core features.

The CUDA backend should work on Windows or Linux operating systems with any
GPU compatible with SM 50 or above. The default SM for the NVIDIA CUDA backend
is 5.0. Users can specify lower values, but some features may not be supported.
GPU with compute capability (SM version) sm_50 or above. The default
SM version for the NVIDIA CUDA backend is sm_50. Users of sm_3X devices can
attempt to specify the target architecture [ahead of time](#aot-target-architectures),
provided that they use a 11.X CUDA Runtime version, but some features may
not be supported. The CUDA backend has been tested with
different Ubuntu linux distributions and a selection of supported CUDA toolkit versions
and GPUs. The backend is tested by a relevant device/toolkit prior to a oneapi plugin release.
Go to the plugin release [pages](https://developer.codeplay.com/products/oneapi/nvidia/)
for further details.

Contributor

Suggested change
The CUDA backend should work on Windows or Linux operating systems with any
GPU compatible with SM 50 or above. The default SM for the NVIDIA CUDA backend
is 5.0. Users can specify lower values, but some features may not be supported.
GPU with compute capability (SM version) sm_50 or above. The default
SM version for the NVIDIA CUDA backend is sm_50. Users of sm_3X devices can
attempt to specify the target architecture [ahead of time](#aot-target-architectures),
provided that they use a 11.X CUDA Runtime version, but some features may
not be supported. The CUDA backend has been tested with
different Ubuntu linux distributions and a selection of supported CUDA toolkit versions
and GPUs. The backend is tested by a relevant device/toolkit prior to a oneapi plugin release.
Go to the plugin release [pages](https://developer.codeplay.com/products/oneapi/nvidia/)
for further details.
The CUDA backend should work on Windows or Linux operating systems with any GPU
with compute capability (SM version) sm_50 or above. The default SM version for
the NVIDIA CUDA backend is sm_50. Users of sm_3X devices can attempt to specify
the target architecture [ahead of time](#aot-target-architectures), provided
that they use a 11.X CUDA Runtime version, but some features may not be
supported. The CUDA backend has been tested with different Ubuntu Linux
distributions and a selection of supported CUDA toolkit versions and GPUs. The
backend is tested by a relevant device/toolkit prior to a oneAPI plugin release.
Go to the plugin release
[pages](https://developer.codeplay.com/products/oneapi/nvidia/) for further
details.
  1. Use 80-char line limit.
  2. linux -> Linux
  3. oneapi -> oneAPI
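
For reference, a minimal sketch of enabling the CUDA plugin via the buildbot scripts, assuming the flag referenced above is `--cuda` (by analogy with the `--hip` flag shown later in this guide) and that the build is driven by the companion `compile.py` script:

```bash
# Sketch: configure and build DPC++ with the CUDA plugin enabled.
# Assumes $DPCPP_HOME/llvm is the compiler checkout used throughout this guide.
python $DPCPP_HOME/llvm/buildbot/configure.py --cuda
python $DPCPP_HOME/llvm/buildbot/compile.py
```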


**Non-standard CUDA location**:

@@ -220,9 +225,13 @@ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DPCPP_HOME/llvm/build/lib ./a.out

### Build DPC++ toolchain with support for HIP AMD

There is experimental support for DPC++ for HIP on AMD devices. Note as this is
still experimental and there is no continuous integration for this yet there
are therefore no guarantees for supported platforms or configurations.
There is beta support for DPC++ for HIP on AMD devices.
It is not feature complete and it still contains known and unknown bugs.
Currently it has only been tried on Linux, with ROCm 4.2.0, 4.3.0, 4.5.2, 5.3.0, and 5.4.3,
using the AMD Radeon Pro W6800 (gfx1030), MI50 (gfx906), MI100 (gfx908), and MI250X (gfx90a) devices.
The backend is tested with a relevant device/toolkit prior to a oneAPI plugin release.
Go to the plugin release [pages](https://developer.codeplay.com/products/oneapi/amd)
for further details.
JackAKirk marked this conversation as resolved.

To enable support for HIP devices, follow the instructions for the Linux
DPC++ toolchain, but add the `--hip` flag to `configure.py`
@@ -241,8 +250,6 @@ python $DPCPP_HOME/llvm/buildbot/configure.py --hip \
--cmake-opt=-DSYCL_BUILD_PI_HIP_ROCM_DIR=/usr/local/rocm
```

Currently, this has only been tried on Linux, with ROCm 4.2.0 or 4.3.0 and
using the MI50 (gfx906) and MI100 (gfx908) devices.

[LLD](https://llvm.org/docs/AMDGPUUsage.html) is necessary for the AMDGPU compilation chain.
The AMDGPU backend generates a standard ELF [ELF] relocatable code object that can be linked by lld to
@@ -252,9 +259,9 @@ on building LLD refer to [LLD Build Guide](https://lld.llvm.org/).
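
As a generic sketch (not specific to this repository's buildbot flow, which may already enable lld for you), lld can be built as part of an LLVM tree by adding it to `LLVM_ENABLE_PROJECTS`:

```bash
# Generic LLVM sketch: build lld alongside clang so AMDGPU code objects can be linked.
cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS="clang;lld"
ninja -C build lld
```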

### Build DPC++ toolchain with support for HIP NVIDIA

There is experimental support for DPC++ for HIP on Nvidia devices. Note as this
is still experimental and there is no continuous integration for this yet there
are therefore no guarantees for supported platforms or configurations.
There is experimental support for DPC++ for HIP on NVIDIA devices.
There is no continuous integration for this and there are
no guarantees for supported platforms or configurations.
JackAKirk marked this conversation as resolved.

This is a compatibility feature and the [CUDA backend](#build-dpc-toolchain-with-support-for-nvidia-cuda)
should be preferred to run on NVIDIA GPUs.
@@ -265,7 +272,7 @@ DPC++ toolchain, but add the `--hip` and `--hip-platform NVIDIA` flags to

Enabling this flag requires HIP to be installed, more specifically
[HIP NVCC](https://rocmdocs.amd.com/en/latest/Installation_Guide/HIP-Installation.html#nvidia-platform),
as well as CUDA to be installed, see
as well as the CUDA Runtime API to be installed, see
[NVIDIA CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html).
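
Putting the flags above together, a configure invocation for the HIP NVIDIA path might look like the following sketch (flag spellings are taken from the prose above; adjust paths to your installation):

```bash
# Sketch: configure DPC++ with the HIP plugin targeting the NVIDIA platform.
python $DPCPP_HOME/llvm/buildbot/configure.py --hip --hip-platform NVIDIA
```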

Currently, this has only been tried on Linux, with ROCm 4.2.0 or 4.3.0, with
@@ -642,6 +649,8 @@ clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
simple-sycl-app.cpp -o simple-sycl-app-cuda.exe
```

#### AOT Target architectures

When building for HIP AMD, use the AMD target triple and specify the
target architecture with `-Xsycl-target-backend --offload-arch=<arch>`
as follows:
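
A sketch of such an invocation, assuming the `amdgcn-amd-amdhsa` target triple and using `gfx906` purely as an illustrative architecture (mirroring the CUDA command shown above):

```bash
# Sketch: AOT compilation for a HIP AMD device; substitute your GPU's --offload-arch value.
clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa \
  -Xsycl-target-backend --offload-arch=gfx906 \
  simple-sycl-app.cpp -o simple-sycl-app-amd.exe
```
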
@@ -817,24 +826,6 @@ int CUDASelector(const sycl::device &Device) {

```

### Using the DPC++ toolchain on CUDA platforms

Currently, the DPC++ toolchain relies on having a recent OpenCL implementation
on the system in order to link applications to the DPC++ runtime.
The OpenCL implementation is not used at runtime if only the CUDA backend is
used in the application, but must be installed.

The OpenCL implementation provided by the CUDA SDK is OpenCL 1.2, which is
too old to link with the DPC++ runtime and lacks some symbols.

We recommend installing the low level CPU runtime, following the instructions
in the next section.

Instead of installing the low level CPU runtime, it is possible to build and
install the
[Khronos ICD loader](https://github.com/KhronosGroup/OpenCL-ICD-Loader),
which contains all the symbols required.

## C++ standard

* DPC++ runtime and headers require C++17 at least.
@@ -854,11 +845,7 @@

### CUDA back-end limitations

* Backend is only supported on Linux
* The only combination tested is Ubuntu 22.04 with CUDA 11.7 using a Titan RTX
GPU (SM 71), but it should work on any GPU compatible with SM 50 or above
* The NVIDIA OpenCL headers conflict with the OpenCL headers required for this
project and may cause compilation issues on some platforms
* Windows support is currently experimental and not regularly tested.
* `sycl::sqrt` is not correctly rounded by default, as the SYCL specification
allows lower precision; when porting from CUDA it may be helpful to use
`-fsycl-fp32-prec-sqrt` to use the correctly rounded square root, this is
@@ -883,12 +870,7 @@
* Requires a ROCm compatible operating system; for full details of supported
operating systems for ROCm, please refer to the
[ROCm Supported Operating Systems](https://github.com/RadeonOpenCompute/ROCm#supported-operating-systems).
* Has only been tried with ROCm 4.2.0 and 4.3.0.
* Has only been tested using the MI50 (gfx906) and MI100 (gfx908) devices.
* Support is still experimental so not all of the tests are currently passing
and many of the built-in function are not yet implemented.
* Additionally there is no continuous integration yet so no guarantee can be
made for support platforms or configurations
* Support is still in a beta state, but the backend is being actively developed.
* Global offsets are currently not supported.

## Find More