Step 1 can take hours for some targets. But if the user wishes to recompile
after modifying only host_a.cpp and host_b.cpp, they can simply run steps 2
and 3 without rerunning the expensive step 1.

The compiler is responsible for verifying that the user provided all the
relevant files to the device link step. There are 2 cases that have to be
checked:

1. Missing symbols referenced by the kernels present in the device link step
   (e.g. functions called by or global variables used by the known kernels).
2. Missing kernels.

Case 1 can be identified in the device binary generation stage (step 1) by
scanning the known kernels. Case 2 must be verified by the driver by checking
for newly introduced kernels in the final link stage (step 3).

The llvm-no-spir-kernel tool was introduced to facilitate checking for case 2 in
the driver. It detects if a module includes kernels and is invoked as follows:
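A representative invocation is sketched below; the module name is a placeholder, and the exit-status convention (success when no kernels are present) is an assumption based on the description above:

```shell
# Illustrative only: "host.bc" is a placeholder module name; the tool is
# assumed to report kernel presence through its exit status.
llvm-no-spir-kernel host.bc
```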
#### CUDA support

The driver supports compilation to NVPTX when the
`nvptx64-nvidia-cuda-sycldevice` triple is passed to `-fsycl-targets`.

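As a sketch, assuming the usual clang driver syntax, an ahead-of-time compilation for the CUDA target might look like the following (source and output names are illustrative):

```shell
# Hypothetical command line: "app.cpp" and "app" are placeholder names.
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice app.cpp -o app
```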
Unlike other AOT targets, the bitcode module linked from intermediate compiled
objects never goes through SPIR-V. Instead it is passed directly in bitcode
form down to the NVPTX Back End. All produced bitcode depends on two libraries,
`libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc`
(built by the libclc project).

During the device linking step (device linker box in the
[Separate Compilation and Linking](#separate-compilation-and-linking)
illustration), llvm bitcode objects for the CUDA target are linked together
alongside `libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX
using the NVPTX backend, and assembled into a cubin using the `ptxas` tool
(part of the CUDA SDK). The PTX file and cubin are assembled together using
`fatbinary` to produce a CUDA fatbin. The CUDA fatbin is then passed to the
offload wrapper tool.

##### Checking if the compiler is targeting NVPTX

When the SYCL compiler is in device mode and targeting the NVPTX backend, the
compiler defines the macro `__SYCL_NVPTX__`. This macro can safely be used to
enable an NVPTX specific code path in SYCL kernels.

*Note: this macro is only defined during the device compilation phase.*
##### NVPTX Builtins

When the SYCL compiler is in device mode and targeting the NVPTX backend, the
compiler exposes NVPTX builtins supported by clang.

*Note: this enables NVPTX specific features which cannot be supported by other
targets or the host.*

In CUDA, users can only allocate one chunk of host allocated shared memory
(which maps to SYCL's local accessors). This chunk of memory is allocated as an
array `extern __shared__ <type> <name>[];` which LLVM represents as an external
global symbol to the CUDA shared memory address space. The NVPTX backend then
lowers this into a `.extern .shared .align 4 .b8` PTX instruction.
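For reference, the CUDA pattern being described looks like the sketch below (kernel name and body are illustrative, and running it requires a CUDA toolchain and GPU):

```cuda
// One dynamically sized chunk of shared memory per kernel launch,
// declared as an extern array with no compile-time size.
extern __shared__ float scratch[];

__global__ void scale(float *out, float factor, int n) {
  int i = threadIdx.x;
  if (i < n) {
    scratch[i] = out[i] * factor;  // stage the value in shared memory
    __syncthreads();
    out[i] = scratch[i];
  }
}

// At launch, the third <<<...>>> argument gives the chunk size in bytes:
//   scale<<<1, n, n * sizeof(float)>>>(out, 2.0f, n);
```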

In SYCL, users can allocate multiple local accessors and pass them as kernel
parameters. When the SYCL frontend lowers the SYCL kernel invocation into an
OpenCL compliant kernel entry, it lowers local accessors into a pointer to
OpenCL local memory (CUDA shared memory), but this is not legal for CUDA
kernels.

To legalize the SYCL lowering for CUDA, a SYCL for CUDA specific pass will do
the following:
- Create a global symbol to the CUDA shared memory address space
- Transform all pointers to CUDA shared memory into a 32 bit integer
  representing the offset in bytes to use with the global symbol
- Replace all uses of the transformed pointers by the address of the global
  symbol offset by the value of the integer passed as parameter

On the runtime side, when setting local memory arguments, the CUDA PI
implementation will internally set the argument as the offset with respect to
the accumulated size of used local memory. This approach preserves the
existing PI interface.
### Integration with SPIR-V format

Translation from LLVM IR to SPIR-V for special types is also supported, but
such LLVM IR must comply with some special requirements. Unfortunately there is
no canonical form of special built-in types and operations in LLVM IR; moreover,
we can't re-use the existing representation generated by the OpenCL C front-end
compiler. For instance, here is how the `OpGroupAsyncCopy` operation looks in
LLVM IR produced by the OpenCL C front-end compiler.