[SYCL] Support per-object file compilation #7595
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Today, the basic flow of the SYCL device-side driver is as follows:
1) Link all device code from all compiler inputs together
2) Link 1) against SYCL device libraries
3) Run sycl-post-link on 2)
4) Run llvm-spirv on 3)
5) Run clang-offload-wrapper on 4)

Step 1 can create a performance bottleneck when you have a huge number of kernels. If none of the kernels use globals or SYCL_EXTERNAL functions, we can actually split it up to be the following:

For each object file:
1) Link this object file's device code with SYCL device libraries
2) Run sycl-post-link on 1)
3) Run llvm-spirv on 2)
4) Run clang-offload-wrapper on 3)

Since we don't link all device code together, each step runs on smaller IR, which results in compiler runtime and compiler memory usage benefits.

Note that in order to do the above per-object-file, we need to break up fat static archives. We do this by using the ForEachWrappingAction action. This allows us to run commands on each item inside the fat static archive.

The driver flow when a static archive is involved is the most complex case and looks like the below:
1) spirv-to-ir-wrapper on fat static archive
2) Link all SYCL device libraries together without any user device code into a single device library BC
3) llvm-foreach:
   3a) llvm-link current object file with 2)
4) file-table-tform replacing tempfilelist from 1) with output of 3)
5) llvm-foreach:
   5a) sycl-post-link on 4)
6) llvm-foreach:
   6a) Extract BC file column from 5) output table
7) file-table-tform on 6) merging all BC file columns into a single big column
8) llvm-spirv on 7) (does llvm-foreach internally)
9) file-table-tform on 5) output table, replacing BC column with spirv column from 8)
10) clang-offload-wrapper with 9)

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
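To make the difference between the two flows concrete, here is a rough Python model of the job lists described in this commit message. It is only an illustration: the function names `rdc_flow` and `per_object_flow` are made up for this sketch, the step strings are descriptive labels rather than real command lines, and none of this is the actual driver code.

```python
from typing import List


def rdc_flow(objects: List[str]) -> List[str]:
    """Default flow: link everything once, then run each later tool once."""
    return [
        f"link device code from {len(objects)} inputs together",
        "link against SYCL device libraries",
        "sycl-post-link",
        "llvm-spirv",
        "clang-offload-wrapper",
    ]


def per_object_flow(objects: List[str]) -> List[str]:
    """Per-object flow: no global link; every later tool runs once per object."""
    steps: List[str] = []
    for obj in objects:
        steps += [
            f"{obj}: link device code with SYCL device libraries",
            f"{obj}: sycl-post-link",
            f"{obj}: llvm-spirv",
            f"{obj}: clang-offload-wrapper",
        ]
    return steps


if __name__ == "__main__":
    inputs = ["a.o", "b.o", "c.o"]
    for step in per_object_flow(inputs):
        print(step)
```

The per-object list is longer (four jobs per input instead of five in total), but each job only ever sees one object file's IR, which is where the runtime and memory savings come from.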
I don't really like this commit, but I see the following requirements:
1) Keep the default for GPURelocatableDeviceCode to false
2) For SYCL, if no cc1 option is specified, GPURelocatableDeviceCode should be true

Let me know if anyone has any better ideas.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
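The two requirements above amount to a three-state decision, which could be sketched as follows. This is a hypothetical Python model, not the frontend code: `effective_rdc`, `is_sycl`, and `explicit` are names invented here, with `explicit=None` standing in for "no cc1 option was specified".

```python
from typing import Optional


def effective_rdc(is_sycl: bool, explicit: Optional[bool]) -> bool:
    """Resolve the relocatable-device-code setting.

    explicit is None when no option was given, otherwise the requested value.
    """
    if explicit is not None:
        # An explicitly specified option always wins.
        return explicit
    # No option given: the general default stays false, but SYCL defaults to true.
    return is_sycl
```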
FE changes LGTM.
@AlexeySachkov @mdtoguchi Any more feedback on this bad boy? I think I addressed all feedback. Thanks!
A couple more questions here and minor comments.
@premanandrao Do you mind re-checking the FE changes? I had to rework them.
I have no further concerns/comments, thanks
Thanks, with the CFE re-review complete we are now ready to merge!
Thanks, I have no further concerns.
@intel/llvm-gatekeepers Mind merging this one? Thanks!
This change adds per-object compilation support for SYCL, also called non-relocatable device code mode. This is already supported in clang for HIP and CUDA.
It adds a new option, -f[no-]sycl-rdc. The default is -fsycl-rdc, which compiles code as today. Passing -fno-sycl-rdc activates the new mode. This is just an alias to the existing flag used by HIP/CUDA, -f[no-]gpu-rdc.
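As a rough illustration of how the aliased flags could resolve, here is a small Python sketch. It assumes the usual clang convention that the last of a -f/-fno- pair on the command line wins, which this description does not state explicitly. The helper name `rdc_enabled` is hypothetical, and the real driver uses its own option tables rather than string matching.

```python
from typing import Sequence

# Both spellings are treated as one option, since -f[no-]sycl-rdc is an alias.
_ENABLE = {"-fsycl-rdc", "-fgpu-rdc"}
_DISABLE = {"-fno-sycl-rdc", "-fno-gpu-rdc"}


def rdc_enabled(args: Sequence[str], default: bool = True) -> bool:
    """Return True for relocatable device code mode, False for per-object mode.

    default=True mirrors the stated SYCL default of -fsycl-rdc.
    Assumption: when several of these flags appear, the last one wins.
    """
    enabled = default
    for arg in args:
        if arg in _ENABLE:
            enabled = True
        elif arg in _DISABLE:
            enabled = False
    return enabled


if __name__ == "__main__":
    print(rdc_enabled(["-fsycl", "-fno-sycl-rdc", "-c", "a.cpp"]))
```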
The main implication is that we no longer link all device code together into one big module before post link.
Instead, we execute all jobs after device linking on a per-object file basis.
This means sycl-post-link and the later jobs execute multiple times, since we no longer have one big module.
This can result in a large improvement in compiler runtime and memory usage: for QUDA built with -g, we see peak memory usage drop from over 250GB to 4GB, along with a large compiler runtime improvement.
Error cases: