Skip to content

[SYCL] Change linkage to linkonce_odr unless SYCL_EXTERNAL. #10317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 12, 2023

Conversation

hvdijk
Copy link
Contributor

@hvdijk hvdijk commented Jul 12, 2023

In getLLVMLinkageForDeclarator, CUDA device functions are given internal linkage to allow improved IPO. The same optimizations would also be useful for SYCL. However, as multiple translation units are a possibility here, it cannot be done exactly the same way. There are a number of extra considerations that need to be made.

  • If a device function or object is annotated with the SYCL_EXTERNAL macro, it may be called from a different module and its handling should not be touched.

  • If a device function or object is not annotated with the SYCL_EXTERNAL macro, it may not be called from a different translation unit unless that other translation unit provides its own definition of it. As such, if other optimizations remove all references in the current module, the function or object itself can be removed, and other passes should be aware that it is legal to remove the function or object.

  • If two kernels defined in different translation units call into the same non-kernel inline or template device function, these different copies of the device function would previously be merged together, and should continue to be.

The linkage that satisfies these criteria is linkonce_odr, so this commit marks functions and objects so, unless the sycl_kernel or sycl_device attributes are present which indicate that it must remain externally accessible.

@hvdijk hvdijk requested review from a team as code owners July 12, 2023 00:22
@hvdijk hvdijk requested a review from steffenlarsen July 12, 2023 00:22
@hvdijk hvdijk temporarily deployed to aws July 12, 2023 00:39 — with GitHub Actions Inactive
@hvdijk hvdijk temporarily deployed to aws July 12, 2023 01:40 — with GitHub Actions Inactive
@bader bader requested review from AlexeySachkov and LU-JOHN July 12, 2023 02:47
@bader
Copy link
Contributor

bader commented Jul 12, 2023

  • If a device function or object is not annotated with the SYCL_EXTERNAL macro, it may not be called from a different translation unit unless that other translation unit provides its own definition of it. As such, if other optimizations remove all references in the current module, the function or object itself can be removed, and other passes should be aware that it is legal to remove the function or object.

I think the right linkage type for this case is "internal".

@hvdijk
Copy link
Contributor Author

hvdijk commented Jul 12, 2023

I think the right linkage type for this case is "internal".

internal would indeed allow that, but linkonce_odr does allow unreferenced functions to be removed too, see GlobalValue::isDiscardableIfUnused. As internal and linkonce_odr both handle this equally, and internal does not satisfy the other criteria but linkonce_odr does, I think internal is not the right linkage to use.

@hvdijk hvdijk temporarily deployed to aws July 12, 2023 09:59 — with GitHub Actions Inactive
@hvdijk hvdijk temporarily deployed to aws July 12, 2023 11:09 — with GitHub Actions Inactive
@bader
Copy link
Contributor

bader commented Jul 14, 2023

As internal and linkonce_odr both handle this equally

Not exactly. AFAIK, there is a difference in how optimizations handle these linkage types. I think we might want to prefer internal over linkonce_odr get enable more optimization opportunities.

internal does not satisfy the other criteria but linkonce_odr does

I implied that we use internal only when it's the right thing to do and use linkonce_odr only when we can't use internal.

Anyway, I don't mind merging this patch as-is and make a separate patch if we prove that using internal is benefitial.

@hvdijk
Copy link
Contributor Author

hvdijk commented Jul 14, 2023

Not exactly. AFAIK, there is a difference in how optimizations handle these linkage types. I think we might want to prefer internal over linkonce_odr get enable more optimization opportunities.

That is true, there is indeed a difference in how some other optimizations handle these linkage types. The unreferenced function removal was the only thing I was thinking of, and that handles the two linkage types the same, but other optimizations are different.

I implied that we use internal only when it's the right thing to do and use linkonce_odr only when we can't use internal.

Anyway, I don't mind merging this patch as-is and make a separate patch if we prove that using internal is benefitial.

Ah, gotcha. In this function we have to account for the worst possibility, we do not have a lot of information to do better, but I actually see there is one case where internal is possible. I had not previously noticed that DPC++ has the -fno-sycl-rdc option. If that option is used, we can use internal. I will update this PR to do so. This may also permit sharing the logic with the existing CUDA check, I will check.

@bader bader requested a review from sarnex July 14, 2023 02:07
@hvdijk
Copy link
Contributor Author

hvdijk commented Jul 14, 2023

I have updated as said, and used the idea of the CodeGenCUDA directory to modify tests to run with and without -fno-sycl-rdc (actually, with -f(no-)gpu-rdc since they use %clang_cc1). I did try checking whether this could be used for CUDA as well, but found that I do not know enough about CUDA to make sure, so I left CUDA alone. It does allow someone else to move the CUDA check to combine it with the SYCL check later, in a separate PR, if they can make sure it works. Does this look okay to you?

@hvdijk hvdijk temporarily deployed to aws July 14, 2023 17:26 — with GitHub Actions Inactive
@hvdijk hvdijk temporarily deployed to aws July 14, 2023 18:28 — with GitHub Actions Inactive
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime changes LGTM. I worry slightly that there was a reason we split the aspects and srcloc checks, but if it becomes a problem we can split the test in two.

Comment on lines 5855 to 5859
// SYCL: Device code is not generally limited to one TU, but anything accessed
// from another translation unit is required to be annotated with the
// SYCL_EXTERNAL keyword. For any function or variable that does not have
// this, linkonce suffices. If -fno-sycl-rdc, though, we know there is only
// one translation unit and can so mark them internal instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// SYCL: Device code is not generally limited to one TU, but anything accessed
// from another translation unit is required to be annotated with the
// SYCL_EXTERNAL keyword. For any function or variable that does not have
// this, linkonce suffices. If -fno-sycl-rdc, though, we know there is only
// one translation unit and can so mark them internal instead.
// SYCL: Device code is not generally limited to one translation unit, but anything accessed
// from another translation unit is required to be annotated with the
// SYCL_EXTERNAL macro. For any function or variable that does not have
// this, linkonce_odr suffices. If -fno-sycl-rdc is passed, we know there is only
// one translation unit and can so mark functions internal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly applied, thank you. The one thing is that the last "them" refers to both functions and variables, so replacing it with just "functions" does not cover it entirely. Are you okay with leaving that as "them", or would you prefer I repeat that as "functions and variables"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with how it is applied now, thank you.

@Fznamznon Fznamznon requested a review from bader July 19, 2023 07:46
@hvdijk hvdijk force-pushed the sycl-linkage branch 2 times, most recently from d8d6d6f to 5b5190d Compare July 19, 2023 10:04
Copy link
Contributor

@Fznamznon Fznamznon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CFE change LGTM.

@hvdijk hvdijk temporarily deployed to aws July 19, 2023 10:21 — with GitHub Actions Inactive
@hvdijk hvdijk temporarily deployed to aws July 19, 2023 10:59 — with GitHub Actions Inactive
@@ -1,7 +1,9 @@
// RUN: %clang_cc1 -triple spir64-unknown-unknown -disable-llvm-passes -fsycl-is-device -emit-llvm %s -o - | FileCheck %s
// RUN: %clang_cc1 -triple spir64-unknown-unknown -disable-llvm-passes -fsycl-is-device -fgpu-rdc -emit-llvm %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-RDC
// RUN: %clang_cc1 -triple spir64-unknown-unknown -disable-llvm-passes -fsycl-is-device -fno-gpu-rdc -emit-llvm %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-NORDC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a good idea to extend all failing tests with -fno-gpu-rdc case. For example, the purpose of this test is to check llvm loop metadata and we don't want this test to fail on changing linkage type. The right fix for this test is to drop checking linkage type in CHECK-LABEL: patterns.

I would expect 1-2 tests to check the linkage type, but all other tests should ignore it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note though that the test was already checking the linkage type, that is not new behaviour of the test. When something like that changes, is it more common to just update the test to whatever is generated by the new version, or is it more common to update the test to allow any result? In upstream LLVM, it would be the former, so I went with that approach. If DPC++ usually does it differently, happy to make that change.

I can definitely agree that this is not what this particular test is about, and see your point that checking both -fgpu-rdc and -fno-gpu-rdc is overkill. Going over the tests, I think for device_global.cpp and sycl-cuda-host-device-functions.cu, checking both -fgpu-rdc and -fno-gpu-rdc makes sense for the test, and if we have those two, we cover both functions and variables. For the others, I will take out the -fno-gpu-rdc versions of the tests, and then check with you once that is ready if you would still like me to additionally loosen the checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Looking over the new diff, I do actually see a lot of those affected tests using {{.*}} in various places already, so will just update them the way you suggested under the assumption that this is indeed a difference in test styles. If it turns out you actually like the current version after all, thanks to Git, it's easy to go back. :)

@hvdijk hvdijk force-pushed the sycl-linkage branch 2 times, most recently from f7743c9 to ab36075 Compare July 19, 2023 23:20
@hvdijk hvdijk temporarily deployed to aws July 19, 2023 23:38 — with GitHub Actions Inactive
@hvdijk hvdijk temporarily deployed to aws July 20, 2023 00:17 — with GitHub Actions Inactive
@hvdijk
Copy link
Contributor Author

hvdijk commented Jul 24, 2023

Thanks for the approvals. The CI failure for HIP AMDGPU looks unrelated to this PR and has shown up in other PRs as well (e.g. https://github.com/intel/llvm/actions/runs/5525573620/job/14962770834?pr=10313). The CI failure for Intel CPU/GEN9 GPU, I am not sure what is happening there but I see there have been relevant changes to CI, so before looking into that, I rebased to re-trigger CI and see what it shows now. Aside from checking the CI results, is this waiting on anything else from me?

@hvdijk hvdijk temporarily deployed to aws July 24, 2023 23:08 — with GitHub Actions Inactive
@hvdijk hvdijk temporarily deployed to aws July 24, 2023 23:46 — with GitHub Actions Inactive
@steffenlarsen
Copy link
Contributor

Thanks for the approvals. The CI failure for HIP AMDGPU looks unrelated to this PR and has shown up in other PRs as well (e.g. https://github.com/intel/llvm/actions/runs/5525573620/job/14962770834?pr=10313). The CI failure for Intel CPU/GEN9 GPU, I am not sure what is happening there but I see there have been relevant changes to CI, so before looking into that, I rebased to re-trigger CI and see what it shows now. Aside from checking the CI results, is this waiting on anything else from me?

I think the device_global related failures are indeed related to this patch. I suspect something about the linkonce_odr linkage on these is either causing some optimizations we don't want for device_global or that IGC isn't happy about the produced SPIR-V.

@hvdijk
Copy link
Contributor Author

hvdijk commented Jul 25, 2023

I think the device_global related failures are indeed related to this patch.

Thanks Steffen. These are new tests that were not yet part on the commit that I was originally working on (or, at least, not yet enabled). I agree that they look like they are caused by this patch and am looking into them. The errors that were showing up earlier have disappeared, so I will disregard those as likely unrelated and since fixed.

@hvdijk
Copy link
Contributor Author

hvdijk commented Jul 25, 2023

I think there is a design decision to be made here. The whole idea of SYCL_EXTERNAL is that a device function or variable that doesn't have it may be presumed local to the translation unit, the whole idea of device_global is that a variable that does have it shares it with the host. It follows that it makes no sense to have device_global without some form of SYCL_EXTERNAL. I see three options.

  1. Decide that user code must specify SYCL_EXTERNAL explicitly. This would require an update of the tests.
  2. Decide that the front end should make the device_global attribute imply SYCL_EXTERNAL / the sycl_device attribute.
  3. Decide that all code that checks for sycl_device should also check for device_global.

I think 1 makes for a poor user experience. 2 and 3 are the same from the user's point of view, if done correctly. I think 2 makes most sense, but requires bigger changes. I think 3 should be a simple short change that's sufficient to get this merged, and 2 can be done later if so desired. I will try to update the PR accordingly, but if you would prefer another direction, please let me know.

@steffenlarsen
Copy link
Contributor

I think there is a design decision to be made here. The whole idea of SYCL_EXTERNAL is that a device function or variable that doesn't have it may be presumed local to the translation unit, the whole idea of device_global is that a variable that does have it shares it with the host. It follows that it makes no sense to have device_global without some form of SYCL_EXTERNAL. I see three options.

  1. Decide that user code must specify SYCL_EXTERNAL explicitly. This would require an update of the tests.
  2. Decide that the front end should make the device_global attribute imply SYCL_EXTERNAL / the sycl_device attribute.
  3. Decide that all code that checks for sycl_device should also check for device_global.

I think 1 makes for a poor user experience. 2 and 3 are the same from the user's point of view, if done correctly. I think 2 makes most sense, but requires bigger changes. I think 3 should be a simple short change that's sufficient to get this merged, and 2 can be done later if so desired. I will try to update the PR accordingly, but if you would prefer another direction, please let me know.

I don't think we want option 1. either, as it means something different for device_global. Likewise, I think that makes 2. somewhat problematic. I am okay with seeing where option 3 takes us.

Note that the way avoided the compiler messing with internal device_global currently is to have llvm.used refer to them. I am not sure why that doesn't work for this case.

@hvdijk
Copy link
Contributor Author

hvdijk commented Jul 25, 2023

Note that the way avoided the compiler messing with internal device_global currently is to have llvm.used refer to them. I am not sure why that doesn't work for this case.

That llvm.used should be kept around throughout the LLVM IR parts of the compilation process, and should ensure that the global gets emitted into the SPIR-V object even if it weren't referenced, but the linkage inside the SPIR-V object still changes the same way as in LLVM, from Export to LinkOnceODR, and I am not seeing anything like llvm.used still around in the SPIR-V object. I cannot see just what the driver is doing with this, but I can imagine that a driver is not prepared to handle this and I think we should take care of that before we get to the SPIR-V generation.

@hvdijk hvdijk temporarily deployed to aws July 25, 2023 16:31 — with GitHub Actions Inactive
@hvdijk hvdijk temporarily deployed to aws July 25, 2023 17:10 — with GitHub Actions Inactive
Copy link
Contributor

@sarnex sarnex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change seems sane to me and the no-rdc-related changes/statements seem correct, so that part lgtm. i am not an expert on llvm linkage types so ill leave that to others

In getLLVMLinkageForDeclarator, CUDA device functions are given internal
linkage to allow improved IPO. The same optimizations would also be
useful for SYCL. However, as multiple translation units are a
possibility here, it cannot be done exactly the same way. There are a
number of extra considerations that need to be made.

- If a device function or object is annotated with the SYCL_EXTERNAL
  macro, it may be called from a different module and its handling
  should not be touched.

- If a device function or object is not annotated with the SYCL_EXTERNAL
  macro, it may not be called from a different translation unit unless
  that other translation unit provides its own definition of it. As
  such, if other optimizations remove all references in the current
  module, the function or object itself can be removed, and other passes
  should be aware that it is legal to remove the function or object.

- If two kernels defined in different translation units call into the
  same non-kernel inline or template device function, these different
  copies of the device function would previously be merged together, and
  should continue to be.

The linkage that satisfies these criteria is linkonce_odr, so this
commit marks functions and objects so, unless the sycl_kernel or
sycl_device attributes are present which indicate that it must remain
externally accessible.

If -fno-sycl-rdc is used, these concerns go away and we can use internal
linkage instead.
@hvdijk
Copy link
Contributor Author

hvdijk commented Aug 29, 2023

Rebased to resolve a conflict after the removal of no-opaque-pointer tests.

@hvdijk
Copy link
Contributor Author

hvdijk commented Sep 8, 2023

ping

@bader bader merged commit 4cb4fec into intel:sycl Sep 12, 2023
whitneywhtsang added a commit to sys-ce-bb/llvm that referenced this pull request Sep 12, 2023
victor-eds added a commit to victor-eds/llvm that referenced this pull request Sep 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants