Skip to content

[CUDA] Include PTX in non-RDC mode using the new driver #84367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 7, 2024

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Mar 7, 2024

Summary:
The old driver embed PTX in rdc-mode and so does the nvcc compiler.
The new drivers currently does not do this, so we should keep it
consistent in this case. This simply requires adding the assembler
output as an input to the offloading action that gets fed to fatbin.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Mar 7, 2024
@llvmbot
Copy link
Member

llvmbot commented Mar 7, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)

Changes

Summary:
The old driver embed PTX in rdc-mode and so does the nvcc compiler.
The new drivers currently does not do this, so we should keep it
consistent in this case. This simply requires adding the assembler
output as an input to the offloading action that gets fed to fatbin.


Full diff: https://github.com/llvm/llvm-project/pull/84367.diff

2 Files Affected:

  • (modified) clang/lib/Driver/Driver.cpp (+8)
  • (modified) clang/test/Driver/cuda-phases.cu (+13-12)
diff --git a/clang/lib/Driver/Driver.cpp b/clang/lib/Driver/Driver.cpp
index cecd34acbc92c0..96e6ad77f5e50d 100644
--- a/clang/lib/Driver/Driver.cpp
+++ b/clang/lib/Driver/Driver.cpp
@@ -4625,7 +4625,15 @@ Action *Driver::BuildOffloadingActions(Compilation &C,
       DDeps.add(*A, *TCAndArch->first, TCAndArch->second.data(), Kind);
       OffloadAction::DeviceDependences DDep;
       DDep.add(*A, *TCAndArch->first, TCAndArch->second.data(), Kind);
+
+      // Compiling CUDA in non-RDC mode uses the PTX output if available.
+      for (Action *Input : A->getInputs())
+        if (Kind == Action::OFK_Cuda && A->getType() == types::TY_Object &&
+            !Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc,
+                          false))
+          DDep.add(*Input, *TCAndArch->first, TCAndArch->second.data(), Kind);
       OffloadActions.push_back(C.MakeAction<OffloadAction>(DDep, A->getType()));
+
       ++TCAndArch;
     }
   }
diff --git a/clang/test/Driver/cuda-phases.cu b/clang/test/Driver/cuda-phases.cu
index 9a231091de2bdc..a1c3c9b51b1e41 100644
--- a/clang/test/Driver/cuda-phases.cu
+++ b/clang/test/Driver/cuda-phases.cu
@@ -244,31 +244,32 @@
 // NEW-DRIVER-RDC-NEXT: 18: assembler, {17}, object, (host-cuda)
 // NEW-DRIVER-RDC-NEXT: 19: clang-linker-wrapper, {18}, image, (host-cuda)
 
-// RUN: %clang -### -target powerpc64le-ibm-linux-gnu -ccc-print-phases --offload-new-driver -fgpu-rdc \
+// RUN: %clang -### -target powerpc64le-ibm-linux-gnu -ccc-print-phases --offload-new-driver \
 // RUN:   --offload-arch=sm_52 --offload-arch=sm_70 %s 2>&1 | FileCheck --check-prefix=NEW-DRIVER %s
-//      NEW-DRIVER: 0: input, "[[INPUT:.+]]", cuda
-// NEW-DRIVER-NEXT: 1: preprocessor, {0}, cuda-cpp-output
-// NEW-DRIVER-NEXT: 2: compiler, {1}, ir
-// NEW-DRIVER-NEXT: 3: input, "[[INPUT]]", cuda, (device-cuda, sm_52)
+//      NEW-DRIVER: 0: input, "[[CUDA:.+]]", cuda, (host-cuda)
+// NEW-DRIVER-NEXT: 1: preprocessor, {0}, cuda-cpp-output, (host-cuda)
+// NEW-DRIVER-NEXT: 2: compiler, {1}, ir, (host-cuda)
+// NEW-DRIVER-NEXT: 3: input, "[[CUDA]]", cuda, (device-cuda, sm_52)
 // NEW-DRIVER-NEXT: 4: preprocessor, {3}, cuda-cpp-output, (device-cuda, sm_52)
 // NEW-DRIVER-NEXT: 5: compiler, {4}, ir, (device-cuda, sm_52)
 // NEW-DRIVER-NEXT: 6: backend, {5}, assembler, (device-cuda, sm_52)
 // NEW-DRIVER-NEXT: 7: assembler, {6}, object, (device-cuda, sm_52)
-// NEW-DRIVER-NEXT: 8: offload, "device-cuda (nvptx64-nvidia-cuda:sm_52)" {7}, object
-// NEW-DRIVER-NEXT: 9: input, "[[INPUT]]", cuda, (device-cuda, sm_70)
+// NEW-DRIVER-NEXT: 8: offload, "device-cuda (nvptx64-nvidia-cuda:sm_52)" {7}, "device-cuda (nvptx64-nvidia-cuda:sm_52)" {6}, object
+// NEW-DRIVER-NEXT: 9: input, "[[CUDA]]", cuda, (device-cuda, sm_70)
 // NEW-DRIVER-NEXT: 10: preprocessor, {9}, cuda-cpp-output, (device-cuda, sm_70)
 // NEW-DRIVER-NEXT: 11: compiler, {10}, ir, (device-cuda, sm_70)
 // NEW-DRIVER-NEXT: 12: backend, {11}, assembler, (device-cuda, sm_70)
 // NEW-DRIVER-NEXT: 13: assembler, {12}, object, (device-cuda, sm_70)
-// NEW-DRIVER-NEXT: 14: offload, "device-cuda (nvptx64-nvidia-cuda:sm_70)" {13}, object
-// NEW-DRIVER-NEXT: 15: clang-offload-packager, {8, 14}, image
-// NEW-DRIVER-NEXT: 16: offload, "host-cuda (powerpc64le-ibm-linux-gnu)" {2}, "device-cuda (powerpc64le-ibm-linux-gnu)" {15}, ir
+// NEW-DRIVER-NEXT: 14: offload, "device-cuda (nvptx64-nvidia-cuda:sm_70)" {13}, "device-cuda (nvptx64-nvidia-cuda:sm_70)" {12}, object
+// NEW-DRIVER-NEXT: 15: linker, {8, 14}, cuda-fatbin, (device-cuda)
+// NEW-DRIVER-NEXT: 16: offload, "host-cuda (powerpc64le-ibm-linux-gnu)" {2}, "device-cuda (nvptx64-nvidia-cuda)" {15}, ir
 // NEW-DRIVER-NEXT: 17: backend, {16}, assembler, (host-cuda)
 // NEW-DRIVER-NEXT: 18: assembler, {17}, object, (host-cuda)
 // NEW-DRIVER-NEXT: 19: clang-linker-wrapper, {18}, image, (host-cuda)
 
 // RUN: %clang -### --target=powerpc64le-ibm-linux-gnu -ccc-print-phases --offload-new-driver \
 // RUN:   --offload-arch=sm_52 --offload-arch=sm_70 %s %S/Inputs/empty.cpp 2>&1 | FileCheck --check-prefix=NON-CUDA-INPUT %s
+
 //      NON-CUDA-INPUT: 0: input, "[[CUDA:.+]]", cuda, (host-cuda)
 // NON-CUDA-INPUT-NEXT: 1: preprocessor, {0}, cuda-cpp-output, (host-cuda)
 // NON-CUDA-INPUT-NEXT: 2: compiler, {1}, ir, (host-cuda)
@@ -277,13 +278,13 @@
 // NON-CUDA-INPUT-NEXT: 5: compiler, {4}, ir, (device-cuda, sm_52)
 // NON-CUDA-INPUT-NEXT: 6: backend, {5}, assembler, (device-cuda, sm_52)
 // NON-CUDA-INPUT-NEXT: 7: assembler, {6}, object, (device-cuda, sm_52)
-// NON-CUDA-INPUT-NEXT: 8: offload, "device-cuda (nvptx64-nvidia-cuda:sm_52)" {7}, object
+// NON-CUDA-INPUT-NEXT: 8: offload, "device-cuda (nvptx64-nvidia-cuda:sm_52)" {7}, "device-cuda (nvptx64-nvidia-cuda:sm_52)" {6}, object
 // NON-CUDA-INPUT-NEXT: 9: input, "[[CUDA]]", cuda, (device-cuda, sm_70)
 // NON-CUDA-INPUT-NEXT: 10: preprocessor, {9}, cuda-cpp-output, (device-cuda, sm_70)
 // NON-CUDA-INPUT-NEXT: 11: compiler, {10}, ir, (device-cuda, sm_70)
 // NON-CUDA-INPUT-NEXT: 12: backend, {11}, assembler, (device-cuda, sm_70)
 // NON-CUDA-INPUT-NEXT: 13: assembler, {12}, object, (device-cuda, sm_70)
-// NON-CUDA-INPUT-NEXT: 14: offload, "device-cuda (nvptx64-nvidia-cuda:sm_70)" {13}, object
+// NON-CUDA-INPUT-NEXT: 14: offload, "device-cuda (nvptx64-nvidia-cuda:sm_70)" {13}, "device-cuda (nvptx64-nvidia-cuda:sm_70)" {12}, object
 // NON-CUDA-INPUT-NEXT: 15: linker, {8, 14}, cuda-fatbin, (device-cuda)
 // NON-CUDA-INPUT-NEXT: 16: offload, "host-cuda (powerpc64le-ibm-linux-gnu)" {2}, "device-cuda (nvptx64-nvidia-cuda)" {15}, ir
 // NON-CUDA-INPUT-NEXT: 17: backend, {16}, assembler, (host-cuda)

@@ -4625,7 +4625,15 @@ Action *Driver::BuildOffloadingActions(Compilation &C,
DDeps.add(*A, *TCAndArch->first, TCAndArch->second.data(), Kind);
OffloadAction::DeviceDependences DDep;
DDep.add(*A, *TCAndArch->first, TCAndArch->second.data(), Kind);

// Compiling CUDA in non-RDC mode uses the PTX output if available.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still respect --cuda-include-ptx=... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the current behavior is that it will "always" set the PTX in the job and optionally include it in the fatbinary job depending on those settings.

// Compiling CUDA in non-RDC mode uses the PTX output if available.
for (Action *Input : A->getInputs())
if (Kind == Action::OFK_Cuda && A->getType() == types::TY_Object &&
!Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure why we would need to include PTX for RDC compilation.

In retrospect, including PTX by default with all compilations turned out to be a wrong default choice.
It's just a waste of space for most of the users, and it allows problems to go unnoticed for longer than they should (e.g. something was compiled for a wrong GPU).

Switching to the new driver is a good point to make a better choice. I would argue that we should not be including PTX by default or, if we do deem that it may be useful, only add it for the most recent chosen GPU variant, to provide some forward compatibility, not for all of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't have my finger on the pulse of the CUDA users here. I think we want this patch to match the current behavior with --cuda-include-ptx as it seems to make the decision whether or not to include it at job creation time. We could then potentially change the default of --cuda-include-ptx if that's the preferred solution.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Mar 7, 2024

Should I make shouldIncludePTX default to false for the new driver?

@Artem-B
Copy link
Member

Artem-B commented Mar 7, 2024

Should I make shouldIncludePTX default to false for the new driver?

Yes, I think that's a better default.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Mar 7, 2024

Should I make shouldIncludePTX default to false for the new driver?

Yes, I think that's a better default.

Done, now requires --cuda-include-ptx=.

@Artem-B
Copy link
Member

Artem-B commented Mar 7, 2024

Should I make shouldIncludePTX default to false for the new driver?

Yes, I think that's a better default.

Done, now requires --cuda-include-ptx=.

This may be worth adding to the release notes.

continue;
static bool shouldIncludePTX(const ArgList &Args, StringRef InputArch) {
// The new driver does not include PTX by default.
bool includePTX = !Args.hasFlag(options::OPT_offload_new_driver,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a comment on why we're making this decision based on the new vs old driver.

Copy link
Member

@Artem-B Artem-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, with docs/comment nits.

Summary:
The old driver embed PTX in rdc-mode and so does the `nvcc` compiler.
The new drivers currently does not do this, so we should keep it
consistent in this case. This simply requires adding the assembler
output as an input to the offloading action that gets fed to fatbin.
@jhuber6
Copy link
Contributor Author

jhuber6 commented Mar 7, 2024

LGTM overall, with docs/comment nits.

Done, thanks for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants