[AMDGPU][clang][CodeGen][opt] Add late-resolved feature identifying predicates #134016

AlexVlx · 2025-04-02T02:34:56Z

This change adds two semi-magical builtins for AMDGPU:

__builtin_amdgcn_processor_is, which is similar in observable behaviour with __builtin_cpu_is, except that it is never "evaluated" at run time;
__builtin_amdgcn_is_invocable, which is behaviourally similar with __has_builtin, except that it is not a macro (i.e. not evaluated at preprocessing time).

Neither of these are constexpr, even though when compiling for concrete (i.e. gfxXXX / gfxXXX-generic) targets they get evaluated in Clang, so they shouldn't tear the AST too badly / at all for multi-pass compilation cases like HIP. They can only be used in specific contexts (as args to control structures).

The motivation for adding these is two-fold:

as a nice to have, it provides an AST-visible way to incorporate architecture specific code, rather than having to rely on macros and the preprocessor, which burn in the choice quite early;
as a must have, it allows featureful AMDGCN flavoured SPIR-V to be produced, where target specific capability is guarded and chosen or discarded when finalising compilation for a concrete target.

I've tried to keep the overall footprint of the change small. The changes to Sema are a bit unpleasant, but there was a strong desire to have Clang validate these, and to constrain their uses, and this was the most compact solution I could come up with (suggestions welcome).

In the end, I will note there is nothing that is actually AMDGPU specific here, so it is possible that in the future, assuming interests from other targets / users, we'd just promote them to generic intrinsics.

llvmbot · 2025-04-02T02:35:31Z

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-clang

Author: Alex Voicu (AlexVlx)

Changes

This change adds two semi-magical builtins for AMDGPU:

__builtin_amdgcn_processor_is, which is similar in observable behaviour with __builtin_cpu_is, except that it is never "evaluated" at run time;
__builtin_amdgcn_is_invocable, which is behaviourally similar with __has_builtin, except that it is not a macro (i.e. not evaluated at preprocessing time).

Neither of these are constexpr, even though when compiling for concrete (i.e. gfxXXX / gfxXXX-generic) targets they get evaluated in Clang, so they shouldn't tear the AST too badly / at all for multi-pass compilation cases like HIP. They can only be used in specific contexts (as args to control structures).

The motivation for adding these is two-fold:

as a nice to have, it provides an AST-visible way to incorporate architecture specific code, rather than having to rely on macros and the preprocessor, which burn in the choice quite early;
as a must have, it allows featureful AMDGCN flavoured SPIR-V to be produced, where target specific capability is guarded and chosen or discarded when finalising compilation for a concrete target.

I've tried to keep the overall footprint of the change small. The changes to Sema are a bit unpleasant, but there was a strong desire to have Clang validate these, and to constrain their uses, and this was the most compact solution I could come up with (suggestions welcome).

In the end, I will note there is nothing that is actually AMDGPU specific here, so it is possible that in the future, assuming interests from other targets / users, we'd just promote them to generic intrinsics.

Patch is 59.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134016.diff

17 Files Affected:

(modified) clang/docs/LanguageExtensions.rst (+110)
(modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+5)
(modified) clang/include/clang/Basic/DiagnosticSemaKinds.td (+10)
(modified) clang/lib/Basic/Targets/SPIR.cpp (+4)
(modified) clang/lib/Basic/Targets/SPIR.h (+4)
(modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+29)
(modified) clang/lib/Sema/SemaExpr.cpp (+157)
(added) clang/test/CodeGen/amdgpu-builtin-cpu-is.c (+65)
(added) clang/test/CodeGen/amdgpu-builtin-is-invocable.c (+64)
(added) clang/test/CodeGen/amdgpu-feature-builtins-invalid-use.cpp (+43)
(modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+9)
(added) llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp (+207)
(modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+2)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2-1)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-expand-feature-predicates-unfoldable.ll (+28)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-expand-feature-predicates.ll (+359)

diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index 3b8a9cac6587a..8a7cb75af13e5 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -4920,6 +4920,116 @@ If no address spaces names are provided, all address spaces are fenced.
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local")
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local", "global")
 
+__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``__builtin_amdgcn_processor_is`` and ``__builtin_amdgcn_is_invocable`` provide
+a functional mechanism for programatically querying:
+
+* the identity of the current target processor;
+* the capability of the current target processor to invoke a particular builtin.
+
+**Syntax**:
+
+.. code-block:: c
+
+  // When used as the predicate for a control structure
+  bool __builtin_amdgcn_processor_is(const char*);
+  bool __builtin_amdgcn_is_invocable(builtin_name);
+  // Otherwise
+  void __builtin_amdgcn_processor_is(const char*);
+  void __builtin_amdgcn_is_invocable(void);
+
+**Example of use**:
+
+.. code-block:: c++
+
+  if (__builtin_amdgcn_processor_is("gfx1201") ||
+      __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var))
+    __builtin_amdgcn_s_sleep_var(x);
+
+  if (!__builtin_amdgcn_processor_is("gfx906"))
+    __builtin_amdgcn_s_wait_event_export_ready();
+  else if (__builtin_amdgcn_processor_is("gfx1010") ||
+           __builtin_amdgcn_processor_is("gfx1101"))
+    __builtin_amdgcn_s_ttracedata_imm(1);
+
+  while (__builtin_amdgcn_processor_is("gfx1101")) *p += x;
+
+  do { *p -= x; } while (__builtin_amdgcn_processor_is("gfx1010"));
+
+  for (; __builtin_amdgcn_processor_is("gfx1201"); ++*p) break;
+
+  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready))
+    __builtin_amdgcn_s_wait_event_export_ready();
+  else if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_ttracedata_imm))
+    __builtin_amdgcn_s_ttracedata_imm(1);
+
+  do {
+    *p -= x;
+  } while (__builtin_amdgcn_is_invocable(__builtin_amdgcn_global_load_tr_b64_i32));
+
+  for (; __builtin_amdgcn_is_invocable(__builtin_amdgcn_permlane64); ++*p) break;
+
+**Description**:
+
+When used as the predicate value of the following control structures:
+
+.. code-block:: c++
+
+  if (...)
+  while (...)
+  do { } while (...)
+  for (...)
+
+be it directly, or as arguments to logical operators such as ``!, ||, &&``, the
+builtins return a boolean value that:
+
+* indicates whether the current target matches the argument; the argument MUST
+  be a string literal and a valid AMDGPU target
+* indicates whether the builtin function passed as the argument can be invoked
+  by the current target; the argument MUST be either a generic or AMDGPU
+  specific builtin name
+
+Outside of these contexts, the builtins have a ``void`` returning signature
+which prevents their misuse.
+
+**Example of invalid use**:
+
+.. code-block:: c++
+
+  void kernel(int* p, int x, bool (*pfn)(bool), const char* str) {
+    if (__builtin_amdgcn_processor_is("not_an_amdgcn_gfx_id")) return;
+    else if (__builtin_amdgcn_processor_is(str)) __builtin_trap();
+
+    bool a = __builtin_amdgcn_processor_is("gfx906");
+    const bool b = !__builtin_amdgcn_processor_is("gfx906");
+    const bool c = !__builtin_amdgcn_processor_is("gfx906");
+    bool d = __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    bool e = !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    const auto f =
+        !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready)
+        || __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    const auto g =
+        !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready)
+        || !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    __builtin_amdgcn_processor_is("gfx1201")
+      ? __builtin_amdgcn_s_sleep_var(x) : __builtin_amdgcn_s_sleep(42);
+    if (pfn(__builtin_amdgcn_processor_is("gfx1200")))
+      __builtin_amdgcn_s_sleep_var(x);
+
+    if (__builtin_amdgcn_is_invocable("__builtin_amdgcn_s_sleep_var")) return;
+    else if (__builtin_amdgcn_is_invocable(x)) __builtin_trap();
+  }
+
+When invoked while compiling for a concrete target, the builtins are evaluated
+early by Clang, and never produce any CodeGen effects / have no observable
+side-effects in IR. Conversely, when compiling for AMDGCN flavoured SPIR-v,
+which is an abstract target, a series of predicate values are implicitly
+created. These predicates get resolved when finalizing the compilation process
+for a concrete target, and shall reflect the latter's identity and features.
+Thus, it is possible to author high-level code, in e.g. HIP, that is target
+adaptive in a dynamic fashion, contrary to macro based mechanisms.
 
 ARM/AArch64 Language Extensions
 -------------------------------
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 44ef404aee72f..5d01a7e75f7e7 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -346,6 +346,11 @@ BUILTIN(__builtin_amdgcn_endpgm, "v", "nr")
 BUILTIN(__builtin_amdgcn_get_fpenv, "WUi", "n")
 BUILTIN(__builtin_amdgcn_set_fpenv, "vWUi", "n")
 
+// These are special FE only builtins intended for forwarding the requirements
+// to the ME.
+BUILTIN(__builtin_amdgcn_processor_is, "vcC*", "nctu")
+BUILTIN(__builtin_amdgcn_is_invocable, "v", "nctu")
+
 //===----------------------------------------------------------------------===//
 // R600-NI only builtins.
 //===----------------------------------------------------------------------===//
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 5e45482584946..45f0f9eb88e55 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -13054,4 +13054,14 @@ def err_acc_decl_for_routine
 // AMDGCN builtins diagnostics
 def err_amdgcn_global_load_lds_size_invalid_value : Error<"invalid size value">;
 def note_amdgcn_global_load_lds_size_valid_value : Note<"size must be %select{1, 2, or 4|1, 2, 4, 12 or 16}0">;
+def err_amdgcn_processor_is_arg_not_literal
+    : Error<"the argument to __builtin_amdgcn_processor_is must be a string "
+            "literal">;
+def err_amdgcn_processor_is_arg_invalid_value
+    : Error<"the argument to __builtin_amdgcn_processor_is must be a valid "
+            "AMDGCN processor identifier; '%0' is not valid">;
+def err_amdgcn_is_invocable_arg_invalid_value
+    : Error<"the argument to __builtin_amdgcn_is_invocable must be either a "
+            "target agnostic builtin or an AMDGCN target specific builtin; `%0`"
+            " is not valid">;
 } // end of sema component.
diff --git a/clang/lib/Basic/Targets/SPIR.cpp b/clang/lib/Basic/Targets/SPIR.cpp
index 5b5f47f9647a2..eb43d9b0be283 100644
--- a/clang/lib/Basic/Targets/SPIR.cpp
+++ b/clang/lib/Basic/Targets/SPIR.cpp
@@ -152,3 +152,7 @@ void SPIRV64AMDGCNTargetInfo::setAuxTarget(const TargetInfo *Aux) {
     Float128Format = DoubleFormat;
   }
 }
+
+bool SPIRV64AMDGCNTargetInfo::isValidCPUName(StringRef CPU) const {
+  return AMDGPUTI.isValidCPUName(CPU);
+}
diff --git a/clang/lib/Basic/Targets/SPIR.h b/clang/lib/Basic/Targets/SPIR.h
index 78505d66d6f2f..7aa13cbeb89fd 100644
--- a/clang/lib/Basic/Targets/SPIR.h
+++ b/clang/lib/Basic/Targets/SPIR.h
@@ -432,6 +432,10 @@ class LLVM_LIBRARY_VISIBILITY SPIRV64AMDGCNTargetInfo final
   }
 
   bool hasInt128Type() const override { return TargetInfo::hasInt128Type(); }
+
+  // This is only needed for validating arguments passed to
+  // __builtin_amdgcn_processor_is
+  bool isValidCPUName(StringRef Name) const override;
 };
 
 } // namespace targets
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index b56b739094ff3..7b1a3815144b4 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -284,6 +284,18 @@ void CodeGenFunction::AddAMDGPUFenceAddressSpaceMMRA(llvm::Instruction *Inst,
   Inst->setMetadata(LLVMContext::MD_mmra, MMRAMetadata::getMD(Ctx, MMRAs));
 }
 
+static Value *GetOrInsertAMDGPUPredicate(CodeGenFunction &CGF, Twine Name) {
+  auto PTy = IntegerType::getInt1Ty(CGF.getLLVMContext());
+
+  auto P = cast<GlobalVariable>(
+      CGF.CGM.getModule().getOrInsertGlobal(Name.str(), PTy));
+  P->setConstant(true);
+  P->setExternallyInitialized(true);
+
+  return CGF.Builder.CreateLoad(RawAddress(P, PTy, CharUnits::One(),
+                                           KnownNonNull));
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
@@ -585,6 +597,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Value *Env = EmitScalarExpr(E->getArg(0));
     return Builder.CreateCall(F, {Env});
   }
+  case AMDGPU::BI__builtin_amdgcn_processor_is: {
+    assert(CGM.getTriple().isSPIRV() &&
+           "__builtin_amdgcn_processor_is should never reach CodeGen for "
+             "concrete targets!");
+    StringRef Proc = cast<clang::StringLiteral>(E->getArg(0))->getString();
+    return GetOrInsertAMDGPUPredicate(*this, "llvm.amdgcn.is." + Proc);
+  }
+  case AMDGPU::BI__builtin_amdgcn_is_invocable: {
+    assert(CGM.getTriple().isSPIRV() &&
+           "__builtin_amdgcn_is_invocable should never reach CodeGen for "
+           "concrete targets!");
+    auto FD = cast<FunctionDecl>(
+      cast<DeclRefExpr>(E->getArg(0))->getReferencedDeclOfCallee());
+    StringRef RF =
+        getContext().BuiltinInfo.getRequiredFeatures(FD->getBuiltinID());
+    return GetOrInsertAMDGPUPredicate(*this, "llvm.amdgcn.has." + RF);
+  }
   case AMDGPU::BI__builtin_amdgcn_read_exec:
     return EmitAMDGCNBallotForExec(*this, E, Int64Ty, Int64Ty, false);
   case AMDGPU::BI__builtin_amdgcn_read_exec_lo:
diff --git a/clang/lib/Sema/SemaExpr.cpp b/clang/lib/Sema/SemaExpr.cpp
index 7cc8374e69d73..24f5262ab3cf4 100644
--- a/clang/lib/Sema/SemaExpr.cpp
+++ b/clang/lib/Sema/SemaExpr.cpp
@@ -6541,6 +6541,22 @@ ExprResult Sema::BuildCallExpr(Scope *Scope, Expr *Fn, SourceLocation LParenLoc,
   if (Result.isInvalid()) return ExprError();
   Fn = Result.get();
 
+  // The __builtin_amdgcn_is_invocable builtin is special, and will be resolved
+  // later, when we check boolean conditions, for now we merely forward it
+  // without any additional checking.
+  if (Fn->getType() == Context.BuiltinFnTy && ArgExprs.size() == 1 &&
+      ArgExprs[0]->getType() == Context.BuiltinFnTy) {
+    auto FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());
+
+    if (FD->getName() == "__builtin_amdgcn_is_invocable") {
+      auto FnPtrTy = Context.getPointerType(FD->getType());
+      auto R = ImpCastExprToType(Fn, FnPtrTy, CK_BuiltinFnToFnPtr).get();
+      return CallExpr::Create(Context, R, ArgExprs, Context.VoidTy,
+                              ExprValueKind::VK_PRValue, RParenLoc,
+                              FPOptionsOverride());
+    }
+  }
+
   if (CheckArgsForPlaceholders(ArgExprs))
     return ExprError();
 
@@ -13234,6 +13250,20 @@ inline QualType Sema::CheckBitwiseOperands(ExprResult &LHS, ExprResult &RHS,
   return InvalidOperands(Loc, LHS, RHS);
 }
 
+static inline bool IsAMDGPUPredicateBI(Expr *E) {
+  if (!E->getType()->isVoidType())
+    return false;
+
+  if (auto CE = dyn_cast<CallExpr>(E)) {
+    if (auto BI = CE->getDirectCallee())
+      if (BI->getName() == "__builtin_amdgcn_processor_is" ||
+          BI->getName() == "__builtin_amdgcn_is_invocable")
+        return true;
+  }
+
+  return false;
+}
+
 // C99 6.5.[13,14]
 inline QualType Sema::CheckLogicalOperands(ExprResult &LHS, ExprResult &RHS,
                                            SourceLocation Loc,
@@ -13329,6 +13359,9 @@ inline QualType Sema::CheckLogicalOperands(ExprResult &LHS, ExprResult &RHS,
   // The following is safe because we only use this method for
   // non-overloadable operands.
 
+  if (IsAMDGPUPredicateBI(LHS.get()) && IsAMDGPUPredicateBI(RHS.get()))
+    return Context.VoidTy;
+
   // C++ [expr.log.and]p1
   // C++ [expr.log.or]p1
   // The operands are both contextually converted to type bool.
@@ -15576,6 +15609,38 @@ static bool isOverflowingIntegerType(ASTContext &Ctx, QualType T) {
   return Ctx.getIntWidth(T) >= Ctx.getIntWidth(Ctx.IntTy);
 }
 
+static Expr *ExpandAMDGPUPredicateBI(ASTContext &Ctx, CallExpr *CE) {
+  if (!CE->getBuiltinCallee())
+    return CXXBoolLiteralExpr::Create(Ctx, false, Ctx.BoolTy, CE->getExprLoc());
+
+  if (Ctx.getTargetInfo().getTriple().isSPIRV()) {
+    CE->setType(Ctx.getLogicalOperationType());
+    return CE;
+  }
+
+  bool P = false;
+  auto &TI = Ctx.getTargetInfo();
+
+  if (CE->getDirectCallee()->getName() == "__builtin_amdgcn_processor_is") {
+    auto GFX = dyn_cast<StringLiteral>(CE->getArg(0)->IgnoreParenCasts());
+    auto TID = TI.getTargetID();
+    if (GFX && TID) {
+      auto N = GFX->getString();
+      P = TI.isValidCPUName(GFX->getString()) && TID->find(N) == 0;
+    }
+  } else {
+    auto FD = cast<FunctionDecl>(CE->getArg(0)->getReferencedDeclOfCallee());
+
+    StringRef RF = Ctx.BuiltinInfo.getRequiredFeatures(FD->getBuiltinID());
+    llvm::StringMap<bool> CF;
+    Ctx.getFunctionFeatureMap(CF, FD);
+
+    P = Builtin::evaluateRequiredTargetFeatures(RF, CF);
+  }
+
+  return CXXBoolLiteralExpr::Create(Ctx, P, Ctx.BoolTy, CE->getExprLoc());
+}
+
 ExprResult Sema::CreateBuiltinUnaryOp(SourceLocation OpLoc,
                                       UnaryOperatorKind Opc, Expr *InputExpr,
                                       bool IsAfterAmp) {
@@ -15753,6 +15818,8 @@ ExprResult Sema::CreateBuiltinUnaryOp(SourceLocation OpLoc,
         // Vector logical not returns the signed variant of the operand type.
         resultType = GetSignedVectorType(resultType);
         break;
+      } else if (IsAMDGPUPredicateBI(InputExpr)) {
+        break;
       } else {
         return ExprError(Diag(OpLoc, diag::err_typecheck_unary_expr)
                          << resultType << Input.get()->getSourceRange());
@@ -20469,6 +20536,88 @@ void Sema::DiagnoseEqualityWithExtraParens(ParenExpr *ParenE) {
     }
 }
 
+static bool ValidateAMDGPUPredicateBI(Sema &Sema, CallExpr *CE) {
+  if (CE->getDirectCallee()->getName() == "__builtin_amdgcn_processor_is") {
+    auto GFX = dyn_cast<StringLiteral>(CE->getArg(0)->IgnoreParenCasts());
+    if (!GFX) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_processor_is_arg_not_literal);
+      return false;
+    }
+    auto N = GFX->getString();
+    if (!Sema.getASTContext().getTargetInfo().isValidCPUName(N) &&
+        (!Sema.getASTContext().getAuxTargetInfo() ||
+         !Sema.getASTContext().getAuxTargetInfo()->isValidCPUName(N))) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_processor_is_arg_invalid_value) << N;
+      return false;
+    }
+  } else {
+    auto Arg = CE->getArg(0);
+    if (!Arg || Arg->getType() != Sema.getASTContext().BuiltinFnTy) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_is_invocable_arg_invalid_value) << Arg;
+      return false;
+    }
+  }
+
+  return true;
+}
+
+static Expr *MaybeHandleAMDGPUPredicateBI(Sema &Sema, Expr *E, bool &Invalid) {
+  if (auto UO = dyn_cast<UnaryOperator>(E)) {
+    auto SE = dyn_cast<CallExpr>(UO->getSubExpr());
+    if (IsAMDGPUPredicateBI(SE)) {
+      assert(
+        UO->getOpcode() == UnaryOperator::Opcode::UO_LNot &&
+        "__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable "
+          "can only be used as operands of logical ops!");
+
+      if (!ValidateAMDGPUPredicateBI(Sema, SE)) {
+        Invalid = true;
+        return nullptr;
+      }
+
+      UO->setSubExpr(ExpandAMDGPUPredicateBI(Sema.getASTContext(), SE));
+      UO->setType(Sema.getASTContext().getLogicalOperationType());
+
+      return UO;
+    }
+  }
+  if (auto BO = dyn_cast<BinaryOperator>(E)) {
+    auto LHS = dyn_cast<CallExpr>(BO->getLHS());
+    auto RHS = dyn_cast<CallExpr>(BO->getRHS());
+    if (IsAMDGPUPredicateBI(LHS) && IsAMDGPUPredicateBI(RHS)) {
+      assert(
+          BO->isLogicalOp() &&
+          "__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable "
+            "can only be used as operands of logical ops!");
+
+      if (!ValidateAMDGPUPredicateBI(Sema, LHS) ||
+          !ValidateAMDGPUPredicateBI(Sema, RHS)) {
+        Invalid = true;
+        return nullptr;
+      }
+
+      BO->setLHS(ExpandAMDGPUPredicateBI(Sema.getASTContext(), LHS));
+      BO->setRHS(ExpandAMDGPUPredicateBI(Sema.getASTContext(), RHS));
+      BO->setType(Sema.getASTContext().getLogicalOperationType());
+
+      return BO;
+    }
+  }
+  if (auto CE = dyn_cast<CallExpr>(E))
+    if (IsAMDGPUPredicateBI(CE)) {
+      if (!ValidateAMDGPUPredicateBI(Sema, CE)) {
+        Invalid = true;
+        return nullptr;
+      }
+      return ExpandAMDGPUPredicateBI(Sema.getASTContext(), CE);
+    }
+
+  return nullptr;
+}
+
 ExprResult Sema::CheckBooleanCondition(SourceLocation Loc, Expr *E,
                                        bool IsConstexpr) {
   DiagnoseAssignmentAsCondition(E);
@@ -20480,6 +20629,14 @@ ExprResult Sema::CheckBooleanCondition(SourceLocation Loc, Expr *E,
   E = result.get();
 
   if (!E->isTypeDependent()) {
+    if (E->getType()->isVoidType()) {
+      bool IsInvalidPredicate = false;
+      if (auto BIC = MaybeHandleAMDGPUPredicateBI(*this, E, IsInvalidPredicate))
+        return BIC;
+      else if (IsInvalidPredicate)
+        return ExprError();
+    }
+
     if (getLangOpts().CPlusPlus)
       return CheckCXXBooleanCondition(E, IsConstexpr); // C++ 6.4p4
 
diff --git a/clang/test/CodeGen/amdgpu-builtin-cpu-is.c b/clang/test/CodeGen/amdgpu-builtin-cpu-is.c
new file mode 100644
index 0000000000000..6e261d9f5d239
--- /dev/null
+++ b/clang/test/CodeGen/amdgpu-builtin-cpu-is.c
@@ -0,0 +1,65 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals all --version 5
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx900 -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCN-GFX900 %s
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx1010 -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCN-GFX1010 %s
+// RUN: %clang_cc1 -triple spirv64-amd-amdhsa -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCNSPIRV %s
+
+// Test that, depending on triple and, if applicable, target-cpu, one of three
+// things happens:
+//    1) for gfx900 we emit a call to trap (concrete target, matches)
+//    2) for gfx1010 we emit an empty kernel (concrete target, does not match)
+//    3) for AMDGCNSPIRV we emit llvm.amdgcn.is.gfx900 as a bool global, and
+//       load from it to provide the condition a br (abstract target)
+//.
+// AMDGCN-GFX900: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 600
+//.
+// AMDGCN-GFX1010: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 600
+//.
+// AMDGCNSPIRV: @llvm.amdgcn.is.gfx900 = external addrspace(1) externally_initialized constant i1
+//.
+// AMDGCN-GFX900-LABEL: define dso_local void @foo(
+// AMDGCN-GFX900-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-GFX900-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-GFX900-NEXT:    call void @llvm.trap()
+// AMDGCN-GFX900-NEXT:    ret void
+//
+// AMDGCN-GFX1010-LABEL: define dso_local void @foo(
+// AMDGCN-GFX1010-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-GFX1010-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-GFX1010-NEXT:    ret void
+//
+// AMDGCNSPIRV-LABEL: define spir_func void @foo(
+// AMDGCNSPIRV-SAME: ) addrspace(4) #[[ATTR0:[0-9]+]] {
+// AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = load i1, ptr addrspace(1) @llvm.amdgcn.is.gfx900, align 1
+// AMDGCNSPIRV-NEXT:    br i1 [[TMP0]], label %[[IF_THEN:.*]], label %[[IF_END:.*]]
+// AMDGCNSPIRV:       [[IF_THEN]]:
+// AMDGCNSPIRV-NEXT:    call addrspace(4) void @llvm.trap()
+// AMDGCNSPIRV-NEXT:    br label %[[IF_END]]
+// AMDGCNSPIRV:       [[IF_END]]:
+// AMDGCNSPIRV-NEXT:    ret void
+//
+void foo() {
+    if (__builtin_cpu_is("gfx90...
[truncated]

llvmbot · 2025-04-02T02:35:32Z

@llvm/pr-subscribers-clang-codegen

Author: Alex Voicu (AlexVlx)

Changes

This change adds two semi-magical builtins for AMDGPU:

__builtin_amdgcn_processor_is, which is similar in observable behaviour with __builtin_cpu_is, except that it is never "evaluated" at run time;
__builtin_amdgcn_is_invocable, which is behaviourally similar with __has_builtin, except that it is not a macro (i.e. not evaluated at preprocessing time).

Neither of these are constexpr, even though when compiling for concrete (i.e. gfxXXX / gfxXXX-generic) targets they get evaluated in Clang, so they shouldn't tear the AST too badly / at all for multi-pass compilation cases like HIP. They can only be used in specific contexts (as args to control structures).

The motivation for adding these is two-fold:

as a nice to have, it provides an AST-visible way to incorporate architecture specific code, rather than having to rely on macros and the preprocessor, which burn in the choice quite early;
as a must have, it allows featureful AMDGCN flavoured SPIR-V to be produced, where target specific capability is guarded and chosen or discarded when finalising compilation for a concrete target.

I've tried to keep the overall footprint of the change small. The changes to Sema are a bit unpleasant, but there was a strong desire to have Clang validate these, and to constrain their uses, and this was the most compact solution I could come up with (suggestions welcome).

In the end, I will note there is nothing that is actually AMDGPU specific here, so it is possible that in the future, assuming interests from other targets / users, we'd just promote them to generic intrinsics.

Patch is 59.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134016.diff

17 Files Affected:

(modified) clang/docs/LanguageExtensions.rst (+110)
(modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+5)
(modified) clang/include/clang/Basic/DiagnosticSemaKinds.td (+10)
(modified) clang/lib/Basic/Targets/SPIR.cpp (+4)
(modified) clang/lib/Basic/Targets/SPIR.h (+4)
(modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+29)
(modified) clang/lib/Sema/SemaExpr.cpp (+157)
(added) clang/test/CodeGen/amdgpu-builtin-cpu-is.c (+65)
(added) clang/test/CodeGen/amdgpu-builtin-is-invocable.c (+64)
(added) clang/test/CodeGen/amdgpu-feature-builtins-invalid-use.cpp (+43)
(modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+9)
(added) llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp (+207)
(modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+2)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2-1)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-expand-feature-predicates-unfoldable.ll (+28)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-expand-feature-predicates.ll (+359)

diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index 3b8a9cac6587a..8a7cb75af13e5 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -4920,6 +4920,116 @@ If no address spaces names are provided, all address spaces are fenced.
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local")
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local", "global")
 
+__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``__builtin_amdgcn_processor_is`` and ``__builtin_amdgcn_is_invocable`` provide
+a functional mechanism for programatically querying:
+
+* the identity of the current target processor;
+* the capability of the current target processor to invoke a particular builtin.
+
+**Syntax**:
+
+.. code-block:: c
+
+  // When used as the predicate for a control structure
+  bool __builtin_amdgcn_processor_is(const char*);
+  bool __builtin_amdgcn_is_invocable(builtin_name);
+  // Otherwise
+  void __builtin_amdgcn_processor_is(const char*);
+  void __builtin_amdgcn_is_invocable(void);
+
+**Example of use**:
+
+.. code-block:: c++
+
+  if (__builtin_amdgcn_processor_is("gfx1201") ||
+      __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var))
+    __builtin_amdgcn_s_sleep_var(x);
+
+  if (!__builtin_amdgcn_processor_is("gfx906"))
+    __builtin_amdgcn_s_wait_event_export_ready();
+  else if (__builtin_amdgcn_processor_is("gfx1010") ||
+           __builtin_amdgcn_processor_is("gfx1101"))
+    __builtin_amdgcn_s_ttracedata_imm(1);
+
+  while (__builtin_amdgcn_processor_is("gfx1101")) *p += x;
+
+  do { *p -= x; } while (__builtin_amdgcn_processor_is("gfx1010"));
+
+  for (; __builtin_amdgcn_processor_is("gfx1201"); ++*p) break;
+
+  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready))
+    __builtin_amdgcn_s_wait_event_export_ready();
+  else if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_ttracedata_imm))
+    __builtin_amdgcn_s_ttracedata_imm(1);
+
+  do {
+    *p -= x;
+  } while (__builtin_amdgcn_is_invocable(__builtin_amdgcn_global_load_tr_b64_i32));
+
+  for (; __builtin_amdgcn_is_invocable(__builtin_amdgcn_permlane64); ++*p) break;
+
+**Description**:
+
+When used as the predicate value of the following control structures:
+
+.. code-block:: c++
+
+  if (...)
+  while (...)
+  do { } while (...)
+  for (...)
+
+be it directly, or as arguments to logical operators such as ``!, ||, &&``, the
+builtins return a boolean value that:
+
+* indicates whether the current target matches the argument; the argument MUST
+  be a string literal and a valid AMDGPU target
+* indicates whether the builtin function passed as the argument can be invoked
+  by the current target; the argument MUST be either a generic or AMDGPU
+  specific builtin name
+
+Outside of these contexts, the builtins have a ``void`` returning signature
+which prevents their misuse.
+
+**Example of invalid use**:
+
+.. code-block:: c++
+
+  void kernel(int* p, int x, bool (*pfn)(bool), const char* str) {
+    if (__builtin_amdgcn_processor_is("not_an_amdgcn_gfx_id")) return;
+    else if (__builtin_amdgcn_processor_is(str)) __builtin_trap();
+
+    bool a = __builtin_amdgcn_processor_is("gfx906");
+    const bool b = !__builtin_amdgcn_processor_is("gfx906");
+    const bool c = !__builtin_amdgcn_processor_is("gfx906");
+    bool d = __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    bool e = !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    const auto f =
+        !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready)
+        || __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    const auto g =
+        !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready)
+        || !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    __builtin_amdgcn_processor_is("gfx1201")
+      ? __builtin_amdgcn_s_sleep_var(x) : __builtin_amdgcn_s_sleep(42);
+    if (pfn(__builtin_amdgcn_processor_is("gfx1200")))
+      __builtin_amdgcn_s_sleep_var(x);
+
+    if (__builtin_amdgcn_is_invocable("__builtin_amdgcn_s_sleep_var")) return;
+    else if (__builtin_amdgcn_is_invocable(x)) __builtin_trap();
+  }
+
+When invoked while compiling for a concrete target, the builtins are evaluated
+early by Clang, and never produce any CodeGen effects / have no observable
+side-effects in IR. Conversely, when compiling for AMDGCN flavoured SPIR-v,
+which is an abstract target, a series of predicate values are implicitly
+created. These predicates get resolved when finalizing the compilation process
+for a concrete target, and shall reflect the latter's identity and features.
+Thus, it is possible to author high-level code, in e.g. HIP, that is target
+adaptive in a dynamic fashion, contrary to macro based mechanisms.
 
 ARM/AArch64 Language Extensions
 -------------------------------
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 44ef404aee72f..5d01a7e75f7e7 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -346,6 +346,11 @@ BUILTIN(__builtin_amdgcn_endpgm, "v", "nr")
 BUILTIN(__builtin_amdgcn_get_fpenv, "WUi", "n")
 BUILTIN(__builtin_amdgcn_set_fpenv, "vWUi", "n")
 
+// These are special FE only builtins intended for forwarding the requirements
+// to the ME.
+BUILTIN(__builtin_amdgcn_processor_is, "vcC*", "nctu")
+BUILTIN(__builtin_amdgcn_is_invocable, "v", "nctu")
+
 //===----------------------------------------------------------------------===//
 // R600-NI only builtins.
 //===----------------------------------------------------------------------===//
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 5e45482584946..45f0f9eb88e55 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -13054,4 +13054,14 @@ def err_acc_decl_for_routine
 // AMDGCN builtins diagnostics
 def err_amdgcn_global_load_lds_size_invalid_value : Error<"invalid size value">;
 def note_amdgcn_global_load_lds_size_valid_value : Note<"size must be %select{1, 2, or 4|1, 2, 4, 12 or 16}0">;
+def err_amdgcn_processor_is_arg_not_literal
+    : Error<"the argument to __builtin_amdgcn_processor_is must be a string "
+            "literal">;
+def err_amdgcn_processor_is_arg_invalid_value
+    : Error<"the argument to __builtin_amdgcn_processor_is must be a valid "
+            "AMDGCN processor identifier; '%0' is not valid">;
+def err_amdgcn_is_invocable_arg_invalid_value
+    : Error<"the argument to __builtin_amdgcn_is_invocable must be either a "
+            "target agnostic builtin or an AMDGCN target specific builtin; `%0`"
+            " is not valid">;
 } // end of sema component.
diff --git a/clang/lib/Basic/Targets/SPIR.cpp b/clang/lib/Basic/Targets/SPIR.cpp
index 5b5f47f9647a2..eb43d9b0be283 100644
--- a/clang/lib/Basic/Targets/SPIR.cpp
+++ b/clang/lib/Basic/Targets/SPIR.cpp
@@ -152,3 +152,7 @@ void SPIRV64AMDGCNTargetInfo::setAuxTarget(const TargetInfo *Aux) {
     Float128Format = DoubleFormat;
   }
 }
+
+bool SPIRV64AMDGCNTargetInfo::isValidCPUName(StringRef CPU) const {
+  return AMDGPUTI.isValidCPUName(CPU);
+}
diff --git a/clang/lib/Basic/Targets/SPIR.h b/clang/lib/Basic/Targets/SPIR.h
index 78505d66d6f2f..7aa13cbeb89fd 100644
--- a/clang/lib/Basic/Targets/SPIR.h
+++ b/clang/lib/Basic/Targets/SPIR.h
@@ -432,6 +432,10 @@ class LLVM_LIBRARY_VISIBILITY SPIRV64AMDGCNTargetInfo final
   }
 
   bool hasInt128Type() const override { return TargetInfo::hasInt128Type(); }
+
+  // This is only needed for validating arguments passed to
+  // __builtin_amdgcn_processor_is
+  bool isValidCPUName(StringRef Name) const override;
 };
 
 } // namespace targets
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index b56b739094ff3..7b1a3815144b4 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -284,6 +284,18 @@ void CodeGenFunction::AddAMDGPUFenceAddressSpaceMMRA(llvm::Instruction *Inst,
   Inst->setMetadata(LLVMContext::MD_mmra, MMRAMetadata::getMD(Ctx, MMRAs));
 }
 
+static Value *GetOrInsertAMDGPUPredicate(CodeGenFunction &CGF, Twine Name) {
+  auto PTy = IntegerType::getInt1Ty(CGF.getLLVMContext());
+
+  auto P = cast<GlobalVariable>(
+      CGF.CGM.getModule().getOrInsertGlobal(Name.str(), PTy));
+  P->setConstant(true);
+  P->setExternallyInitialized(true);
+
+  return CGF.Builder.CreateLoad(RawAddress(P, PTy, CharUnits::One(),
+                                           KnownNonNull));
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
@@ -585,6 +597,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Value *Env = EmitScalarExpr(E->getArg(0));
     return Builder.CreateCall(F, {Env});
   }
+  case AMDGPU::BI__builtin_amdgcn_processor_is: {
+    assert(CGM.getTriple().isSPIRV() &&
+           "__builtin_amdgcn_processor_is should never reach CodeGen for "
+             "concrete targets!");
+    StringRef Proc = cast<clang::StringLiteral>(E->getArg(0))->getString();
+    return GetOrInsertAMDGPUPredicate(*this, "llvm.amdgcn.is." + Proc);
+  }
+  case AMDGPU::BI__builtin_amdgcn_is_invocable: {
+    assert(CGM.getTriple().isSPIRV() &&
+           "__builtin_amdgcn_is_invocable should never reach CodeGen for "
+           "concrete targets!");
+    auto FD = cast<FunctionDecl>(
+      cast<DeclRefExpr>(E->getArg(0))->getReferencedDeclOfCallee());
+    StringRef RF =
+        getContext().BuiltinInfo.getRequiredFeatures(FD->getBuiltinID());
+    return GetOrInsertAMDGPUPredicate(*this, "llvm.amdgcn.has." + RF);
+  }
   case AMDGPU::BI__builtin_amdgcn_read_exec:
     return EmitAMDGCNBallotForExec(*this, E, Int64Ty, Int64Ty, false);
   case AMDGPU::BI__builtin_amdgcn_read_exec_lo:
diff --git a/clang/lib/Sema/SemaExpr.cpp b/clang/lib/Sema/SemaExpr.cpp
index 7cc8374e69d73..24f5262ab3cf4 100644
--- a/clang/lib/Sema/SemaExpr.cpp
+++ b/clang/lib/Sema/SemaExpr.cpp
@@ -6541,6 +6541,22 @@ ExprResult Sema::BuildCallExpr(Scope *Scope, Expr *Fn, SourceLocation LParenLoc,
   if (Result.isInvalid()) return ExprError();
   Fn = Result.get();
 
+  // The __builtin_amdgcn_is_invocable builtin is special, and will be resolved
+  // later, when we check boolean conditions, for now we merely forward it
+  // without any additional checking.
+  if (Fn->getType() == Context.BuiltinFnTy && ArgExprs.size() == 1 &&
+      ArgExprs[0]->getType() == Context.BuiltinFnTy) {
+    auto FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());
+
+    if (FD->getName() == "__builtin_amdgcn_is_invocable") {
+      auto FnPtrTy = Context.getPointerType(FD->getType());
+      auto R = ImpCastExprToType(Fn, FnPtrTy, CK_BuiltinFnToFnPtr).get();
+      return CallExpr::Create(Context, R, ArgExprs, Context.VoidTy,
+                              ExprValueKind::VK_PRValue, RParenLoc,
+                              FPOptionsOverride());
+    }
+  }
+
   if (CheckArgsForPlaceholders(ArgExprs))
     return ExprError();
 
@@ -13234,6 +13250,20 @@ inline QualType Sema::CheckBitwiseOperands(ExprResult &LHS, ExprResult &RHS,
   return InvalidOperands(Loc, LHS, RHS);
 }
 
+static inline bool IsAMDGPUPredicateBI(Expr *E) {
+  if (!E->getType()->isVoidType())
+    return false;
+
+  if (auto CE = dyn_cast<CallExpr>(E)) {
+    if (auto BI = CE->getDirectCallee())
+      if (BI->getName() == "__builtin_amdgcn_processor_is" ||
+          BI->getName() == "__builtin_amdgcn_is_invocable")
+        return true;
+  }
+
+  return false;
+}
+
 // C99 6.5.[13,14]
 inline QualType Sema::CheckLogicalOperands(ExprResult &LHS, ExprResult &RHS,
                                            SourceLocation Loc,
@@ -13329,6 +13359,9 @@ inline QualType Sema::CheckLogicalOperands(ExprResult &LHS, ExprResult &RHS,
   // The following is safe because we only use this method for
   // non-overloadable operands.
 
+  if (IsAMDGPUPredicateBI(LHS.get()) && IsAMDGPUPredicateBI(RHS.get()))
+    return Context.VoidTy;
+
   // C++ [expr.log.and]p1
   // C++ [expr.log.or]p1
   // The operands are both contextually converted to type bool.
@@ -15576,6 +15609,38 @@ static bool isOverflowingIntegerType(ASTContext &Ctx, QualType T) {
   return Ctx.getIntWidth(T) >= Ctx.getIntWidth(Ctx.IntTy);
 }
 
+static Expr *ExpandAMDGPUPredicateBI(ASTContext &Ctx, CallExpr *CE) {
+  if (!CE->getBuiltinCallee())
+    return CXXBoolLiteralExpr::Create(Ctx, false, Ctx.BoolTy, CE->getExprLoc());
+
+  if (Ctx.getTargetInfo().getTriple().isSPIRV()) {
+    CE->setType(Ctx.getLogicalOperationType());
+    return CE;
+  }
+
+  bool P = false;
+  auto &TI = Ctx.getTargetInfo();
+
+  if (CE->getDirectCallee()->getName() == "__builtin_amdgcn_processor_is") {
+    auto GFX = dyn_cast<StringLiteral>(CE->getArg(0)->IgnoreParenCasts());
+    auto TID = TI.getTargetID();
+    if (GFX && TID) {
+      auto N = GFX->getString();
+      P = TI.isValidCPUName(GFX->getString()) && TID->find(N) == 0;
+    }
+  } else {
+    auto FD = cast<FunctionDecl>(CE->getArg(0)->getReferencedDeclOfCallee());
+
+    StringRef RF = Ctx.BuiltinInfo.getRequiredFeatures(FD->getBuiltinID());
+    llvm::StringMap<bool> CF;
+    Ctx.getFunctionFeatureMap(CF, FD);
+
+    P = Builtin::evaluateRequiredTargetFeatures(RF, CF);
+  }
+
+  return CXXBoolLiteralExpr::Create(Ctx, P, Ctx.BoolTy, CE->getExprLoc());
+}
+
 ExprResult Sema::CreateBuiltinUnaryOp(SourceLocation OpLoc,
                                       UnaryOperatorKind Opc, Expr *InputExpr,
                                       bool IsAfterAmp) {
@@ -15753,6 +15818,8 @@ ExprResult Sema::CreateBuiltinUnaryOp(SourceLocation OpLoc,
         // Vector logical not returns the signed variant of the operand type.
         resultType = GetSignedVectorType(resultType);
         break;
+      } else if (IsAMDGPUPredicateBI(InputExpr)) {
+        break;
       } else {
         return ExprError(Diag(OpLoc, diag::err_typecheck_unary_expr)
                          << resultType << Input.get()->getSourceRange());
@@ -20469,6 +20536,88 @@ void Sema::DiagnoseEqualityWithExtraParens(ParenExpr *ParenE) {
     }
 }
 
+static bool ValidateAMDGPUPredicateBI(Sema &Sema, CallExpr *CE) {
+  if (CE->getDirectCallee()->getName() == "__builtin_amdgcn_processor_is") {
+    auto GFX = dyn_cast<StringLiteral>(CE->getArg(0)->IgnoreParenCasts());
+    if (!GFX) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_processor_is_arg_not_literal);
+      return false;
+    }
+    auto N = GFX->getString();
+    if (!Sema.getASTContext().getTargetInfo().isValidCPUName(N) &&
+        (!Sema.getASTContext().getAuxTargetInfo() ||
+         !Sema.getASTContext().getAuxTargetInfo()->isValidCPUName(N))) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_processor_is_arg_invalid_value) << N;
+      return false;
+    }
+  } else {
+    auto Arg = CE->getArg(0);
+    if (!Arg || Arg->getType() != Sema.getASTContext().BuiltinFnTy) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_is_invocable_arg_invalid_value) << Arg;
+      return false;
+    }
+  }
+
+  return true;
+}
+
+static Expr *MaybeHandleAMDGPUPredicateBI(Sema &Sema, Expr *E, bool &Invalid) {
+  if (auto UO = dyn_cast<UnaryOperator>(E)) {
+    auto SE = dyn_cast<CallExpr>(UO->getSubExpr());
+    if (IsAMDGPUPredicateBI(SE)) {
+      assert(
+        UO->getOpcode() == UnaryOperator::Opcode::UO_LNot &&
+        "__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable "
+          "can only be used as operands of logical ops!");
+
+      if (!ValidateAMDGPUPredicateBI(Sema, SE)) {
+        Invalid = true;
+        return nullptr;
+      }
+
+      UO->setSubExpr(ExpandAMDGPUPredicateBI(Sema.getASTContext(), SE));
+      UO->setType(Sema.getASTContext().getLogicalOperationType());
+
+      return UO;
+    }
+  }
+  if (auto BO = dyn_cast<BinaryOperator>(E)) {
+    auto LHS = dyn_cast<CallExpr>(BO->getLHS());
+    auto RHS = dyn_cast<CallExpr>(BO->getRHS());
+    if (IsAMDGPUPredicateBI(LHS) && IsAMDGPUPredicateBI(RHS)) {
+      assert(
+          BO->isLogicalOp() &&
+          "__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable "
+            "can only be used as operands of logical ops!");
+
+      if (!ValidateAMDGPUPredicateBI(Sema, LHS) ||
+          !ValidateAMDGPUPredicateBI(Sema, RHS)) {
+        Invalid = true;
+        return nullptr;
+      }
+
+      BO->setLHS(ExpandAMDGPUPredicateBI(Sema.getASTContext(), LHS));
+      BO->setRHS(ExpandAMDGPUPredicateBI(Sema.getASTContext(), RHS));
+      BO->setType(Sema.getASTContext().getLogicalOperationType());
+
+      return BO;
+    }
+  }
+  if (auto CE = dyn_cast<CallExpr>(E))
+    if (IsAMDGPUPredicateBI(CE)) {
+      if (!ValidateAMDGPUPredicateBI(Sema, CE)) {
+        Invalid = true;
+        return nullptr;
+      }
+      return ExpandAMDGPUPredicateBI(Sema.getASTContext(), CE);
+    }
+
+  return nullptr;
+}
+
 ExprResult Sema::CheckBooleanCondition(SourceLocation Loc, Expr *E,
                                        bool IsConstexpr) {
   DiagnoseAssignmentAsCondition(E);
@@ -20480,6 +20629,14 @@ ExprResult Sema::CheckBooleanCondition(SourceLocation Loc, Expr *E,
   E = result.get();
 
   if (!E->isTypeDependent()) {
+    if (E->getType()->isVoidType()) {
+      bool IsInvalidPredicate = false;
+      if (auto BIC = MaybeHandleAMDGPUPredicateBI(*this, E, IsInvalidPredicate))
+        return BIC;
+      else if (IsInvalidPredicate)
+        return ExprError();
+    }
+
     if (getLangOpts().CPlusPlus)
       return CheckCXXBooleanCondition(E, IsConstexpr); // C++ 6.4p4
 
diff --git a/clang/test/CodeGen/amdgpu-builtin-cpu-is.c b/clang/test/CodeGen/amdgpu-builtin-cpu-is.c
new file mode 100644
index 0000000000000..6e261d9f5d239
--- /dev/null
+++ b/clang/test/CodeGen/amdgpu-builtin-cpu-is.c
@@ -0,0 +1,65 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals all --version 5
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx900 -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCN-GFX900 %s
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx1010 -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCN-GFX1010 %s
+// RUN: %clang_cc1 -triple spirv64-amd-amdhsa -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCNSPIRV %s
+
+// Test that, depending on triple and, if applicable, target-cpu, one of three
+// things happens:
+//    1) for gfx900 we emit a call to trap (concrete target, matches)
+//    2) for gfx1010 we emit an empty kernel (concrete target, does not match)
+//    3) for AMDGCNSPIRV we emit llvm.amdgcn.is.gfx900 as a bool global, and
+//       load from it to provide the condition a br (abstract target)
+//.
+// AMDGCN-GFX900: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 600
+//.
+// AMDGCN-GFX1010: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 600
+//.
+// AMDGCNSPIRV: @llvm.amdgcn.is.gfx900 = external addrspace(1) externally_initialized constant i1
+//.
+// AMDGCN-GFX900-LABEL: define dso_local void @foo(
+// AMDGCN-GFX900-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-GFX900-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-GFX900-NEXT:    call void @llvm.trap()
+// AMDGCN-GFX900-NEXT:    ret void
+//
+// AMDGCN-GFX1010-LABEL: define dso_local void @foo(
+// AMDGCN-GFX1010-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-GFX1010-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-GFX1010-NEXT:    ret void
+//
+// AMDGCNSPIRV-LABEL: define spir_func void @foo(
+// AMDGCNSPIRV-SAME: ) addrspace(4) #[[ATTR0:[0-9]+]] {
+// AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = load i1, ptr addrspace(1) @llvm.amdgcn.is.gfx900, align 1
+// AMDGCNSPIRV-NEXT:    br i1 [[TMP0]], label %[[IF_THEN:.*]], label %[[IF_END:.*]]
+// AMDGCNSPIRV:       [[IF_THEN]]:
+// AMDGCNSPIRV-NEXT:    call addrspace(4) void @llvm.trap()
+// AMDGCNSPIRV-NEXT:    br label %[[IF_END]]
+// AMDGCNSPIRV:       [[IF_END]]:
+// AMDGCNSPIRV-NEXT:    ret void
+//
+void foo() {
+    if (__builtin_cpu_is("gfx90...
[truncated]

github-actions · 2025-04-02T02:37:17Z

✅ With the latest revision this PR passed the C/C++ code formatter.

jhuber6

Very cool, in general I'm a fan of being able to use LLVM-IR as a more general target. We already hack around these things in practice, so I think it's only beneficial to formalize is in a more correct way, even if LLVM-IR wasn't 'strictly' intended to be this kind of serialization format.

jhuber6 · 2025-04-02T02:44:39Z

clang/test/CodeGen/amdgpu-builtin-is-invocable.c

+// AMDGCNSPIRV-NEXT:    ret void
+//
+void foo() {
+    if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_permlanex16))


Is this intended to handle builtins that require certain target features to be set?

Could we get a test? Something simple like +dpp?

Could we get a test? Something simple like +dpp?

Sure, but if possible, could you clarify what you would like to be tested / what you expect to see, so that we avoid churning.

The issue with how the ROCm device libs does it, is that certain builtins require target features to be used. It hacks around this with the __attribute__((target)). I just want to know that you can call a builtin that requires +ddp features without that.

shiltian

This is worth a release note item.

AlexVlx · 2025-04-02T02:47:25Z

This is worth a release note item.

Indeed! I botched moving the changes from my internal scratchpad, and the rel notes got lost; fixing.

AaronBallman · 2025-06-17T11:30:26Z

clang/docs/LanguageExtensions.rst

+a functional mechanism for programatically querying:
+
+* the identity of the current target processor;
+* the capability of the current target processor to invoke a particular builtin.


CC @sarnex @jhuber6 as this relates to __has_builtin behavior somewhat and we've been in discussions about whether that should mean "I know about the builtin" or "I can actually call the builtin".

AaronBallman · 2025-06-17T11:34:06Z

clang/docs/LanguageExtensions.rst

+``__amdgpu_feature_predicate_t`` type behaves as an opaque, forward declared
+type with conditional automated conversion to ``_Bool`` when used as the


We should have a test case that stuff like this is diagnosed as using an incomplete type?

typeof(__builtin_amdgcn_processor_is("gfx900")) what;

Similar in C++ with decltype?

AaronBallman · 2025-06-17T11:35:54Z

clang/lib/Sema/SemaExpr.cpp

+  // without any additional checking.
+  if (Fn->getType() == Context.BuiltinFnTy && ArgExprs.size() == 1 &&
+      ArgExprs[0]->getType() == Context.BuiltinFnTy) {
+    auto *FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());


Suggested change

auto *FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());

const auto *FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());

AaronBallman · 2025-06-17T11:37:58Z

clang/lib/Sema/SemaInit.cpp

+    // __amdgpu_feature_predicate_t can be explicitly cast to the logical op
+    // type, although this is almost always an error and we advise against it


Suggested change

// __amdgpu_feature_predicate_t can be explicitly cast to the logical op

// type, although this is almost always an error and we advise against it

// __amdgpu_feature_predicate_t can be explicitly cast to the logical op

// type, although this is almost always an error and we advise against it.

I'm a bit confused; the comment says it's allowed but implies we should at least warn on it. But we're emitting an error diagnostic for that?

The error is trying to do this without a cast, which is what is being tested (this is the same behaviour one gets around a type that has an explicit conversion operator for bool). The error message is trying to inform the user that if they REALLY want to do this, they can, but they have to explicitly cast via C-cast or C++ static_cast, however they probably should not as it is almost always going to lead to an error. I am happy to reword the error message / add a warning on casting from __amdgpu_feature_predicate_t to bool / _Bool.

Ah, I see the logic now, I just didn't read hard enough. :-D

So why would the explicit cast almost always lead to an error given that we're doing the cast implicitly for them anyway?

Ah, I see the logic now, I just didn't read hard enough. :-D

So why would the explicit cast almost always lead to an error given that we're doing the cast implicitly for them anyway?

Context:) Since we're modelling this after a type with an explicit operator bool(), which is not regular (no default, copy or move ctors) only contextual conversions are valid i.e. (off the top of my head, I'd have to look up the exact reference in the standard):

controlling expression for if, while, for;

first operand of the ternary / conditional operator ?;

operands for guilt-in logical operators;

some constexpr / consteval cases that do not apply.

So, we do want if (__builtin_amdgcn_processor_is("gfx900") to just work as typed, as this is an intended usage pattern (user checks for a particular target, then does something that is target specific on the true branch).

However, if your expression is not contextually convertible to bool/_Bool, you are probably walking into trouble by doing something potentially nefarious like this:

void foo(const std::vector<bool>& ps) { // do stuff based on indexing into the predicate vector, e.g. // assume that the 0th element holds gfx900, the 1st holds // gfx901 etc. } void bar() { vector<bool> ps; ps.push_back(__builtin_amdgcn_processor_is("gfx900"); ps.push_back(__builtin_amdgcn_processor_is("gfx901"); }

which might make a lot of internal sense for the client app, and looks pretty harmless, but is
not something we can support in the limit (once it's buried under 5 additional layers of indirection, at -O0 etc.), and which is likely to end up a bug farm. These really do have sharp edges so the intention is to be defensive by default and ensure that it's very hard to step on a landmine.

I think I'm still missing something. If

if (__builtin_amdgcn_processor_is("gfx900")) ps.push_back(true);

works.... why would

ps.push_back(__builtin_amdgcn_processor_is("gfx900"));

be dangerous?

but is not something we can support in the limit (once it's buried under 5 additional layers of indirection, at -O0 etc.)

Why would optimization modes matter?

Apologies if these are dumb questions on my part. :-)

Let's take these piecewise. Your first example actually works / those are equivalent. I think the danger here is assuming that the sort of easy to type examples we are playing with are representative of where issues show up - they are not. The cases where things break down are somewhat more intricate - I chose pushing into a container via a function with side-effects on purpose.

I suspect that the sense that something is tied to optimiser behaviour is due to my reply above, which perhaps was insufficiently clear - apologies. I was trying to explain why making it trivial to store these as booleans somewhere leads to having to run large parts of the optimisation pipeline. There is no dependence on the optimiser, O0 and Ox behave in the same way in what regards the predicates, because we have a dedicated pass that unconditionally runs early in the pipeline, irrespective of optimisation level, and either succeeds at the needed folding or fails and diagnoses.

The __has_builtin counter-example actually does not work and cannot work, please see: https://gcc.godbolt.org/z/7G5Y1d85b. It fact, it's the essence of why we need these, the fact that that pattern does not work and cannot work, and yet it is extremely useful. Those situations are materially different because:

this is not about calling some generic omni-available code, it's about calling target specific code - this has to be statically decided on in the compiler, we MUST know if the target can run it or not, which is why this is a target specific BI;

furthermore, the ...later... bit is pretty important: what happens on that path? do you pass the boolean by reference into an extern function which gets linked in at run time (i.e. no idea what it does)? do you mutate the value based on a run time value? if you do any of those, your distant has_builtin variable no longer reflects the predicate, which is the issue;

the answer to the above bit might be "make it constexpr" - sure, but then it rolls back into not working for abstract targets / resolving these late, which is the gap in functionality with things like the __has_builtin macro that these try to fill.

I think the default expectation is that you should be able to query the processor information at any point you want, store the results anywhere you want, and use them later with the expected semantics - I don't think this is actually the case, unless what you are thinking about is __builtin_cpu_is, which is a different mechanism that operates at execution time.

Overall, this might be less profound / convoluted than we've made it seem:

use the predicates as intended, things work;

explicitly cast to bool and then stash:
a) if the chain formed from the point of cast to the final point of use can be folded in a terminator, for all uses of
the cast, happy days;
b) if for a chain from point of cast to final point of use folding fails (because you passed your value to an
opaque function, modified it based on a run time value etc.), you get an error and a diagnostic.

This is independent from optimisation level, and essentially matches what you would have to do with __has_builtin as well (except you'd have to make the stashed variable constexpr and then make the control structure be something like if constexpr).

I suspect that the sense that something is tied to optimiser behaviour is due to my reply above, which perhaps was insufficiently clear - apologies.

No worries, this is complex stuff! I appreciate your willingness to talk me through it. :-)

The __has_builtin counter-example actually does not work and cannot work, please see: https://gcc.godbolt.org/z/7G5Y1d85b.

I cannot imagine a situation in which this isn't indicative of a bug, but perhaps this situation is the same one that necessitated this PR which eventually concluded that we should change the behavior of __has_builtin rather than introduce a new builtin.

furthermore, the ...later... bit is pretty important: what happens on that path?

Anything in the world besides changing the value of has_builtin to something other than what __has_builtin returned.

if you do any of those, your distant has_builtin variable no longer reflects the predicate, which is the issue;

Why is that an issue? If the variable no longer reflects the predicate, that's not on the compiler to figure out how to deal with, that's "play silly games, win silly prizes".

Backing up a step.. my expectation is that this eventually lowers down to a test and jump which jumps past the target code if the test fails. e.g.,

%0 = load i8, ptr %b, align 1 %loadedv = trunc i8 %0 to i1 br i1 %loadedv, label %if.then, label %if.end if.then: # the target-specific instructions live here br label %if.end if.end: ret void

So we'd be generating instructions for the target which may be invalid if the test lies. If something did change that value so it no longer represents the predicate, I think that's UB (and we could help users catch that UB via a sanitizer check if we wanted to, rather than try to make the backend have to try to figure it out at compile time).

if for a chain from point of cast to final point of use folding fails (because you passed your value to an
opaque function, modified it based on a run time value etc.), you get an error and a diagnostic.

I was thinking you would not get a diagnostic; you'd get the behavior you asked for, which may be utter nonsense.

Am I missing something still? If so, maybe it would be quicker for us to hop in a telecon call? I'm going to be out of the office until Monday, but I'm happy to meet with you if that's more productive.

The __has_builtin counter-example actually does not work and cannot work, please see: https://gcc.godbolt.org/z/7G5Y1d85b.

I cannot imagine a situation in which this isn't indicative of a bug, but perhaps this situation is the same one that necessitated this PR which eventually concluded that we should change the behavior of __has_builtin rather than introduce a new builtin.

This is not actually a bug, it's intended behaviour. To obtain what you expect the b would have to be constexpr, and then the if itself would have to be if constexpr. Otherwise there's no binding commitment to evaluate this at compile time (and, in effect, if this gets trivially evaluated / removed in the FE, it induces dependence on optimisation level).

furthermore, the ...later... bit is pretty important: what happens on that path?

Anything in the world besides changing the value of has_builtin to something other than what __has_builtin returned.

if you do any of those, your distant has_builtin variable no longer reflects the predicate, which is the issue;

Why is that an issue? If the variable no longer reflects the predicate, that's not on the compiler to figure out how to deal with, that's "play silly games, win silly prizes".

It is a difficult conversation to have and not exactly what users want to hear, so making it as hard as possible to end up in an exchange where you have to say "welp, that was a silly game" cannot hurt. If anything, it's compassionate behaviour!

Backing up a step.. my expectation is that this eventually lowers down to a test and jump which jumps past the target code if the test fails. e.g.,

%0 = load i8, ptr %b, align 1 %loadedv = trunc i8 %0 to i1 br i1 %loadedv, label %if.then, label %if.end if.then: # the target-specific instructions live here br label %if.end if.end: ret void

So we'd be generating instructions for the target which may be invalid if the test lies. If something did change that value so it no longer represents the predicate, I think that's UB (and we could help users catch that UB via a sanitizer check if we wanted to, rather than try to make the backend have to try to figure it out at compile time).

This cannot work reliably (e.g. there are instructions that would simply fail at ISEL, and a run time jump doesn't mean that you do not lower to ISA the jumped around block), and introducing dependence on sanitizers seems not ideal. Furthermore, a run time jump isn't free, which is a concern for us, and we also already have a mechanism for that case (__attribute__((target))). Note that these can also control e.g. resource allocation, so actually generating both might lead to arbitrary exhaustion of a limited resource, and spurious compilation failures, consider e.g. (I'll use CUDA/HIP syntax):

// This is a bit odd, and technically a race because multiple lanes write to shared_buf void foo() { __shared__ int* shared_buf; if (__builtin_amdgcn_processor_is("gfx950") { __shared__ int buf[70 * 1024]; shared_buf = buf; } else { __shared__ int buf[60 * 1024]; shared_buf = buf; } __syncthreads(); // use shared_buf

If we tried to lower that we'd exhaust LDS, and spuriously fail to compile. This would have originated from perfectly valid uses of #if defined(__gfx950__) #else. We'd like these to work, so we must unambiguously do the fold ourselves.

if for a chain from point of cast to final point of use folding fails (because you passed your value to an
opaque function, modified it based on a run time value etc.), you get an error and a diagnostic.

I was thinking you would not get a diagnostic; you'd get the behavior you asked for, which may be utter nonsense.

One of the difficulties here (ignoring that the utter nonsense behaviour at run time might be nasal demons - GPUs aren't always as polite as to issue a SIGILL and graciously die:)) is that not all constructs / IR sequences / ASM uses lower into ISA, so what the user is more likely to get is an ICE with an error that makes no sense unless they work on LLVM. That's fairly grim user experience, IMHO, and one that we have the ability to prevent.

Am I missing something still? If so, maybe it would be quicker for us to hop in a telecon call? I'm going to be out of the office until Monday, but I'm happy to meet with you if that's more productive.

I would be absolutely happy to if you think it'd help. I regret not coming to the Sofia meeting, we could've probably sorted this out directly with a laptop:)

The __has_builtin counter-example actually does not work and cannot work, please see: https://gcc.godbolt.org/z/7G5Y1d85b.

I cannot imagine a situation in which this isn't indicative of a bug, but perhaps this situation is the same one that necessitated this PR which eventually concluded that we should change the behavior of __has_builtin rather than introduce a new builtin.

This is not actually a bug, it's intended behaviour. To obtain what you expect the b would have to be constexpr, and then the if itself would have to be if constexpr. Otherwise there's no binding commitment to evaluate this at compile time (and, in effect, if this gets trivially evaluated / removed in the FE, it induces dependence on optimisation level).

I... am an idiot. :-D Sorry, I think I must have been braindead when I wrote that because you're exactly correct. Sorry for the noise!

Backing up a step.. my expectation is that this eventually lowers down to a test and jump which jumps past the target code if the test fails. e.g.,

%0 = load i8, ptr %b, align 1 %loadedv = trunc i8 %0 to i1 br i1 %loadedv, label %if.then, label %if.end if.then: # the target-specific instructions live here br label %if.end if.end: ret void

So we'd be generating instructions for the target which may be invalid if the test lies. If something did change that value so it no longer represents the predicate, I think that's UB (and we could help users catch that UB via a sanitizer check if we wanted to, rather than try to make the backend have to try to figure it out at compile time).

This cannot work reliably (e.g. there are instructions that would simply fail at ISEL, and a run time jump doesn't mean that you do not lower to ISA the jumped around block), and introducing dependence on sanitizers seems not ideal. Furthermore, a run time jump isn't free, which is a concern for us, and we also already have a mechanism for that case (__attribute__((target))). Note that these can also control e.g. resource allocation, so actually generating both might lead to arbitrary exhaustion of a limited resource, and spurious compilation failures, consider e.g. (I'll use CUDA/HIP syntax):

// This is a bit odd, and technically a race because multiple lanes write to shared_buf void foo() { __shared__ int* shared_buf; if (__builtin_amdgcn_processor_is("gfx950") { __shared__ int buf[70 * 1024]; shared_buf = buf; } else { __shared__ int buf[60 * 1024]; shared_buf = buf; } __syncthreads(); // use shared_buf

If we tried to lower that we'd exhaust LDS, and spuriously fail to compile. This would have originated from perfectly valid uses of #if defined(__gfx950__) #else. We'd like these to work, so we must unambiguously do the fold ourselves.

Okay, so the situation is different than what I expected. I was unaware this would cause ISEL failures.

if for a chain from point of cast to final point of use folding fails (because you passed your value to an
opaque function, modified it based on a run time value etc.), you get an error and a diagnostic.

I was thinking you would not get a diagnostic; you'd get the behavior you asked for, which may be utter nonsense.

One of the difficulties here (ignoring that the utter nonsense behaviour at run time might be nasal demons - GPUs aren't always as polite as to issue a SIGILL and graciously die:)) is that not all constructs / IR sequences / ASM uses lower into ISA, so what the user is more likely to get is an ICE with an error that makes no sense unless they work on LLVM. That's fairly grim user experience, IMHO, and one that we have the ability to prevent.

Yeah, we obviously don't want the user experience to be compiler crashes. :-)

Am I missing something still? If so, maybe it would be quicker for us to hop in a telecon call? I'm going to be out of the office until Monday, but I'm happy to meet with you if that's more productive.

I would be absolutely happy to if you think it'd help. I regret not coming to the Sofia meeting, we could've probably sorted this out directly with a laptop:)

FWIW, I'm still pretty uncomfortable about this design. I keep coming back to this feeling really novel and seeming like it's designed to work around backend issues. If the user did something like this:

void func(std::vector<bool> processor_features) { if (processor_features[12]) { // SSE3 is allowed __asm__ ("do a bunch of sse3 stuff"); } else { // Do slow fallback stuff } }

they would reasonably expect that inline assembly to be non-problematic even if sse3 isn't available. But... when I try to play silly games in practice, we assert: https://godbolt.org/z/xc13WhW4W and so maybe I'm just wrong. CC @nikic for more opinions.

As for meeting to discuss, are you free sometime this week? I'm on US East Coast time, what times typically work best for you?

From the bottom up, anything but Friday should be good, including today starting from now to now + 6 hours:) I'm in the UK, so the delta is not so large anyway, pick something that fits your schedule and I'll probably be able to make it work.

For your example at the bottom, the ASM is non-problematic in that it goes through. Now substitute it with a builtin that is only there iff SSE3 is available, or try to bind registers from the extended x86_64 set and compile for an x86 target, and it'll go back to failing at compile time. It's that latter part that is problematic even for the user's experience.

I suspect that part of the issue here is that something like X86 hides a lot of this stuff under normal circumstances because folks don't really normally grab for special functionality, or feel the need for it. But if we have a look at the many SIMD extension sets, as well as the attempts at defining various levels of capability (the v1, v2, v3 things) I think the same challenge exists there, it's just not an immediate concern. What we're trying to solve is that whilst it makes perfect sense for a BE, any BE, to bind very tightly to the target, it is sometimes beneficial for the IR coming out of the FE to be generic and usable by many targets, without loss of capability. Without a mechanism as the one here one is either degraded to lowest common denominator capability, or has to play games trying to define capability levels, which generally end up being too coarse.

Also, please note that, in spite of me mentioning x86, at this point we are not proposing this for general use, but rather as a target specific BI, which hopefully reduces risk / contains any perceived novelty to parts where it's already been found to be useful:)

AaronBallman · 2025-06-17T11:38:49Z

clang/lib/Sema/SemaOverload.cpp

@@ -12015,6 +12017,16 @@ static void DiagnoseBadConversion(Sema &S, OverloadCandidate *Cand,
  if (TakingCandidateAddress && !checkAddressOfCandidateIsAvailable(S, Fn))
    return;

+  // __amdgpu_feature_predicate_t can be explicitly cast to the logical op type,
+  // although this is almost always an error and we advise against it.
+  if (FromTy == S.Context.AMDGPUFeaturePredicateTy &&


Same here as above regarding the comment not seeming to match the behavior.

nikic

I'm generally very unhappy about any kind of functionality that can cause compilation failures either because the optimizer did not optimize enough (including at O0) or because it optimized too much (producing code patterns that are no longer recognized as trivially dead).

nikic · 2025-06-23T20:13:51Z

llvm/lib/Target/AMDGPU/AMDGPUExpandFeaturePredicates.cpp

+      continue;
+    if (G.getName().starts_with("llvm.amdgcn."))
+      Predicates.push_back(&G);
+  }


This needs to be represented using an intrinsic instead of magic globals. Otherwise transforming load @g into %phi = phi [ @g ]; load %phi becomes an invalid transform.

nikic · 2025-06-23T20:16:11Z

llvm/lib/Target/AMDGPU/AMDGPUExpandFeaturePredicates.cpp

+}
+
+std::pair<PreservedAnalyses, bool> handlePredicate(const GCNSubtarget &ST,
+                                                   GlobalVariable *P) {


I think this per-predicate handling is going to break if two predicates get combined into a logical and at the IR level? When the first one is handled it will leave an unfoldable user, which would be foldable if both are handled.

This is true but I think that the this'd entail something like SCEV or InstructionSimplify running before the predicate expansion pass, which shouldn't happen with how we are ordering it. I might be missing some obvious case though, so if you have something in mind please share.

Possibly I'm misunderstanding how the pipeline here looks like. My assumption was that you have something like this going on:

clang generates IR -> compilation 1 without known target -> compilation 2 with known target

Where the predicates are expanded at the start of compilation 2, but compilation 1 could have arbitrarily optimized the IR before that.

If the resolution always happens immediately on the clang-generated IR, then I don't understand the purpose of the feature (as compared to always resolving in the frontend, that is).

Oh, this is a good question, it's probably gotten lost in the lengthy conversation. We have two cases, let me try to clarify:

We are targeting a concrete gfx### target, for which the features and capabilities are fully known at compile time / we know what we are lowering for -> the predicates get expanded and resolved in the FE, they never reach codegen / get emitted in IR;

We are targeting amdgcnspirv, which is abstract and for which the actual concrete target is only known at run time i.e. there's a lack of information / temporal decoupling:

the predicates allow one to write code that adapts to the capabilities of the actual target that the code will execute on;

we only know the target once we resume compilation for the concrete target, hence the need to emit them in IR, and then expand.

The ultimate state of affairs (not there yet due to historical issues / ongoing work) is that for the 2nd case the IR we generate SPIRV from is directly the pristine Clang output (+transforms needed for SPIRV, which do not impact these), so when we resume compilation at run time, it's on un-optimised FE-output IR. Furthermore, the expansion pass runs unconditionally, and is independent from optimisation level (which also implies it needs to be better about cleaning after itself, which I still owe an answer for). Hopefully that helps / makes some degree of sense?

So to clarify, optimizations will never be applied during the compilation to amdgcnspirv? If that's the case, I guess it's not likely that IR will be transformed in problematic ways.

It did occur to me that a way to guarantee that the folding works is by using a callbr intrinsic, something like this:

callbr void @llvm.amdgcn.processor.is(metadata "gfx803") to label %unsupported [label %supported]

This would make the check fundamentally inseparable from the control flow.

But I guess you'd have trouble round-tripping that via SPIRV...

So to clarify, optimizations will never be applied during the compilation to amdgcnspirv? If that's the case, I guess it's not likely that IR will be transformed in problematic ways.

Yes, this is the intention, it is still ongoing work - empirically we are not running into any of the potential issues you brought up, which is why I went ahead with upstreaming this part which is fairly important for library work (hard to author high-performance generic libs without this sort of mechanism). By the end of this year we should end up generating SPIRV from Clang's LLVMIR output, with no optimisations applied.

It did occur to me that a way to guarantee that the folding works is by using a callbr intrinsic, something like this:

callbr void @llvm.amdgcn.processor.is(metadata "gfx803") to label %unsupported [label %supported]

This would make the check fundamentally inseparable from the control flow.

But I guess you'd have trouble round-tripping that via SPIRV...

Ah, I actually hadn't thought of that but having had a glance yes, it's difficult to round trip. Something to consider in the future and if / when we try to make this generic rather than target specific, if there is interest.

llvm/lib/Target/AMDGPU/AMDGPUExpandFeaturePredicates.cpp

nikic · 2025-06-23T20:31:07Z

llvm/lib/Target/AMDGPU/AMDGPUExpandFeaturePredicates.cpp

+    } else if (I->isTerminator() && ConstantFoldTerminator(I->getParent())) {
+      continue;
+    } else if (I->users().empty()) {
+      continue;


I don't understand this case? Wouldn't this means that doing something like store i1 %predicate, ptr %somewhere would count as "folded"?

This is ill-formed (and also somewhat vestigial; a consequence of this being a bit long in the tooth), thank you for spotting it! The intention was to allow things like implicitly inserted llvm.assumes / prevent them from causing spurious failures. However, as you point out, this was wrong.

nikic · 2025-06-23T20:32:00Z

llvm/lib/Target/AMDGPU/AMDGPUExpandFeaturePredicates.cpp

+    return PreservedAnalyses::all();
+
+  const auto &ST = TM.getSubtarget<GCNSubtarget>(
+      *find_if(M, [](auto &&F) { return !F.isIntrinsic(); }));


This seems to be assuming the same subtarget for all functions? Does amdgpu not support target-features at all?

It does but the (gfxSMTH) target is uniform per compilation. The mechanism is roundabout but there's no other convenient way to query this information, at leas that I am aware of.

It's not convenient, but you should evaluate this in each individual function context. Really most of the targets should have been defined as full targets, not subtargets

in the real world the subtarget features for xnack may still differ between functions in a module

nikic · 2025-06-23T20:33:11Z

llvm/lib/Target/AMDGPU/AMDGPUExpandFeaturePredicates.cpp

+    auto *I = *ToFold.begin();
+    ToFold.erase(I);
+
+    if (auto *C = ConstantFoldInstruction(I, P->getDataLayout())) {


Do I understand correctly that this is relying on more optimization to happen afterwards for correctness, including at O0? We need the unreachable blocks to be DCEd, and any now unused functions to be DCEd, etc, otherwise we may get isel failures?

In what regards unreachable BBs, this looks like so because I hadn't fully considered the implications, and because my understanding is that we (LLVM, not AMDGPU) unconditionally run 'UnreachableBlockElimPass', irrespective of optimisation level. I think the latter is not incorrect, and that there is at least one other transform ('LowerInvokePass') that creates unreachable BBs and leaves them around. Having said that, it's not very hygienic and I will add cleanup for unreachable BBs.

With functions it's a bit trickier, and can actually get into somewhat convoluted use cases, which these predicates, as low-level target specific things, are not meant for. To be more specific, with normal use one would expect that for any and all functions the user would've applied predicates locally at the finest possible granularity i.e.

// THIS void foo() { if (__builtin_processor_is("gfx900")) do_something(); else if (__builtin_is_invocable(__builtin_amdgcn_some_builtin)) __builtin_amdgcn_some_builtin(); } void bar() { foo(); } // NOT THIS void foo() { do_something_that_only_works_on_gfx900_no_guard(); __builtin_amdgcn_only_gfx900(); } void bar() { if (__builtin_processor_is("gfx900")) foo(); }

If the guards are granular at expression / block scope at most, then there's no need to remove unused functions as they'd have been "cleaned up", for lack of a better word. I do appreciate that that is not an entirely satisfactory answer. I would lightly argue that since the second case is an anti-pattern (imagine these are proper large functions), it failing at compile time during ISEL is not that bad / an opportunity to not write it in the first place. Having said that, here's how we could handle functions:

We could remove functions with internal linkage, iff they end up unused after predicate expansion, as that implies that their only uses were predicate guarded;

We cannot do this for functions with external linkage (using internal and external loosely here), as they might have other valid uses in other TUs;

What we can do for the latter is:

Tag (metadata / attribute) when running the predicate expansion makes a previously used function unused;

Add an UnreachableFuncElimPass which unconditionally runs right before ISEL, and removes functions iff they are unused and carry the tag;

We can only do this for AMDGPU since at the moment we do not do dynamic linking

Dealing with the first category is straightforward, I could add it now or in a follow-up patch (I am not entirely sure that we do not already remove these unconditionally before ISEL anyway, the AMDGPU opt pipeline is fairly voluminous).

AlexVlx · 2025-06-23T20:52:44Z

I'm generally very unhappy about any kind of functionality that can cause compilation failures either because the optimizer did not optimize enough (including at O0) or because it optimized too much (producing code patterns that are no longer recognized as trivially dead).

Fortunately, that wouldn't be the case here, I don't think, unless you have something specific in mind (asides from the inquiry about what happens with now inaccessible blocks / dead functions, which I'll address where it was asked).

efriedma-quic · 2025-06-27T16:49:31Z

We briefly discussed this in the clang area team meeting, and we weren't really happy with the design as-is. The underlying idea behind the feature makes sense, but semantics of the actual builtin is ugly: there's a loose connection between the condition check, and the region of code it's guarding.

I spent a bit more time thinking about it after the meeting. Here's a potential alternative design: we add a new kind of if statement, a "processor-feature-if", spelled something like if __amdgcn_processor_is("gfx900") {}. In the body of the if statement, you're allowed to use builtins that would otherwise be illegal. This ensures a direct connection between the feature check and the corresponding builtins, so the frontend can analyze your usage and generate accurate diagnostics.

In the case where the target features are known during clang codegen, lowering is easy: you just skip generating the bodies of the if statements that don't match. If you want to some kind of "runtime" (actual runtime, or SPIR-V compilation-time) detection, it's not clear what the LLVM IR should look like: we only support specifying target features on a per-function level. But we can look at that separately.

rjmccall · 2025-06-27T20:11:24Z

Recognizing when the if condition is just a call to the builtin (possibly parenthesized or &&ed) seems totally sufficient to me. You could check that the builtin isn't used in any other position if you want, but I don't think that's really necessary.

efriedma-quic · 2025-06-27T20:25:01Z

I mean, I'm not particularly attached to the syntax of the "if". I guess we could designate if (__builtin_amdgcn_processor_is("gfx900")) {} as a "processor-feature-if". The point is that we need to know at the AST level which processor features are available for each statement.

rjmccall · 2025-06-27T21:02:07Z

Yeah, I agree with the other parts of your design, enabling the builtins within the guarded statements is a great way to handle it.

On a different point: I don't think this builtin is actually semantically different from __builtin_cpu_is. As long as we're not treating it as constexpr, the fact that it's lowered by the compiler and doesn't need a runtime check is just a happy property of GPU targeting rather than a fundamental difference. You could certainly imagine targets that do simply do this with a runtime switch. And the behavior of allowing additional builtin to be used within the guarded block seems like a nice feature that other targets would probably like to take advantage of.

We could allow __builtin_processor_is as an alternative name for that builtin if folks feel weird about having "cpu" in the name for a GPU target.

jhuber6 · 2025-06-27T21:07:41Z

We could allow __builtin_processor_is as an alternative name for that builtin if folks feel weird about having "cpu" in the name for a GPU target.

We already use -mcpu=gfx942 for targeting the GPU processor so I don't think it makes a huge difference. I've never heard of __builtin_cpu_is, doesn't seem like it has a single test in clang. Realistically what the GPU use-case needs is a way to avoid the normal builtin feature checks and guarantee that any children of that check get trimmed at O0 before the backend runs but after clang code generation.

AlexVlx · 2025-06-30T13:00:01Z

I mean, I'm not particularly attached to the syntax of the "if". I guess we could designate if (__builtin_amdgcn_processor_is("gfx900")) {} as a "processor-feature-if". The point is that we need to know at the AST level which processor features are available for each statement.

I don't quite see how to parse this statement to make it address the actual use case. These are useful because we cannot know, at the AST level (in the FE) which processor features are available. If we knew that we don't really need any additional mechanism, so this is just a different way to type #if defined / __has_builtin, which is not what is desired.

AlexVlx · 2025-06-30T13:09:18Z

On a different point: I don't think this builtin is actually semantically different from __builtin_cpu_is. As long as we're not treating it as constexpr, the fact that it's lowered by the compiler and doesn't need a runtime check is just a happy property of GPU targeting rather than a fundamental difference. You could certainly imagine targets that do simply do this with a runtime switch. And the behavior of allowing additional builtin to be used within the guarded block seems like a nice feature that other targets would probably like to take advantage of.

We could allow __builtin_processor_is as an alternative name for that builtin if folks feel weird about having "cpu" in the name for a GPU target.

The processor_is interface initially did not exist, but rather __builtin_cpu_is gained the ability to be statically resolved in the FE in certain cases / generate no run time code. There was strong opposition from some of my colleagues (some of which are on this thread) claiming that the semantics of __builtin_cpu_is mandate the existence of a run time check. The "cpu" bit wasn't really a problem:)

If you / other Clang owners are happy with extending __builtin_cpu_is, personally I would prefer that since I believe that it can be beneficial for targets other than ours / GPUs in general. For example, even for x86, there's a difference between e.g. x86_64-v2 and znver5, which could be resolved in the FE and remove the need to do a cpuid check at run time, and then go via a function call rather than direct inline code.

AlexVlx · 2025-06-30T14:56:43Z

We briefly discussed this in the clang area team meeting, and we weren't really happy with the design as-is. The underlying idea behind the feature makes sense, but semantics of the actual builtin is ugly: there's a loose connection between the condition check, and the region of code it's guarding.

Whilst I am thankful for the feedback I think it is somewhat unfortunate that we could not have a shared discussion about this, since I think that there are some core misunderstandings that seem to recur, which makes forward progress one way or the other difficult.

I spent a bit more time thinking about it after the meeting. Here's a potential alternative design: we add a new kind of if statement, a "processor-feature-if", spelled something like if __amdgcn_processor_is("gfx900") {}. In the body of the if statement, you're allowed to use builtins that would otherwise be illegal. This ensures a direct connection between the feature check and the corresponding builtins, so the frontend can analyze your usage and generate accurate diagnostics.

This has been considered, and doesn't quite address the use case (without ending up where the currently proposed design already is). Whilst this would have been significantly easier to discuss directly, I will try to enumerate the issues here:

we should not be touching the HLL at such an intrinsic (pun purely accidental) level i.e. we should not inject bespoke keywords / control structures etc. - this is extremely risky and can well be weaponised into making arguments (Clang / LLVM have implemented this, so it is clearly the way) about what should be standardised, and there isn't nearly enough experience to warrant that; we'd also be preventing meaningful cross-compiler single source / forcing other compilers to implement the exact same novel control structure; it's easier to detect a builtin than to detect whether a FE supports a novel keyword / control structure;
In relation with the above, I think that part of the confusion here is that the assumption is that the use case here is a mechanism like __device__ i.e. it's just about inline specifying a blob of "device" code in some "host" source, which is essentially ad-hoc language subsetting / dialect generation - that is absolutely not the case;
Whilst the discussion is using processor_is examples, that is for ease of parsing, arguably the is_invocable check is significantly more useful, as it operates at the right granularity; a particular builtin might be available across many architectures, so checking for that rather than a particular processor ensures that the code will tightly adapt, without changes; we are not just trying to find a way to mirror -mcpu;
The front-end cannot generate accurate diagnostics for the actual interesting case where the target is abstract (AMDGCNSPIRV, or the generic target @jhuber6 uses in GPU libc, if we extend things in that direction), because there isn't enough information - we only know what the concrete target is, and hence what features are available, at a point in time which is sequenced after the front-end has finished processing (could be run-time JIT for SPIR-V, could be bit code linking in a completely different compilation for GPU libc etc.);
There is not watertight mechanism here in the presence of indirect function calls / pointers to function, unless we start infecting function types (otherwise stated, the ABI) with their feature requirements, which would be extremely unfortunate, and probably intractable (because now things like the dynamic linker have to start caring); there is no "safe-design with accurate diagnostics" that prevents some user from checking for a predicate, then calling, via pointer, a function they imported from a library that is utterly incompatible with the invariants established by the predicate; these are specialist tools, with sharp edges, of which we do have quite a few already.

In the case where the target features are known during clang codegen, lowering is easy: you just skip generating the bodies of the if statements that don't match. If you want to some kind of "runtime" (actual runtime, or SPIR-V compilation-time) detection, it's not clear what the LLVM IR should look like: we only support specifying target features on a per-function level. But we can look at that separately.

As I have already mentioned in one of the replies to @rjmccall , this would be duplicating existing functionality, possibly in a more verbose and roundabout way. It is also already handled by what is being proposed, hence the awareness of it was present when the currently proposed design was put together. The interesting case is the second one, so, sadly, we cannot just look at that separately (and, IMHO, should not come up with novel IR constructs to solve this).

If the core of the objection here is that Clang really doesn't like that we're doing semantic checking for these / they have too large a footprint for target specific builtins, I can always just delete those bits, have these return bool and that's that. It will force us to maintain an OOT delta, so it's not ideal, but if that's what it takes to make forward progress, it is what it is.

rjmccall · 2025-06-30T16:25:35Z

On a different point: I don't think this builtin is actually semantically different from __builtin_cpu_is. As long as we're not treating it as constexpr, the fact that it's lowered by the compiler and doesn't need a runtime check is just a happy property of GPU targeting rather than a fundamental difference. You could certainly imagine targets that do simply do this with a runtime switch. And the behavior of allowing additional builtin to be used within the guarded block seems like a nice feature that other targets would probably like to take advantage of.
We could allow __builtin_processor_is as an alternative name for that builtin if folks feel weird about having "cpu" in the name for a GPU target.

The processor_is interface initially did not exist, but rather __builtin_cpu_is gained the ability to be statically resolved in the FE in certain cases / generate no run time code. There was strong opposition from some of my colleagues (some of which are on this thread) claiming that the semantics of __builtin_cpu_is mandate the existence of a run time check. The "cpu" bit wasn't really a problem:)

If you / other Clang owners are happy with extending __builtin_cpu_is, personally I would prefer that since I believe that it can be beneficial for targets other than ours / GPUs in general. For example, even for x86, there's a difference between e.g. x86_64-v2 and znver5, which could be resolved in the FE and remove the need to do a cpuid check at run time, and then go via a function call rather than direct inline code.

Right, I don't see any semantic reason why __builtin_cpu_is or __builtin_cpu_supports shouldn't be folded statically if we have sufficient information. -mcpu / -march isn't sufficient for folding __builtin_cpu_is, since those arguments just specify a minimum architecture and the builtin is doing an exact check, but that's emergent and shouldn't be taken as an inherent limitation of the builtin.

Where exactly the folding is done doesn't seem like something that we need to have an opinion on at the language level. As long as we're not making it a constant expression (which would specifically force it to be folded by the frontend), folding in the frontend vs. folding in a later pass seems like an implementation detail that programmers don't need to care about. So if AMDGPU needs to fold in a pass, great, fold in a pass.

rjmccall · 2025-06-30T16:42:13Z

Alex, can you talk about why your design decides to check for specific builtins rather than building out the set of features supported by __builtin_cpu_supports?

AlexVlx · 2025-06-30T16:45:03Z

Right, I don't see any semantic reason why __builtin_cpu_is or __builtin_cpu_supports shouldn't be resolved statically if we have that information on hand. -mcpu / -march are not usually sufficient for folding __builtin_cpu_is, since those attributes just specify a minimum architecture and the builtin is doing an exact check, but that's emergent and shouldn't be taken as an inherent limitation of the builtin.

This would be up to the target to evaluate, as it'd have knowledge of whether the argument to the call is sufficient. When I initially implemented this, I added additional supportsConstEvalCpuIs and a tryToFoldCpuIs interfaces (name is up for bikeshedding) to the TargetInfo base, and then a target would be in a position to choose whether to implement those and what to do about it. For example, in our case, the entire ExpandPredicate stuff from this patch would've ended up as the implementation for the fold attempt. As I said, if folks would prefer that (or if they just want __builtin_cpu_is to gain the abilities), I can either switch __builtin_amdgcn_processor_is back to that or create a separate PR.

Slightly independently, cpu_supports might turn out a bit difficult to use, at least for us (and I suspect other targets), as the feature definitions are often ad hoc and sadly mutable. Hence, telling users to check for this or that particular feature would be brittle and intrusive (we define features in a way that makes sense for the BE primarily, and don't really document them). Conversely, builtin names are somewhat more stable and easier to reason about.

rjmccall · 2025-06-30T16:55:25Z

Slightly independently, cpu_supports might turn out a bit difficult to use, at least for us (and I suspect other targets), as the feature definitions are often ad hoc and sadly mutable.

Hmm. Well, you get to define what feature names you recognize in __builtin_cpu_supports, so there's no reason to use marketing names, and if they're unstable you probably shouldn't. But I would imagine there's some stable technical grouping that's coarser-grained than whether an individual builtin is available.

AlexVlx · 2025-06-30T17:03:27Z

Alex, can you talk about why your design decides to check for specific builtins rather than building out the set of features supported by __builtin_cpu_supports?

I went into it a bit above without having seen your question (race condition I guess:) ), but to have it in one spot:

AMDGPU features are a bit volatile and subject to disruptive change, sadly (we should be better about this but it's going to be a marathon, and it's not entirely under our - LLVM compiler team - control);
We don't really document the features / they are formulated in a way that makes sense for the BE, and maybe for a compiler dev, but would be extremely confusing for an user - for example note that we have about a dozen DOT related features, which aren't always inclusive of each other, so you cannot actually infer that DOTn implies DOTn-1;
Conversely, the builtins devs reach for most often implement some specific capability i.e. just mirror an ISA instruction that they want to use (e.g. mfma / wmma), and these are documented via the ISA docs we publish, so having a per-builtin check seemed to match common usage and benefited from what is already in place as opposed to depending on hypothetical long-pole changes.

Now, this is specific to AMDGPU, I don't want to speculate too much about how other targets deal with this - which is another reason for which these are target builtins rather than going for something more generic.

MrSidims · 2025-06-30T17:11:35Z

Let me add my few cents here.

In the case where the target features are known during clang codegen, lowering is easy: you just skip generating the bodies of the if statements that don't match. If you want to some kind of "runtime" (actual runtime, or SPIR-V compilation-time) detection, it's not clear what the LLVM IR should look like: we only support specifying target features on a per-function level. But we can look at that separately.

Let me try to attempt to answer this question without introducing a new builtin in clang (at first). In SPIR-V there is specialization constant which AFAIK doesn't have a direct LLVM IR counterpart.
Some pseudo-code on SPIR-V would be looking like this:

%int = OpTypeInt 32 1
%runtime_known_hw_id = OpSpecConstant %int 0 // global var
%hw_id_that_supports_feature = OpConstant %int 42

kernel void foo(...) {
/* ... */
%cmp = OpIEqual %bool %runtime_known_hw_id %hw_id_that_supports_feature
if (%cmp = true) {
/* some feature */
} else {
/* other feature */
}

At runtime, when such SPIR-V module is JIT compiled OpSpecConstant materializes, so DCE (or better say some variation of DCE that is enforced to work with optnone) will be able to reason about %cmp result removing the dead branch, so we won't get unsupported feature at codegen.

Problem is: how to generate such SPIR-V from clang. So my understanding, that the new builtin should eventually lowered (by SPIR-V backend?) to a construct like in the pseudo-code, though that is not what is currently happening. And I believe, that existing __builtin_cpu_supports is not a good match for such lowering.

AlexVlx · 2025-06-30T17:38:55Z

Let me add my few cents here.

In the case where the target features are known during clang codegen, lowering is easy: you just skip generating the bodies of the if statements that don't match. If you want to some kind of "runtime" (actual runtime, or SPIR-V compilation-time) detection, it's not clear what the LLVM IR should look like: we only support specifying target features on a per-function level. But we can look at that separately.

Let me try to attempt to answer this question without introducing a new builtin in clang (at first). In SPIR-V there is specialization constant which AFAIK doesn't have a direct LLVM IR counterpart. Some pseudo-code on SPIR-V would be looking like this:
%int = OpTypeInt 32 1
%runtime_known_hw_id = OpSpecConstant %int 0 // global var
%hw_id_that_supports_feature = OpConstant %int 42

kernel void foo(...) {
/* ... */
%cmp = OpIEqual %bool %runtime_known_hw_id %hw_id_that_supports_feature
if (%cmp = true) {
/* some feature */
} else {
/* other feature */
}
At runtime, when such SPIR-V module is JIT compiled OpSpecConstant materializes, so DCE (or better say some variation of DCE that is enforced to work with optnone) will be able to reason about %cmp result removing the dead branch, so we won't get unsupported feature at codegen.

Problem is: how to generate such SPIR-V from clang. So my understanding, that the new builtin should eventually lowered (by SPIR-V backend?) to a construct like in the pseudo-code, though that is not what is currently happening. And I believe, that existing __builtin_cpu_supports is not a good match for such lowering.

This is one possible implementation indeed, for a workflow that goes from SPIR-V to ISA, or chooses to do the DCE in SPIR-V. Due to having to compose with an existing mature toolchain, rather than starting fresh, we have a slightly different flow where we reverse translate to LLVM IR and "resume" compilation from that point. Hence, the implicitly inserted never to be emitted globals, which play the role the spec constants play in your example, when coupled with the dedicated predicate expansion pass. Something similar could be added to e.g. spirv-opt. Thank you for the example, it is helpful.

jhuber6 · 2025-06-30T17:44:14Z

High liklihood that I'll need something similar for my GPU libraries so I'd prefer something not explicitly tied to SPIR-V.

rjmccall · 2025-06-30T17:57:59Z

An intrinsic seems like the right IR model for CPU recognition, even for targets that don't specifically need to late-resolve it. That should be much easier for passes to optimize based on CPU settings than directly emitting the compiler-rt reference in the frontend. I know that generating IR with conservative target options and then bumping the target CPU in a pass is something various people have been interested in, so late optimization is specifically worth planning for here.

We do have a theoretical problem with guaranteeing that non-matching code isn't emitted, because LLVM IR doesn't promise to leave a code sequence like this alone:

  %0 = call @llvm.compiler_supports(...)
  br i1 %0, label %foo, label %bar

LLVM could theoretically complicate this by e.g. introducing a PHI or an or. But that's a general LLVM problem that any lowering would have to deal with.

AlexVlx · 2025-06-30T18:42:13Z

An intrinsic seems like the right IR model for CPU recognition, even for targets that don't specifically need to late-resolve it. That should be much easier for passes to optimize based on CPU settings than directly emitting the compiler-rt reference in the frontend. I know that generating IR with conservative target options and then bumping the target CPU in a pass is something various people have been interested in, so late optimization is specifically worth planning for here.

We do have a theoretical problem with guaranteeing that non-matching code isn't emitted, because LLVM IR doesn't promise to leave a code sequence like this alone:
  %0 = call @llvm.compiler_supports(...)
  br i1 %0, label %foo, label %bar
LLVM could theoretically complicate this by e.g. introducing a PHI or an or. But that's a general LLVM problem that any lowering would have to deal with.

The solution we went with here (for our use case) is to just run the predicate expansion pass over pristine Clang generated IR, before any other optimisation. I think that @nikic suggested an alternative based on callbr, but that'd be somewhat challenging to represent in SPIRV which is important to us, but then again this could just be an implementation detail for cpu_is gets lowered, I guess? I.e., since we know we're only ever going to deal this early, we could just leave the call in place since we know no optimisation will complicate things, conversely other targets could go with callbr etc.

efriedma-quic · 2025-06-30T19:01:55Z

Whilst I am thankful for the feedback I think it is somewhat unfortunate that we could not have a shared discussion about this, since I think that there are some core misunderstandings that seem to recur, which makes forward progress one way or the other difficult.

We didn't really say much on the call itself; we just spent a minute while we were going through controversial RFCs/PRs, to call this out as something that needed attention.

If you think this topic would benefit from a meeting, we can organize one... but maybe a 1-on-1 chat would be better to start with, just to make sure we're on the same page.

The front-end cannot generate accurate diagnostics for the actual interesting case where the target is abstract (AMDGCNSPIRV, or the generic target @jhuber6 uses in GPU libc, if we extend things in that direction), because there isn't enough information - we only know what the concrete target is, and hence what features are available, at a point in time which is sequenced after the front-end has finished processing (could be run-time JIT for SPIR-V, could be bit code linking in a completely different compilation for GPU libc etc.);

If you have a construct like the following:

if (__builtin_amdgcn_processor_is("gfx900"))) {
  some_gfx9000_specific_intrinsic();
}

some_gfx9000_specific_intrinsic()

We can tell, statically, that the first call is correctly guarded by an if statement: it's guaranteed it will never run on a non-gfx9000 processor. The second call, on the other hand, is not. So we can add a frontend rule: the first call is legal, the second is not. Obviously the error has false positives, in the sense that we can't actually prove the second call is incorrect at runtime... but that's fine, probably.

What I don't want is that we end up with, essentially, the same constraint, but enforced by the backend.

There is not watertight mechanism here in the presence of indirect function calls / pointers to function

Sure; we can't stop people from calling arbitrary pointers.

We do have a theoretical problem with guaranteeing that non-matching code isn't emitted, because LLVM IR doesn't promise to leave a code sequence like this alone:

There are ways to solve this: for example, we can make the llvm.compiler.supports produce a token, and staple that token onto the intrinsics using a bundle. Making this work requires that IRGen knows which intrinic calls are actually impacted...

I care less about exactly how we solve this because we can adjust the solution later. Whatever we expose in the frontend is much harder to change later.

AlexVlx · 2025-06-30T22:05:06Z

If you think this topic would benefit from a meeting, we can organize one... but maybe a 1-on-1 chat would be better to start with, just to make sure we're on the same page.

Definitely, more than happy to have a 1-on-1 (2-on-1 even, since I think @AaronBallman also suggested something along these lines as well :) ).

The front-end cannot generate accurate diagnostics for the actual interesting case where the target is abstract (AMDGCNSPIRV, or the generic target @jhuber6 uses in GPU libc, if we extend things in that direction), because there isn't enough information - we only know what the concrete target is, and hence what features are available, at a point in time which is sequenced after the front-end has finished processing (could be run-time JIT for SPIR-V, could be bit code linking in a completely different compilation for GPU libc etc.);

If you have a construct like the following:
if (__builtin_amdgcn_processor_is("gfx900"))) {
  some_gfx9000_specific_intrinsic();
}

some_gfx9000_specific_intrinsic()
We can tell, statically, that the first call is correctly guarded by an if statement: it's guaranteed it will never run on a non-gfx9000 processor. The second call, on the other hand, is not. So we can add a frontend rule: the first call is legal, the second is not. Obviously the error has false positives, in the sense that we can't actually prove the second call is incorrect at runtime... but that's fine, probably.

I will note that on concrete targets, what is being proposed already works as described, by virtue of it being an error to call a builtin that is not available. Having said that, this gives me some trepidation and I think it can end up being user adverse. Consider the following case:

void foo() { 
  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_gfx9000_specific_intrinsic)) 
    __builtin_amdgcn_gfx9000_specific_intrinsic; 
}

void bar() {
  if (__builtin_amdgcn_processor_is("gfx9000")
    foo();

  foo();
}

We've just made the call to foo() illegal on anything that is not gfx9000, but that builtin / intrinsic could exist in 8999 other gfx versions. These don't always form binary, mutually exclusive structures. So I think I disagree with the "that's fine, probably".

What I don't want is that we end up with, essentially, the same constraint, but enforced by the backend.

Could you please detail why? Ultimately the BE still gets to decide on the legality of things that tweak it pretty intrinsically, even if said things come from otherwise linguistically correct constructs which have passed FE analysis. Also, we'd never really reach the BE, we're just sliding in immediately after Clang, before optimisation, so there's still enough info to provide an useful error message. Furthermore, this might be a better point to check anyways, as linking in bitcode could / should have already occured, so what would otherwise have been external symbols that impact viability would now be satisfied.

I care less about exactly how we solve this because we can adjust the solution later. Whatever we expose in the frontend is much harder to change later.

Between making the wrong choice and going with something that's user adverse early on, then trying to build increasingly complicated mechanisms to make it work, I would prefer we just left these as target specific, low level builtins returning bool, with no convenient Sema guards / errors. This solves actual problems we are aware of / matches uses we have already seen in practice. Target builtins with no safety handles are supposed to be volatile, and unstable specialist tools, so they don't encumber the FE in the same way. At the risk of being repetitive, there's already functionality in upstream that works along the same lines, so there is a precedent. Furthermore, if we derive a superior generic design, it'd hopefully stand on its own merits and would allow and warrant migration.

efriedma-quic · 2025-07-01T06:17:18Z

Definitely, more than happy to have a 1-on-1 (2-on-1 even, since I think @AaronBallman also suggested something along these lines as well :) ).

Please email me with some times that will work for you.

We've just made the call to foo() illegal on anything that is not gfx9000

I... don't think I'm suggesting this? The fact that a call to foo() from a __builtin_amdgcn_processor_is block shouldn't imply anything about other calls to foo().

What I'm basically suggesting is just exposing SPIR-V specialization constants as a C construct. Your example SPIR-V was something like:

%cmp = OpIEqual %bool %runtime_known_hw_id %hw_id_that_supports_feature
if (%cmp = true) {
/* some feature */
} else {
/* other feature */
}

We want to come up with a corresponding C construct that's guaranteed to compile to valid SPIR-V. My suggestion is something like:

if (__runtime_known_hw_id_eq("hw_id_that_supports_feature")) {
  /* some feature */
}

In the body of the if statement, you can use whatever intrinsics are legal on hw_id_that_supports_feature.

we're just sliding in immediately after Clang, before optimisation

Isn't doing checks immediately after IR generation basically the same as checking the AST, just on a slightly different representation?

Add the functional identity and feature queries.

91eeaf0

llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen IR generation bugs: mangling, exceptions, etc. labels Apr 2, 2025

AlexVlx added SPIR-V SPIR-V language support llvm:transforms and removed clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Apr 2, 2025

AlexVlx requested review from arsenm, efriedma-quic, jhuber6, MrSidims, rampitec, rjmccall, sarnex, shiltian and yxsamliu April 2, 2025 02:36

Fix format.

8bf1168

llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Apr 2, 2025

Fix broken patch merge.

3421292

jhuber6 reviewed Apr 2, 2025

View reviewed changes

shiltian reviewed Apr 2, 2025

View reviewed changes

AlexVlx requested review from AaronBallman and erichkeane April 2, 2025 02:46

AaronBallman reviewed Jun 17, 2025

View reviewed changes

jhuber6 mentioned this pull request Jun 17, 2025

Add Option to Enable Dead Block Elimination in NVVMReflect #144171

Open

AlexVlx added 2 commits June 18, 2025 18:18

Adopt suggestions.

c495630

Merge branch 'main' of https://github.com/llvm/llvm-project into zcfs

420a19c

nikic reviewed Jun 23, 2025

View reviewed changes

AlexVlx added 3 commits June 24, 2025 00:41

Implement some of the review suggestions.

dc0221e

Merge branch 'main' of https://github.com/llvm/llvm-project into zcfs

7f88eb7

Clean up unreachable BBs.

3b727b9

		``__amdgpu_feature_predicate_t`` type behaves as an opaque, forward declared
		type with conditional automated conversion to ``_Bool`` when used as the

	auto *FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());
	const auto *FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());

		// __amdgpu_feature_predicate_t can be explicitly cast to the logical op
		// type, although this is almost always an error and we advise against it

[AMDGPU][clang][CodeGen][opt] Add late-resolved feature identifying predicates #134016

Are you sure you want to change the base?

[AMDGPU][clang][CodeGen][opt] Add late-resolved feature identifying predicates #134016

Conversation

AlexVlx commented Apr 2, 2025

Uh oh!

llvmbot commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Apr 2, 2025

Uh oh!

github-actions bot commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhuber6 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shiltian left a comment

Choose a reason for hiding this comment

Uh oh!

AlexVlx commented Apr 2, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AlexVlx Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nikic Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

llvmbot commented Apr 2, 2025 •

edited

Loading

github-actions bot commented Apr 2, 2025 •

edited

Loading

AlexVlx Jun 17, 2025 •

edited

Loading

nikic Jun 24, 2025 •

edited

Loading