[AMDGPU][clang][CodeGen][opt] Add late-resolved feature identifying predicates #134016

Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-clang

Author: Alex Voicu (AlexVlx)

Changes

This change adds two semi-magical builtins for AMDGPU:

* __builtin_amdgcn_processor_is
* __builtin_amdgcn_is_invocable

Neither of these are

The motivation for adding these is two-fold:

I've tried to keep the overall footprint of the change small. The changes to Sema are a bit unpleasant, but there was a strong desire to have Clang validate these, and to constrain their uses, and this was the most compact solution I could come up with (suggestions welcome). In the end, I will note there is nothing that is actually AMDGPU specific here, so it is possible that in the future, assuming interest from other targets / users, we'd just promote them to generic intrinsics.

Patch is 59.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134016.diff

17 Files Affected:
diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index 3b8a9cac6587a..8a7cb75af13e5 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -4920,6 +4920,116 @@ If no address spaces names are provided, all address spaces are fenced.
__builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local")
__builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local", "global")
+__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``__builtin_amdgcn_processor_is`` and ``__builtin_amdgcn_is_invocable`` provide
+a functional mechanism for programatically querying:
+
+* the identity of the current target processor;
+* the capability of the current target processor to invoke a particular builtin.
+
+**Syntax**:
+
+.. code-block:: c
+
+ // When used as the predicate for a control structure
+ bool __builtin_amdgcn_processor_is(const char*);
+ bool __builtin_amdgcn_is_invocable(builtin_name);
+ // Otherwise
+ void __builtin_amdgcn_processor_is(const char*);
+ void __builtin_amdgcn_is_invocable(void);
+
+**Example of use**:
+
+.. code-block:: c++
+
+ if (__builtin_amdgcn_processor_is("gfx1201") ||
+ __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var))
+ __builtin_amdgcn_s_sleep_var(x);
+
+ if (!__builtin_amdgcn_processor_is("gfx906"))
+ __builtin_amdgcn_s_wait_event_export_ready();
+ else if (__builtin_amdgcn_processor_is("gfx1010") ||
+ __builtin_amdgcn_processor_is("gfx1101"))
+ __builtin_amdgcn_s_ttracedata_imm(1);
+
+ while (__builtin_amdgcn_processor_is("gfx1101")) *p += x;
+
+ do { *p -= x; } while (__builtin_amdgcn_processor_is("gfx1010"));
+
+ for (; __builtin_amdgcn_processor_is("gfx1201"); ++*p) break;
+
+ if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready))
+ __builtin_amdgcn_s_wait_event_export_ready();
+ else if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_ttracedata_imm))
+ __builtin_amdgcn_s_ttracedata_imm(1);
+
+ do {
+ *p -= x;
+ } while (__builtin_amdgcn_is_invocable(__builtin_amdgcn_global_load_tr_b64_i32));
+
+ for (; __builtin_amdgcn_is_invocable(__builtin_amdgcn_permlane64); ++*p) break;
+
+**Description**:
+
+When used as the predicate value of the following control structures:
+
+.. code-block:: c++
+
+ if (...)
+ while (...)
+ do { } while (...)
+ for (...)
+
+be it directly, or as arguments to logical operators such as ``!, ||, &&``, the
+builtins return a boolean value that:
+
+* indicates whether the current target matches the argument; the argument MUST
+ be a string literal and a valid AMDGPU target
+* indicates whether the builtin function passed as the argument can be invoked
+ by the current target; the argument MUST be either a generic or AMDGPU
+ specific builtin name
+
+Outside of these contexts, the builtins have a ``void`` returning signature
+which prevents their misuse.
+
+**Example of invalid use**:
+
+.. code-block:: c++
+
+ void kernel(int* p, int x, bool (*pfn)(bool), const char* str) {
+ if (__builtin_amdgcn_processor_is("not_an_amdgcn_gfx_id")) return;
+ else if (__builtin_amdgcn_processor_is(str)) __builtin_trap();
+
+ bool a = __builtin_amdgcn_processor_is("gfx906");
+ const bool b = !__builtin_amdgcn_processor_is("gfx906");
+ const bool c = !__builtin_amdgcn_processor_is("gfx906");
+ bool d = __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+ bool e = !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+ const auto f =
+ !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready)
+ || __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+ const auto g =
+ !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready)
+ || !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+ __builtin_amdgcn_processor_is("gfx1201")
+ ? __builtin_amdgcn_s_sleep_var(x) : __builtin_amdgcn_s_sleep(42);
+ if (pfn(__builtin_amdgcn_processor_is("gfx1200")))
+ __builtin_amdgcn_s_sleep_var(x);
+
+ if (__builtin_amdgcn_is_invocable("__builtin_amdgcn_s_sleep_var")) return;
+ else if (__builtin_amdgcn_is_invocable(x)) __builtin_trap();
+ }
+
+When invoked while compiling for a concrete target, the builtins are evaluated
+early by Clang, and never produce any CodeGen effects / have no observable
+side-effects in IR. Conversely, when compiling for AMDGCN flavoured SPIR-v,
+which is an abstract target, a series of predicate values are implicitly
+created. These predicates get resolved when finalizing the compilation process
+for a concrete target, and shall reflect the latter's identity and features.
+Thus, it is possible to author high-level code, in e.g. HIP, that is target
+adaptive in a dynamic fashion, contrary to macro based mechanisms.
ARM/AArch64 Language Extensions
-------------------------------
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 44ef404aee72f..5d01a7e75f7e7 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -346,6 +346,11 @@ BUILTIN(__builtin_amdgcn_endpgm, "v", "nr")
BUILTIN(__builtin_amdgcn_get_fpenv, "WUi", "n")
BUILTIN(__builtin_amdgcn_set_fpenv, "vWUi", "n")
+// These are special FE only builtins intended for forwarding the requirements
+// to the ME.
+BUILTIN(__builtin_amdgcn_processor_is, "vcC*", "nctu")
+BUILTIN(__builtin_amdgcn_is_invocable, "v", "nctu")
+
//===----------------------------------------------------------------------===//
// R600-NI only builtins.
//===----------------------------------------------------------------------===//
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 5e45482584946..45f0f9eb88e55 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -13054,4 +13054,14 @@ def err_acc_decl_for_routine
// AMDGCN builtins diagnostics
def err_amdgcn_global_load_lds_size_invalid_value : Error<"invalid size value">;
def note_amdgcn_global_load_lds_size_valid_value : Note<"size must be %select{1, 2, or 4|1, 2, 4, 12 or 16}0">;
+def err_amdgcn_processor_is_arg_not_literal
+ : Error<"the argument to __builtin_amdgcn_processor_is must be a string "
+ "literal">;
+def err_amdgcn_processor_is_arg_invalid_value
+ : Error<"the argument to __builtin_amdgcn_processor_is must be a valid "
+ "AMDGCN processor identifier; '%0' is not valid">;
+def err_amdgcn_is_invocable_arg_invalid_value
+ : Error<"the argument to __builtin_amdgcn_is_invocable must be either a "
+ "target agnostic builtin or an AMDGCN target specific builtin; `%0`"
+ " is not valid">;
} // end of sema component.
diff --git a/clang/lib/Basic/Targets/SPIR.cpp b/clang/lib/Basic/Targets/SPIR.cpp
index 5b5f47f9647a2..eb43d9b0be283 100644
--- a/clang/lib/Basic/Targets/SPIR.cpp
+++ b/clang/lib/Basic/Targets/SPIR.cpp
@@ -152,3 +152,7 @@ void SPIRV64AMDGCNTargetInfo::setAuxTarget(const TargetInfo *Aux) {
Float128Format = DoubleFormat;
}
}
+
+bool SPIRV64AMDGCNTargetInfo::isValidCPUName(StringRef CPU) const {
+ return AMDGPUTI.isValidCPUName(CPU);
+}
diff --git a/clang/lib/Basic/Targets/SPIR.h b/clang/lib/Basic/Targets/SPIR.h
index 78505d66d6f2f..7aa13cbeb89fd 100644
--- a/clang/lib/Basic/Targets/SPIR.h
+++ b/clang/lib/Basic/Targets/SPIR.h
@@ -432,6 +432,10 @@ class LLVM_LIBRARY_VISIBILITY SPIRV64AMDGCNTargetInfo final
}
bool hasInt128Type() const override { return TargetInfo::hasInt128Type(); }
+
+ // This is only needed for validating arguments passed to
+ // __builtin_amdgcn_processor_is
+ bool isValidCPUName(StringRef Name) const override;
};
} // namespace targets
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index b56b739094ff3..7b1a3815144b4 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -284,6 +284,18 @@ void CodeGenFunction::AddAMDGPUFenceAddressSpaceMMRA(llvm::Instruction *Inst,
Inst->setMetadata(LLVMContext::MD_mmra, MMRAMetadata::getMD(Ctx, MMRAs));
}
+static Value *GetOrInsertAMDGPUPredicate(CodeGenFunction &CGF, Twine Name) {
+ auto PTy = IntegerType::getInt1Ty(CGF.getLLVMContext());
+
+ auto P = cast<GlobalVariable>(
+ CGF.CGM.getModule().getOrInsertGlobal(Name.str(), PTy));
+ P->setConstant(true);
+ P->setExternallyInitialized(true);
+
+ return CGF.Builder.CreateLoad(RawAddress(P, PTy, CharUnits::One(),
+ KnownNonNull));
+}
+
Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
const CallExpr *E) {
llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
@@ -585,6 +597,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
llvm::Value *Env = EmitScalarExpr(E->getArg(0));
return Builder.CreateCall(F, {Env});
}
+ case AMDGPU::BI__builtin_amdgcn_processor_is: {
+ assert(CGM.getTriple().isSPIRV() &&
+ "__builtin_amdgcn_processor_is should never reach CodeGen for "
+ "concrete targets!");
+ StringRef Proc = cast<clang::StringLiteral>(E->getArg(0))->getString();
+ return GetOrInsertAMDGPUPredicate(*this, "llvm.amdgcn.is." + Proc);
+ }
+ case AMDGPU::BI__builtin_amdgcn_is_invocable: {
+ assert(CGM.getTriple().isSPIRV() &&
+ "__builtin_amdgcn_is_invocable should never reach CodeGen for "
+ "concrete targets!");
+ auto FD = cast<FunctionDecl>(
+ cast<DeclRefExpr>(E->getArg(0))->getReferencedDeclOfCallee());
+ StringRef RF =
+ getContext().BuiltinInfo.getRequiredFeatures(FD->getBuiltinID());
+ return GetOrInsertAMDGPUPredicate(*this, "llvm.amdgcn.has." + RF);
+ }
case AMDGPU::BI__builtin_amdgcn_read_exec:
return EmitAMDGCNBallotForExec(*this, E, Int64Ty, Int64Ty, false);
case AMDGPU::BI__builtin_amdgcn_read_exec_lo:
diff --git a/clang/lib/Sema/SemaExpr.cpp b/clang/lib/Sema/SemaExpr.cpp
index 7cc8374e69d73..24f5262ab3cf4 100644
--- a/clang/lib/Sema/SemaExpr.cpp
+++ b/clang/lib/Sema/SemaExpr.cpp
@@ -6541,6 +6541,22 @@ ExprResult Sema::BuildCallExpr(Scope *Scope, Expr *Fn, SourceLocation LParenLoc,
if (Result.isInvalid()) return ExprError();
Fn = Result.get();
+ // The __builtin_amdgcn_is_invocable builtin is special, and will be resolved
+ // later, when we check boolean conditions, for now we merely forward it
+ // without any additional checking.
+ if (Fn->getType() == Context.BuiltinFnTy && ArgExprs.size() == 1 &&
+ ArgExprs[0]->getType() == Context.BuiltinFnTy) {
+ auto FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());
+
+ if (FD->getName() == "__builtin_amdgcn_is_invocable") {
+ auto FnPtrTy = Context.getPointerType(FD->getType());
+ auto R = ImpCastExprToType(Fn, FnPtrTy, CK_BuiltinFnToFnPtr).get();
+ return CallExpr::Create(Context, R, ArgExprs, Context.VoidTy,
+ ExprValueKind::VK_PRValue, RParenLoc,
+ FPOptionsOverride());
+ }
+ }
+
if (CheckArgsForPlaceholders(ArgExprs))
return ExprError();
@@ -13234,6 +13250,20 @@ inline QualType Sema::CheckBitwiseOperands(ExprResult &LHS, ExprResult &RHS,
return InvalidOperands(Loc, LHS, RHS);
}
+static inline bool IsAMDGPUPredicateBI(Expr *E) {
+ if (!E->getType()->isVoidType())
+ return false;
+
+ if (auto CE = dyn_cast<CallExpr>(E)) {
+ if (auto BI = CE->getDirectCallee())
+ if (BI->getName() == "__builtin_amdgcn_processor_is" ||
+ BI->getName() == "__builtin_amdgcn_is_invocable")
+ return true;
+ }
+
+ return false;
+}
+
// C99 6.5.[13,14]
inline QualType Sema::CheckLogicalOperands(ExprResult &LHS, ExprResult &RHS,
SourceLocation Loc,
@@ -13329,6 +13359,9 @@ inline QualType Sema::CheckLogicalOperands(ExprResult &LHS, ExprResult &RHS,
// The following is safe because we only use this method for
// non-overloadable operands.
+ if (IsAMDGPUPredicateBI(LHS.get()) && IsAMDGPUPredicateBI(RHS.get()))
+ return Context.VoidTy;
+
// C++ [expr.log.and]p1
// C++ [expr.log.or]p1
// The operands are both contextually converted to type bool.
@@ -15576,6 +15609,38 @@ static bool isOverflowingIntegerType(ASTContext &Ctx, QualType T) {
return Ctx.getIntWidth(T) >= Ctx.getIntWidth(Ctx.IntTy);
}
+static Expr *ExpandAMDGPUPredicateBI(ASTContext &Ctx, CallExpr *CE) {
+ if (!CE->getBuiltinCallee())
+ return CXXBoolLiteralExpr::Create(Ctx, false, Ctx.BoolTy, CE->getExprLoc());
+
+ if (Ctx.getTargetInfo().getTriple().isSPIRV()) {
+ CE->setType(Ctx.getLogicalOperationType());
+ return CE;
+ }
+
+ bool P = false;
+ auto &TI = Ctx.getTargetInfo();
+
+ if (CE->getDirectCallee()->getName() == "__builtin_amdgcn_processor_is") {
+ auto GFX = dyn_cast<StringLiteral>(CE->getArg(0)->IgnoreParenCasts());
+ auto TID = TI.getTargetID();
+ if (GFX && TID) {
+ auto N = GFX->getString();
+ P = TI.isValidCPUName(GFX->getString()) && TID->find(N) == 0;
+ }
+ } else {
+ auto FD = cast<FunctionDecl>(CE->getArg(0)->getReferencedDeclOfCallee());
+
+ StringRef RF = Ctx.BuiltinInfo.getRequiredFeatures(FD->getBuiltinID());
+ llvm::StringMap<bool> CF;
+ Ctx.getFunctionFeatureMap(CF, FD);
+
+ P = Builtin::evaluateRequiredTargetFeatures(RF, CF);
+ }
+
+ return CXXBoolLiteralExpr::Create(Ctx, P, Ctx.BoolTy, CE->getExprLoc());
+}
+
ExprResult Sema::CreateBuiltinUnaryOp(SourceLocation OpLoc,
UnaryOperatorKind Opc, Expr *InputExpr,
bool IsAfterAmp) {
@@ -15753,6 +15818,8 @@ ExprResult Sema::CreateBuiltinUnaryOp(SourceLocation OpLoc,
// Vector logical not returns the signed variant of the operand type.
resultType = GetSignedVectorType(resultType);
break;
+ } else if (IsAMDGPUPredicateBI(InputExpr)) {
+ break;
} else {
return ExprError(Diag(OpLoc, diag::err_typecheck_unary_expr)
<< resultType << Input.get()->getSourceRange());
@@ -20469,6 +20536,88 @@ void Sema::DiagnoseEqualityWithExtraParens(ParenExpr *ParenE) {
}
}
+static bool ValidateAMDGPUPredicateBI(Sema &Sema, CallExpr *CE) {
+ if (CE->getDirectCallee()->getName() == "__builtin_amdgcn_processor_is") {
+ auto GFX = dyn_cast<StringLiteral>(CE->getArg(0)->IgnoreParenCasts());
+ if (!GFX) {
+ Sema.Diag(CE->getExprLoc(),
+ diag::err_amdgcn_processor_is_arg_not_literal);
+ return false;
+ }
+ auto N = GFX->getString();
+ if (!Sema.getASTContext().getTargetInfo().isValidCPUName(N) &&
+ (!Sema.getASTContext().getAuxTargetInfo() ||
+ !Sema.getASTContext().getAuxTargetInfo()->isValidCPUName(N))) {
+ Sema.Diag(CE->getExprLoc(),
+ diag::err_amdgcn_processor_is_arg_invalid_value) << N;
+ return false;
+ }
+ } else {
+ auto Arg = CE->getArg(0);
+ if (!Arg || Arg->getType() != Sema.getASTContext().BuiltinFnTy) {
+ Sema.Diag(CE->getExprLoc(),
+ diag::err_amdgcn_is_invocable_arg_invalid_value) << Arg;
+ return false;
+ }
+ }
+
+ return true;
+}
+
+static Expr *MaybeHandleAMDGPUPredicateBI(Sema &Sema, Expr *E, bool &Invalid) {
+ if (auto UO = dyn_cast<UnaryOperator>(E)) {
+ auto SE = dyn_cast<CallExpr>(UO->getSubExpr());
+ if (IsAMDGPUPredicateBI(SE)) {
+ assert(
+ UO->getOpcode() == UnaryOperator::Opcode::UO_LNot &&
+ "__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable "
+ "can only be used as operands of logical ops!");
+
+ if (!ValidateAMDGPUPredicateBI(Sema, SE)) {
+ Invalid = true;
+ return nullptr;
+ }
+
+ UO->setSubExpr(ExpandAMDGPUPredicateBI(Sema.getASTContext(), SE));
+ UO->setType(Sema.getASTContext().getLogicalOperationType());
+
+ return UO;
+ }
+ }
+ if (auto BO = dyn_cast<BinaryOperator>(E)) {
+ auto LHS = dyn_cast<CallExpr>(BO->getLHS());
+ auto RHS = dyn_cast<CallExpr>(BO->getRHS());
+ if (IsAMDGPUPredicateBI(LHS) && IsAMDGPUPredicateBI(RHS)) {
+ assert(
+ BO->isLogicalOp() &&
+ "__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable "
+ "can only be used as operands of logical ops!");
+
+ if (!ValidateAMDGPUPredicateBI(Sema, LHS) ||
+ !ValidateAMDGPUPredicateBI(Sema, RHS)) {
+ Invalid = true;
+ return nullptr;
+ }
+
+ BO->setLHS(ExpandAMDGPUPredicateBI(Sema.getASTContext(), LHS));
+ BO->setRHS(ExpandAMDGPUPredicateBI(Sema.getASTContext(), RHS));
+ BO->setType(Sema.getASTContext().getLogicalOperationType());
+
+ return BO;
+ }
+ }
+ if (auto CE = dyn_cast<CallExpr>(E))
+ if (IsAMDGPUPredicateBI(CE)) {
+ if (!ValidateAMDGPUPredicateBI(Sema, CE)) {
+ Invalid = true;
+ return nullptr;
+ }
+ return ExpandAMDGPUPredicateBI(Sema.getASTContext(), CE);
+ }
+
+ return nullptr;
+}
+
ExprResult Sema::CheckBooleanCondition(SourceLocation Loc, Expr *E,
bool IsConstexpr) {
DiagnoseAssignmentAsCondition(E);
@@ -20480,6 +20629,14 @@ ExprResult Sema::CheckBooleanCondition(SourceLocation Loc, Expr *E,
E = result.get();
if (!E->isTypeDependent()) {
+ if (E->getType()->isVoidType()) {
+ bool IsInvalidPredicate = false;
+ if (auto BIC = MaybeHandleAMDGPUPredicateBI(*this, E, IsInvalidPredicate))
+ return BIC;
+ else if (IsInvalidPredicate)
+ return ExprError();
+ }
+
if (getLangOpts().CPlusPlus)
return CheckCXXBooleanCondition(E, IsConstexpr); // C++ 6.4p4
diff --git a/clang/test/CodeGen/amdgpu-builtin-cpu-is.c b/clang/test/CodeGen/amdgpu-builtin-cpu-is.c
new file mode 100644
index 0000000000000..6e261d9f5d239
--- /dev/null
+++ b/clang/test/CodeGen/amdgpu-builtin-cpu-is.c
@@ -0,0 +1,65 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals all --version 5
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx900 -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCN-GFX900 %s
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx1010 -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCN-GFX1010 %s
+// RUN: %clang_cc1 -triple spirv64-amd-amdhsa -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCNSPIRV %s
+
+// Test that, depending on triple and, if applicable, target-cpu, one of three
+// things happens:
+// 1) for gfx900 we emit a call to trap (concrete target, matches)
+// 2) for gfx1010 we emit an empty kernel (concrete target, does not match)
+// 3) for AMDGCNSPIRV we emit llvm.amdgcn.is.gfx900 as a bool global, and
+// load from it to provide the condition a br (abstract target)
+//.
+// AMDGCN-GFX900: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 600
+//.
+// AMDGCN-GFX1010: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 600
+//.
+// AMDGCNSPIRV: @llvm.amdgcn.is.gfx900 = external addrspace(1) externally_initialized constant i1
+//.
+// AMDGCN-GFX900-LABEL: define dso_local void @foo(
+// AMDGCN-GFX900-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-GFX900-NEXT: [[ENTRY:.*:]]
+// AMDGCN-GFX900-NEXT: call void @llvm.trap()
+// AMDGCN-GFX900-NEXT: ret void
+//
+// AMDGCN-GFX1010-LABEL: define dso_local void @foo(
+// AMDGCN-GFX1010-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-GFX1010-NEXT: [[ENTRY:.*:]]
+// AMDGCN-GFX1010-NEXT: ret void
+//
+// AMDGCNSPIRV-LABEL: define spir_func void @foo(
+// AMDGCNSPIRV-SAME: ) addrspace(4) #[[ATTR0:[0-9]+]] {
+// AMDGCNSPIRV-NEXT: [[ENTRY:.*:]]
+// AMDGCNSPIRV-NEXT: [[TMP0:%.*]] = load i1, ptr addrspace(1) @llvm.amdgcn.is.gfx900, align 1
+// AMDGCNSPIRV-NEXT: br i1 [[TMP0]], label %[[IF_THEN:.*]], label %[[IF_END:.*]]
+// AMDGCNSPIRV: [[IF_THEN]]:
+// AMDGCNSPIRV-NEXT: call addrspace(4) void @llvm.trap()
+// AMDGCNSPIRV-NEXT: br label %[[IF_END]]
+// AMDGCNSPIRV: [[IF_END]]:
+// AMDGCNSPIRV-NEXT: ret void
+//
+void foo() {
+ if (__builtin_cpu_is("gfx90...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
Very cool, in general I'm a fan of being able to use LLVM-IR as a more general target. We already hack around these things in practice, so I think it's only beneficial to formalize it in a more correct way, even if LLVM-IR wasn't 'strictly' intended to be this kind of serialization format.
  // AMDGCNSPIRV-NEXT:    ret void
  //
  void foo() {
    if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_permlanex16))
Is this intended to handle builtins that require certain target features to be set?
Yes.
Could we get a test? Something simple like +dpp?
> Could we get a test? Something simple like +dpp?
Sure, but if possible, could you clarify what you would like to be tested / what you expect to see, so that we avoid churning.
The issue with how the ROCm device libs do it is that certain builtins require target features to be used. It hacks around this with __attribute__((target)). I just want to know that you can call a builtin that requires +dpp features without that.
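As a reader's illustration (not part of the patch), the kind of guard such a test could check might look like the sketch below; __builtin_amdgcn_mov_dpp8 is used only as an assumed stand-in for a feature-gated DPP builtin, and the exact builtin/feature pairing should be checked against BuiltinsAMDGPU.def:

  // Sketch: the guarded call should compile without tagging the enclosing
  // function with __attribute__((target("dpp"))) or similar.
  void guarded_dpp(int x, int* out) {
    if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_mov_dpp8))
      // Reached only when the finally-resolved target can execute the builtin;
      // the DPP8 selector immediate (1) is arbitrary for illustration.
      *out = __builtin_amdgcn_mov_dpp8(x, 1);
  }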
This is worth a release note item.
Indeed! I botched moving the changes from my internal scratchpad, and the rel notes got lost; fixing.
  a functional mechanism for programatically querying:

  * the identity of the current target processor;
  * the capability of the current target processor to invoke a particular builtin.
  ``__amdgpu_feature_predicate_t`` type behaves as an opaque, forward declared
  type with conditional automated conversion to ``_Bool`` when used as the
We should have a test case that stuff like this is diagnosed as using an incomplete type?

  typeof(__builtin_amdgcn_processor_is("gfx900")) what;

Similar in C++ with decltype?
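A rough sketch of such a diagnostic test, assuming the predicate type is surfaced as an opaque/incomplete type; the expected-error annotations are illustrative, not the exact wording Clang emits:

  void probe_type() {
    // Both declarations should be rejected, since the predicate's type cannot
    // be used to declare objects.
    typeof(__builtin_amdgcn_processor_is("gfx900")) c_style;                         // expected-error
    decltype(__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var)) cxx_style; // expected-error
  }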
clang/lib/Sema/SemaExpr.cpp
Outdated

  // without any additional checking.
  if (Fn->getType() == Context.BuiltinFnTy && ArgExprs.size() == 1 &&
      ArgExprs[0]->getType() == Context.BuiltinFnTy) {
    auto *FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());
Suggested change:
-    auto *FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());
+    const auto *FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());
clang/lib/Sema/SemaInit.cpp
Outdated

  // __amdgpu_feature_predicate_t can be explicitly cast to the logical op
  // type, although this is almost always an error and we advise against it
Suggested change:
-  // __amdgpu_feature_predicate_t can be explicitly cast to the logical op
-  // type, although this is almost always an error and we advise against it
+  // __amdgpu_feature_predicate_t can be explicitly cast to the logical op
+  // type, although this is almost always an error and we advise against it.
I'm a bit confused; the comment says it's allowed but implies we should at least warn on it. But we're emitting an error diagnostic for that?
The error is trying to do this without a cast, which is what is being tested (this is the same behaviour one gets around a type that has an explicit conversion operator for bool). The error message is trying to inform the user that if they REALLY want to do this, they can, but they have to explicitly cast via C-cast or C++ static_cast; however, they probably should not, as it is almost always going to lead to an error. I am happy to reword the error message / add a warning on casting from __amdgpu_feature_predicate_t to bool / _Bool.
Ah, I see the logic now, I just didn't read hard enough. :-D
So why would the explicit cast almost always lead to an error given that we're doing the cast implicitly for them anyway?
> Ah, I see the logic now, I just didn't read hard enough. :-D
> So why would the explicit cast almost always lead to an error given that we're doing the cast implicitly for them anyway?

Context:) Since we're modelling this after a type with an explicit operator bool(), which is not regular (no default, copy or move ctors), only contextual conversions are valid, i.e. (off the top of my head, I'd have to look up the exact reference in the standard):

- controlling expression for if, while, for;
- first operand of the ternary / conditional operator ?;
- operands of built-in logical operators;
- some constexpr / consteval cases that do not apply.

So, we do want if (__builtin_amdgcn_processor_is("gfx900")) to just work as typed, as this is an intended usage pattern (user checks for a particular target, then does something that is target specific on the true branch).

However, if your expression is not contextually convertible to bool / _Bool, you are probably walking into trouble by doing something potentially nefarious like this:

  void foo(const std::vector<bool>& ps) {
    // do stuff based on indexing into the predicate vector, e.g.
    // assume that the 0th element holds gfx900, the 1st holds
    // gfx901 etc.
  }

  void bar() {
    std::vector<bool> ps;
    ps.push_back(__builtin_amdgcn_processor_is("gfx900"));
    ps.push_back(__builtin_amdgcn_processor_is("gfx901"));
  }

which might make a lot of internal sense for the client app, and looks pretty harmless, but is not something we can support in the limit (once it's buried under 5 additional layers of indirection, at -O0 etc.), and which is likely to end up a bug farm. These really do have sharp edges, so the intention is to be defensive by default and ensure that it's very hard to step on a landmine.
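As a reader's aside, the analogy above can be reproduced in plain ISO C++ without any builtins; FeaturePredicate below is just an illustrative stand-in, not the actual __amdgpu_feature_predicate_t implementation:

  struct FeaturePredicate {                            // stand-in for __amdgpu_feature_predicate_t
    explicit operator bool() const { return true; }    // contextual conversions only
  };

  FeaturePredicate query_target() { return {}; }

  void demo(int* p, int x) {
    if (query_target()) *p += x;                       // OK: contextual conversion
    bool both = query_target() && !query_target();     // OK: operands of built-in logical ops
    // bool stored = query_target();                   // ill-formed: no implicit conversion
    bool forced = static_cast<bool>(query_target());   // compiles, but bypasses the guard rails
    (void)both; (void)forced;
  }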
I think I'm still missing something. If

  if (__builtin_amdgcn_processor_is("gfx900"))
    ps.push_back(true);

works.... why would

  ps.push_back(__builtin_amdgcn_processor_is("gfx900"));

be dangerous?

> but is not something we can support in the limit (once it's buried under 5 additional layers of indirection, at -O0 etc.)

Why would optimization modes matter?

Apologies if these are dumb questions on my part. :-)
Let's take these piecewise. Your first example actually works / those are equivalent. I think the danger here is assuming that the sort of easy to type examples we are playing with are representative of where issues show up - they are not. The cases where things break down are somewhat more intricate - I chose pushing into a container via a function with side-effects on purpose.

I suspect that the sense that something is tied to optimiser behaviour is due to my reply above, which perhaps was insufficiently clear - apologies. I was trying to explain why making it trivial to store these as booleans somewhere leads to having to run large parts of the optimisation pipeline. There is no dependence on the optimiser; O0 and Ox behave in the same way in what regards the predicates, because we have a dedicated pass that unconditionally runs early in the pipeline, irrespective of optimisation level, and either succeeds at the needed folding or fails and diagnoses.

The __has_builtin counter-example actually does not work and cannot work, please see: https://gcc.godbolt.org/z/7G5Y1d85b. In fact, it's the essence of why we need these, the fact that that pattern does not work and cannot work, and yet it is extremely useful. Those situations are materially different because:

- this is not about calling some generic omni-available code, it's about calling target specific code - this has to be statically decided on in the compiler, we MUST know if the target can run it or not, which is why this is a target specific BI;
- furthermore, the ...later... bit is pretty important: what happens on that path? do you pass the boolean by reference into an extern function which gets linked in at run time (i.e. no idea what it does)? do you mutate the value based on a run time value? if you do any of those, your distant has_builtin variable no longer reflects the predicate, which is the issue;
- the answer to the above bit might be "make it constexpr" - sure, but then it rolls back into not working for abstract targets / resolving these late, which is the gap in functionality with things like the __has_builtin macro that these try to fill.

I think the default expectation is that you should be able to query the processor information at any point you want, store the results anywhere you want, and use them later with the expected semantics - I don't think this is actually the case, unless what you are thinking about is __builtin_cpu_is, which is a different mechanism that operates at execution time.

Overall, this might be less profound / convoluted than we've made it seem:

- use the predicates as intended, things work;
- explicitly cast to bool and then stash:
  a) if the chain formed from the point of cast to the final point of use can be folded in a terminator, for all uses of the cast, happy days;
  b) if for a chain from point of cast to final point of use folding fails (because you passed your value to an opaque function, modified it based on a run time value etc.), you get an error and a diagnostic.

This is independent from optimisation level, and essentially matches what you would have to do with __has_builtin as well (except you'd have to make the stashed variable constexpr and then make the control structure be something like if constexpr).
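To make cases a) and b) above concrete, a hedged sketch follows; it assumes the revised __amdgpu_feature_predicate_t semantics being discussed (explicit cast allowed), and opaque_sink is a hypothetical function defined in another translation unit:

  void case_a(int x) {
    // a) The cast and all of its uses fold away once the concrete target is known.
    const bool can_sleep_var =
        static_cast<bool>(__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var));
    if (can_sleep_var)
      __builtin_amdgcn_s_sleep_var(x);
  }

  extern void opaque_sink(bool*);  // hypothetical: the compiler cannot see its body

  void case_b() {
    // b) The stored value escapes, so the predicate cannot be folded and the
    //    late-resolution pass is expected to diagnose this.
    bool p = static_cast<bool>(__builtin_amdgcn_processor_is("gfx1201"));
    opaque_sink(&p);
  }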
> I suspect that the sense that something is tied to optimiser behaviour is due to my reply above, which perhaps was insufficiently clear - apologies.

No worries, this is complex stuff! I appreciate your willingness to talk me through it. :-)

> The __has_builtin counter-example actually does not work and cannot work, please see: https://gcc.godbolt.org/z/7G5Y1d85b.

I cannot imagine a situation in which this isn't indicative of a bug, but perhaps this situation is the same one that necessitated this PR which eventually concluded that we should change the behavior of __has_builtin rather than introduce a new builtin.

> furthermore, the ...later... bit is pretty important: what happens on that path?

Anything in the world besides changing the value of has_builtin to something other than what __has_builtin returned.

> if you do any of those, your distant has_builtin variable no longer reflects the predicate, which is the issue;

Why is that an issue? If the variable no longer reflects the predicate, that's not on the compiler to figure out how to deal with, that's "play silly games, win silly prizes".

Backing up a step.. my expectation is that this eventually lowers down to a test and jump which jumps past the target code if the test fails. e.g.,

  %0 = load i8, ptr %b, align 1
  %loadedv = trunc i8 %0 to i1
  br i1 %loadedv, label %if.then, label %if.end

  if.then:
    # the target-specific instructions live here
    br label %if.end

  if.end:
    ret void

So we'd be generating instructions for the target which may be invalid if the test lies. If something did change that value so it no longer represents the predicate, I think that's UB (and we could help users catch that UB via a sanitizer check if we wanted to, rather than try to make the backend have to try to figure it out at compile time).

> if for a chain from point of cast to final point of use folding fails (because you passed your value to an opaque function, modified it based on a run time value etc.), you get an error and a diagnostic.

I was thinking you would not get a diagnostic; you'd get the behavior you asked for, which may be utter nonsense.

Am I missing something still? If so, maybe it would be quicker for us to hop in a telecon call? I'm going to be out of the office until Monday, but I'm happy to meet with you if that's more productive.
> > The __has_builtin counter-example actually does not work and cannot work, please see: https://gcc.godbolt.org/z/7G5Y1d85b.
>
> I cannot imagine a situation in which this isn't indicative of a bug, but perhaps this situation is the same one that necessitated this PR which eventually concluded that we should change the behavior of __has_builtin rather than introduce a new builtin.

This is not actually a bug, it's intended behaviour. To obtain what you expect the b would have to be constexpr, and then the if itself would have to be if constexpr. Otherwise there's no binding commitment to evaluate this at compile time (and, in effect, if this gets trivially evaluated / removed in the FE, it induces dependence on optimisation level).

> > furthermore, the ...later... bit is pretty important: what happens on that path?
>
> Anything in the world besides changing the value of has_builtin to something other than what __has_builtin returned.
>
> > if you do any of those, your distant has_builtin variable no longer reflects the predicate, which is the issue;
>
> Why is that an issue? If the variable no longer reflects the predicate, that's not on the compiler to figure out how to deal with, that's "play silly games, win silly prizes".

It is a difficult conversation to have and not exactly what users want to hear, so making it as hard as possible to end up in an exchange where you have to say "welp, that was a silly game" cannot hurt. If anything, it's compassionate behaviour!

> Backing up a step.. my expectation is that this eventually lowers down to a test and jump which jumps past the target code if the test fails. e.g.,
>
>   %0 = load i8, ptr %b, align 1
>   %loadedv = trunc i8 %0 to i1
>   br i1 %loadedv, label %if.then, label %if.end
> if.then:
>   # the target-specific instructions live here
>   br label %if.end
> if.end:
>   ret void
>
> So we'd be generating instructions for the target which may be invalid if the test lies. If something did change that value so it no longer represents the predicate, I think that's UB (and we could help users catch that UB via a sanitizer check if we wanted to, rather than try to make the backend have to try to figure it out at compile time).

This cannot work reliably (e.g. there are instructions that would simply fail at ISEL, and a run time jump doesn't mean that you do not lower to ISA the jumped around block), and introducing dependence on sanitizers seems not ideal. Furthermore, a run time jump isn't free, which is a concern for us, and we also already have a mechanism for that case (__attribute__((target))). Note that these can also control e.g. resource allocation, so actually generating both might lead to arbitrary exhaustion of a limited resource, and spurious compilation failures; consider e.g. (I'll use CUDA/HIP syntax):
// This is a bit odd, and technically a race because multiple lanes write to shared_buf
void foo() {
__shared__ int* shared_buf;
if (__builtin_amdgcn_processor_is("gfx950") {
__shared__ int buf[70 * 1024];
shared_buf = buf;
} else {
__shared__ int buf[60 * 1024];
shared_buf = buf;
}
__syncthreads();
// use shared_buf
If we tried to lower that we'd exhaust LDS, and spuriously fail to compile. This would have originated from perfectly valid uses of #if defined(__gfx950__) #else
. We'd like these to work, so we must unambiguously do the fold ourselves.
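For reference, the preprocessor form alluded to above might look roughly like this (a sketch in the same HIP-style syntax; it relies on the per-target __gfx950__ macro, so it only works when the concrete target is known at the time Clang runs):

void foo() {
#if defined(__gfx950__)
  __shared__ int buf[70 * 1024];
#else
  __shared__ int buf[60 * 1024];
#endif
  // use buf; only the branch selected by the preprocessor ever reaches codegen
}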
if for a chain from point of cast to final point of use folding fails (because you passed your value to an opaque function, modified it based on a run time value etc.), you get an error and a diagnostic.

I was thinking you would not get a diagnostic; you'd get the behavior you asked for, which may be utter nonsense.
One of the difficulties here (ignoring that the utter nonsense behaviour at run time might be nasal demons - GPUs aren't always as polite as to issue a SIGILL and graciously die :)) is that not all constructs / IR sequences / ASM uses lower into ISA, so what the user is more likely to get is an ICE with an error that makes no sense unless they work on LLVM. That's fairly grim user experience, IMHO, and one that we have the ability to prevent.
Am I missing something still? If so, maybe it would be quicker for us to hop in a telecon call? I'm going to be out of the office until Monday, but I'm happy to meet with you if that's more productive.
I would be absolutely happy to if you think it'd help. I regret not coming to the Sofia meeting, we could've probably sorted this out directly with a laptop:)
The __has_builtin counter-example actually does not work and cannot work, please see: https://gcc.godbolt.org/z/7G5Y1d85b.

I cannot imagine a situation in which this isn't indicative of a bug, but perhaps this situation is the same one that necessitated this PR which eventually concluded that we should change the behavior of __has_builtin rather than introduce a new builtin.

This is not actually a bug, it's intended behaviour. To obtain what you expect the b would have to be constexpr, and then the if itself would have to be if constexpr. Otherwise there's no binding commitment to evaluate this at compile time (and, in effect, if this gets trivially evaluated / removed in the FE, it induces dependence on optimisation level).
I... am an idiot. :-D Sorry, I think I must have been braindead when I wrote that because you're exactly correct. Sorry for the noise!
Backing up a step.. my expectation is that this eventually lowers down to a test and jump which jumps past the target code if the test fails. e.g.,
%0 = load i8, ptr %b, align 1
%loadedv = trunc i8 %0 to i1
br i1 %loadedv, label %if.then, label %if.end

if.then:
  # the target-specific instructions live here
  br label %if.end

if.end:
  ret void
So we'd be generating instructions for the target which may be invalid if the test lies. If something did change that value so it no longer represents the predicate, I think that's UB (and we could help users catch that UB via a sanitizer check if we wanted to, rather than try to make the backend have to try to figure it out at compile time).
This cannot work reliably (e.g. there are instructions that would simply fail at ISEL, and a run time jump doesn't mean that you do not lower to ISA the jumped around block), and introducing dependence on sanitizers seems not ideal. Furthermore, a run time jump isn't free, which is a concern for us, and we also already have a mechanism for that case (__attribute__((target))). Note that these can also control e.g. resource allocation, so actually generating both might lead to arbitrary exhaustion of a limited resource, and spurious compilation failures, consider e.g. (I'll use CUDA/HIP syntax):

// This is a bit odd, and technically a race because multiple lanes write to shared_buf
void foo() {
  __shared__ int* shared_buf;
  if (__builtin_amdgcn_processor_is("gfx950")) {
    __shared__ int buf[70 * 1024];
    shared_buf = buf;
  } else {
    __shared__ int buf[60 * 1024];
    shared_buf = buf;
  }
  __syncthreads();
  // use shared_buf
}

If we tried to lower that we'd exhaust LDS, and spuriously fail to compile. This would have originated from perfectly valid uses of #if defined(__gfx950__) ... #else. We'd like these to work, so we must unambiguously do the fold ourselves.
Okay, so the situation is different than what I expected. I was unaware this would cause ISEL failures.
if for a chain from point of cast to final point of use folding fails (because you passed your value to an opaque function, modified it based on a run time value etc.), you get an error and a diagnostic.

I was thinking you would not get a diagnostic; you'd get the behavior you asked for, which may be utter nonsense.

One of the difficulties here (ignoring that the utter nonsense behaviour at run time might be nasal demons - GPUs aren't always as polite as to issue a SIGILL and graciously die :)) is that not all constructs / IR sequences / ASM uses lower into ISA, so what the user is more likely to get is an ICE with an error that makes no sense unless they work on LLVM. That's fairly grim user experience, IMHO, and one that we have the ability to prevent.
Yeah, we obviously don't want the user experience to be compiler crashes. :-)
Am I missing something still? If so, maybe it would be quicker for us to hop in a telecon call? I'm going to be out of the office until Monday, but I'm happy to meet with you if that's more productive.
I would be absolutely happy to if you think it'd help. I regret not coming to the Sofia meeting, we could've probably sorted this out directly with a laptop:)
FWIW, I'm still pretty uncomfortable about this design. I keep coming back to this feeling really novel and seeming like it's designed to work around backend issues. If the user did something like this:
void func(std::vector<bool> processor_features) {
  if (processor_features[12]) { // SSE3 is allowed
    __asm__ ("do a bunch of sse3 stuff");
  } else {
    // Do slow fallback stuff
  }
}
they would reasonably expect the inline assembly to be non-problematic even if sse3 isn't available. But... when I try to play silly games in practice, we assert: https://godbolt.org/z/xc13WhW4W and so maybe I'm just wrong. CC @nikic for more opinions.
As for meeting to discuss, are you free sometime this week? I'm on US East Coast time, what times typically work best for you?
From the bottom up, anything but Friday should be good, including today starting from now to now + 6 hours:) I'm in the UK, so the delta is not so large anyway, pick something that fits your schedule and I'll probably be able to make it work.
For your example at the bottom, the ASM is non-problematic in that it goes through. Now substitute it with a builtin that is only there iff SSE3 is available, or try to bind registers from the extended x86_64 set and compile for an x86 target, and it'll go back to failing at compile time. It's that latter part that is problematic even for the user's experience.
I suspect that part of the issue here is that something like X86 hides a lot of this stuff under normal circumstances because folks don't really normally grab for special functionality, or feel the need for it. But if we have a look at the many SIMD extension sets, as well as the attempts at defining various levels of capability (the v1, v2, v3 things), I think the same challenge exists there, it's just not an immediate concern. What we're trying to solve is that whilst it makes perfect sense for a BE, any BE, to bind very tightly to the target, it is sometimes beneficial for the IR coming out of the FE to be generic and usable by many targets, without loss of capability. Without a mechanism like the one here one is either degraded to lowest common denominator capability, or has to play games trying to define capability levels, which generally end up being too coarse.
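For contrast, a sketch of how such dispatch is typically written on x86 today, where the query is a genuine run-time check and both paths get emitted (this uses the existing __builtin_cpu_supports builtin and target attribute; the function names are made up for illustration):

__attribute__((target("avx2")))
static void scale_avx2(float *dst, const float *src, int n) {
  for (int i = 0; i < n; ++i)  // body may be auto-vectorised with AVX2
    dst[i] = src[i] * 2.0f;
}

static void scale_generic(float *dst, const float *src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i] * 2.0f;
}

void scale(float *dst, const float *src, int n) {
  if (__builtin_cpu_supports("avx2"))  // resolved at run time, never folded away
    scale_avx2(dst, src, n);
  else
    scale_generic(dst, src, n);
}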
Also, please note that, in spite of me mentioning x86, at this point we are not proposing this for general use, but rather as a target specific BI, which hopefully reduces risk / contains any perceived novelty to parts where it's already been found to be useful:)
@@ -12015,6 +12017,16 @@ static void DiagnoseBadConversion(Sema &S, OverloadCandidate *Cand,
  if (TakingCandidateAddress && !checkAddressOfCandidateIsAvailable(S, Fn))
    return;

  // __amdgpu_feature_predicate_t can be explicitly cast to the logical op type,
  // although this is almost always an error and we advise against it.
  if (FromTy == S.Context.AMDGPUFeaturePredicateTy &&
Same here as above regarding the comment not seeming to match the behavior.
I'm generally very unhappy about any kind of functionality that can cause compilation failures either because the optimizer did not optimize enough (including at O0) or because it optimized too much (producing code patterns that are no longer recognized as trivially dead).
      continue;
    if (G.getName().starts_with("llvm.amdgcn."))
      Predicates.push_back(&G);
  }
This needs to be represented using an intrinsic instead of magic globals. Otherwise transforming load @g into %phi = phi [ @g ]; load %phi becomes an invalid transform.
}

std::pair<PreservedAnalyses, bool> handlePredicate(const GCNSubtarget &ST,
                                                   GlobalVariable *P) {
I think this per-predicate handling is going to break if two predicates get combined into a logical and at the IR level? When the first one is handled it will leave an unfoldable user, which would be foldable if both are handled.
This is true, but I think that this'd entail something like SCEV or InstructionSimplify running before the predicate expansion pass, which shouldn't happen with how we are ordering it. I might be missing some obvious case though, so if you have something in mind please share.
Possibly I'm misunderstanding what the pipeline here looks like. My assumption was that you have something like this going on:
clang generates IR -> compilation 1 without known target -> compilation 2 with known target
Where the predicates are expanded at the start of compilation 2, but compilation 1 could have arbitrarily optimized the IR before that.
If the resolution always happens immediately on the clang-generated IR, then I don't understand the purpose of the feature (as compared to always resolving in the frontend, that is).
Oh, this is a good question, it's probably gotten lost in the lengthy conversation. We have two cases, let me try to clarify:
- We are targeting a concrete gfx### target, for which the features and capabilities are fully known at compile time / we know what we are lowering for -> the predicates get expanded and resolved in the FE, they never reach codegen / get emitted in IR;
- We are targeting amdgcnspirv, which is abstract and for which the actual concrete target is only known at run time i.e. there's a lack of information / temporal decoupling:
  - the predicates allow one to write code that adapts to the capabilities of the actual target that the code will execute on;
  - we only know the target once we resume compilation for the concrete target, hence the need to emit them in IR, and then expand.
The ultimate state of affairs (not there yet due to historical issues / ongoing work) is that for the 2nd case the IR we generate SPIRV from is directly the pristine Clang output (+transforms needed for SPIRV, which do not impact these), so when we resume compilation at run time, it's on un-optimised FE-output IR. Furthermore, the expansion pass runs unconditionally, and is independent from optimisation level (which also implies it needs to be better about cleaning after itself, which I still owe an answer for). Hopefully that helps / makes some degree of sense?
So to clarify, optimizations will never be applied during the compilation to amdgcnspirv? If that's the case, I guess it's not likely that IR will be transformed in problematic ways.
It did occur to me that a way to guarantee that the folding works is by using a callbr intrinsic, something like this:
callbr void @llvm.amdgcn.processor.is(metadata "gfx803") to label %unsupported [label %supported]
This would make the check fundamentally inseparable from the control flow.
But I guess you'd have trouble round-tripping that via SPIRV...
So to clarify, optimizations will never be applied during the compilation to amdgcnspirv? If that's the case, I guess it's not likely that IR will be transformed in problematic ways.
Yes, this is the intention, it is still ongoing work - empirically we are not running into any of the potential issues you brought up, which is why I went ahead with upstreaming this part which is fairly important for library work (hard to author high-performance generic libs without this sort of mechanism). By the end of this year we should end up generating SPIRV from Clang's LLVMIR output, with no optimisations applied.
It did occur to me that a way to guarantee that the folding works is by using a callbr intrinsic, something like this:
callbr void @llvm.amdgcn.processor.is(metadata "gfx803") to label %unsupported [label %supported]

This would make the check fundamentally inseparable from the control flow.
But I guess you'd have trouble round-tripping that via SPIRV...
Ah, I actually hadn't thought of that but having had a glance yes, it's difficult to round trip. Something to consider in the future and if / when we try to make this generic rather than target specific, if there is interest.
    } else if (I->isTerminator() && ConstantFoldTerminator(I->getParent())) {
      continue;
    } else if (I->users().empty()) {
      continue;
I don't understand this case? Wouldn't this mean that doing something like store i1 %predicate, ptr %somewhere would count as "folded"?
This is ill-formed (and also somewhat vestigial; a consequence of this being a bit long in the tooth), thank you for spotting it! The intention was to allow things like implicitly inserted llvm.assumes / prevent them from causing spurious failures. However, as you point out, this was wrong.
    return PreservedAnalyses::all();

  const auto &ST = TM.getSubtarget<GCNSubtarget>(
      *find_if(M, [](auto &&F) { return !F.isIntrinsic(); }));
This seems to be assuming the same subtarget for all functions? Does amdgpu not support target-features at all?
It does, but the (gfxSMTH) target is uniform per compilation. The mechanism is roundabout but there's no other convenient way to query this information, at least that I am aware of.
It's not convenient, but you should evaluate this in each individual function context. Really most of the targets should have been defined as full targets, not subtargets
in the real world the subtarget features for xnack may still differ between functions in a module
    auto *I = *ToFold.begin();
    ToFold.erase(I);

    if (auto *C = ConstantFoldInstruction(I, P->getDataLayout())) {
Do I understand correctly that this is relying on more optimization to happen afterwards for correctness, including at O0? We need the unreachable blocks to be DCEd, and any now unused functions to be DCEd, etc, otherwise we may get isel failures?
As regards unreachable BBs, this looks the way it does because I hadn't fully considered the implications, and because my understanding is that we (LLVM, not AMDGPU) unconditionally run 'UnreachableBlockElimPass', irrespective of optimisation level. I think the latter is not incorrect, and that there is at least one other transform ('LowerInvokePass') that creates unreachable BBs and leaves them around. Having said that, it's not very hygienic and I will add cleanup for unreachable BBs.
With functions it's a bit trickier, and can actually get into somewhat convoluted use cases, which these predicates, as low-level target specific things, are not meant for. To be more specific, with normal use one would expect that for any and all functions the user would've applied predicates locally at the finest possible granularity i.e.
// THIS
void foo() {
  if (__builtin_processor_is("gfx900"))
    do_something();
  else if (__builtin_is_invocable(__builtin_amdgcn_some_builtin))
    __builtin_amdgcn_some_builtin();
}

void bar() {
  foo();
}

// NOT THIS
void foo() {
  do_something_that_only_works_on_gfx900_no_guard();
  __builtin_amdgcn_only_gfx900();
}

void bar() {
  if (__builtin_processor_is("gfx900"))
    foo();
}
If the guards are granular at expression / block scope at most, then there's no need to remove unused functions as they'd have been "cleaned up", for lack of a better word. I do appreciate that that is not an entirely satisfactory answer. I would lightly argue that since the second case is an anti-pattern (imagine these are proper large functions), it failing at compile time during ISEL is not that bad / an opportunity to not write it in the first place. Having said that, here's how we could handle functions:
- We could remove functions with internal linkage, iff they end up unused after predicate expansion, as that implies that their only uses were predicate guarded;
- We cannot do this for functions with external linkage (using internal and external loosely here), as they might have other valid uses in other TUs;
- What we can do for the latter is:
  - Tag (metadata / attribute) when running the predicate expansion makes a previously used function unused;
  - Add an UnreachableFuncElimPass which unconditionally runs right before ISEL, and removes functions iff they are unused and carry the tag;
  - We can only do this for AMDGPU since at the moment we do not do dynamic linking.
Dealing with the first category is straightforward, I could add it now or in a follow-up patch (I am not entirely sure that we do not already remove these unconditionally before ISEL anyway, the AMDGPU opt pipeline is fairly voluminous).
Fortunately, that wouldn't be the case here, I don't think, unless you have something specific in mind (aside from the inquiry about what happens with now inaccessible blocks / dead functions, which I'll address where it was asked).
We briefly discussed this in the clang area team meeting, and we weren't really happy with the design as-is. The underlying idea behind the feature makes sense, but the semantics of the actual builtin are ugly: there's a loose connection between the condition check and the region of code it's guarding.

I spent a bit more time thinking about it after the meeting. Here's a potential alternative design: we add a new kind of if statement, a "processor-feature-if", spelled something like

In the case where the target features are known during clang codegen, lowering is easy: you just skip generating the bodies of the if statements that don't match. If you want some kind of "runtime" (actual runtime, or SPIR-V compilation-time) detection, it's not clear what the LLVM IR should look like: we only support specifying target features on a per-function level. But we can look at that separately.
Recognizing when the |
I mean, I'm not particularly attached to the syntax of the "if". I guess we could designate |
Yeah, I agree with the other parts of your design, enabling the builtins within the guarded statements is a great way to handle it. On a different point: I don't think this builtin is actually semantically different from We could allow |
We already use |
I don't quite see how to parse this statement to make it address the actual use case. These are useful because we cannot know, at the AST level (in the FE) which processor features are available. If we knew that we don't really need any additional mechanism, so this is just a different way to type |
The If you / other Clang owners are happy with extending |
Whilst I am thankful for the feedback I think it is somewhat unfortunate that we could not have a shared discussion about this, since I think that there are some core misunderstandings that seem to recur, which makes forward progress one way or the other difficult.
This has been considered, and doesn't quite address the use case (without ending up where the currently proposed design already is). Whilst this would have been significantly easier to discuss directly, I will try to enumerate the issues here:
As I have already mentioned in one of the replies to @rjmccall, this would be duplicating existing functionality, possibly in a more verbose and roundabout way. It is also already handled by what is being proposed, hence the awareness of it was present when the currently proposed design was put together. The interesting case is the second one, so, sadly, we cannot just look at that separately (and, IMHO, should not come up with novel IR constructs to solve this). If the core of the objection here is that Clang really doesn't like that we're doing semantic checking for these / they have too large a footprint for target specific builtins, I can always just delete those bits, have these return bool and that's that. It will force us to maintain an OOT delta, so it's not ideal, but if that's what it takes to make forward progress, it is what it is.
Right, I don't see any semantic reason why

Where exactly the folding is done doesn't seem like something that we need to have an opinion on at the language level. As long as we're not making it a constant expression (which would specifically force it to be folded by the frontend), folding in the frontend vs. folding in a later pass seems like an implementation detail that programmers don't need to care about. So if AMDGPU needs to fold in a pass, great, fold in a pass.
Alex, can you talk about why your design decides to check for specific builtins rather than building out the set of features supported by |
This would be up to the target to evaluate, as it'd have knowledge of whether the argument to the call is sufficient. When I initially implemented this, I added additional Slightly independently, |
Hmm. Well, you get to define what feature names you recognize in |
I went into it a bit above without having seen your question (race condition I guess:) ), but to have it in one spot:
Now, this is specific to AMDGPU, I don't want to speculate too much about how other targets deal with this - which is another reason for which these are target builtins rather than going for something more generic.
Let me add my few cents here.
Let me try to attempt to answer this question without introducing a new builtin in clang (at first). In SPIR-V there is specialization constant which AFAIK doesn't have a direct LLVM IR counterpart.
At runtime, when such SPIR-V module is JIT compiled OpSpecConstant materializes, so DCE (or better say some variation of DCE that is enforced to work with optnone) will be able to reason about %cmp result removing the dead branch, so we won't get unsupported feature at codegen. Problem is: how to generate such SPIR-V from clang. So my understanding, that the new builtin should eventually lowered (by SPIR-V backend?) to a construct like in the pseudo-code, though that is not what is currently happening. And I believe, that existing |
This is one possible implementation indeed, for a workflow that goes from SPIR-V to ISA, or chooses to do the DCE in SPIR-V. Due to having to compose with an existing mature toolchain, rather than starting fresh, we have a slightly different flow where we reverse translate to LLVM IR and "resume" compilation from that point. Hence, the implicitly inserted never to be emitted globals, which play the role the spec constants play in your example, when coupled with the dedicated predicate expansion pass. Something similar could be added to e.g. |
High likelihood that I'll need something similar for my GPU libraries, so I'd prefer something not explicitly tied to SPIR-V.
An intrinsic seems like the right IR model for CPU recognition, even for targets that don't specifically need to late-resolve it. That should be much easier for passes to optimize based on CPU settings than directly emitting the compiler-rt reference in the frontend. I know that generating IR with conservative target options and then bumping the target CPU in a pass is something various people have been interested in, so late optimization is specifically worth planning for here. We do have a theoretical problem with guaranteeing that non-matching code isn't emitted, because LLVM IR doesn't promise to leave a code sequence like this alone:
LLVM could theoretically complicate this by e.g. introducing a PHI or an |
The solution we went with here (for our use case) is to just run the predicate expansion pass over pristine Clang generated IR, before any other optimisation. I think that @nikic suggested an alternative based on |
We didn't really say much on the call itself; we just spent a minute while we were going through controversial RFCs/PRs, to call this out as something that needed attention. If you think this topic would benefit from a meeting, we can organize one... but maybe a 1-on-1 chat would be better to start with, just to make sure we're on the same page.
If you have a construct like the following:
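(The snippet referenced here appears to have been lost in extraction; judging by the description that follows, it was presumably something along these lines, with __builtin_amdgcn_gfx9000_specific_intrinsic being the same placeholder name used later in the thread.)

void f() {
  if (__builtin_amdgcn_processor_is("gfx9000"))
    __builtin_amdgcn_gfx9000_specific_intrinsic();  // statically guarded by the if

  __builtin_amdgcn_gfx9000_specific_intrinsic();    // not guarded
}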
We can tell, statically, that the first call is correctly guarded by an if statement: it's guaranteed it will never run on a non-gfx9000 processor. The second call, on the other hand, is not. So we can add a frontend rule: the first call is legal, the second is not. Obviously the error has false positives, in the sense that we can't actually prove the second call is incorrect at runtime... but that's fine, probably. What I don't want is that we end up with, essentially, the same constraint, but enforced by the backend.
Sure; we can't stop people from calling arbitrary pointers.
There are ways to solve this: for example, we can make the llvm.compiler.supports produce a token, and staple that token onto the intrinsics using a bundle. Making this work requires that IRGen knows which intrinsic calls are actually impacted... I care less about exactly how we solve this because we can adjust the solution later. Whatever we expose in the frontend is much harder to change later.
Definitely, more than happy to have a 1-on-1 (2-on-1 even, since I think @AaronBallman also suggested something along these lines as well :) ).
I will note that on concrete targets, what is being proposed already works as described, by virtue of it being an error to call a builtin that is not available. Having said that, this gives me some trepidation and I think it can end up being user adverse. Consider the following case:

void foo() {
  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_gfx9000_specific_intrinsic))
    __builtin_amdgcn_gfx9000_specific_intrinsic();
}

void bar() {
  if (__builtin_amdgcn_processor_is("gfx9000"))
    foo();
  foo();
}

We've just made the call to foo() illegal on anything that is not gfx9000, but that builtin / intrinsic could exist in 8999 other gfx versions. These don't always form binary, mutually exclusive structures. So I think I disagree with the "that's fine, probably".
Could you please detail why? Ultimately the BE still gets to decide on the legality of things that tweak it pretty intrinsically, even if said things come from otherwise linguistically correct constructs which have passed FE analysis. Also, we'd never really reach the BE, we're just sliding in immediately after Clang, before optimisation, so there's still enough info to provide a useful error message. Furthermore, this might be a better point to check anyways, as linking in bitcode could / should have already occurred, so what would otherwise have been external symbols that impact viability would now be satisfied.
Between making the wrong choice and going with something that's user adverse early on, then trying to build increasingly complicated mechanisms to make it work, I would prefer we just left these as target specific, low level builtins returning bool.
Please email me with some times that will work for you.
I... don't think I'm suggesting this? The fact that foo() is called from a __builtin_amdgcn_processor_is block shouldn't imply anything about other calls to foo(). What I'm basically suggesting is just exposing SPIR-V specialization constants as a C construct. Your example SPIR-V was something like:
We want to come up with a corresponding C construct that's guaranteed to compile to valid SPIR-V. My suggestion is something like:
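(The suggested construct itself also appears to have been dropped from the page; going by the earlier idea of designating a __builtin_amdgcn_processor_is call immediately inside an if as the marker, it was presumably shaped roughly like this, with hw_id_that_supports_feature as a placeholder.)

if (__builtin_amdgcn_processor_is("hw_id_that_supports_feature")) {
  // intrinsics that are only legal on hw_id_that_supports_feature go here
}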
In the body of the if statement, you can use whatever intrinsics are legal on hw_id_that_supports_feature.
Isn't doing checks immediately after IR generation basically the same as checking the AST, just on a slightly different representation?