Skip to content

[ARM,AArch64] Don't put BTI at asm goto branch targets #141562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 3, 2025

Conversation

statham-arm
Copy link
Collaborator

In 'asm goto' statements ('callbr' in LLVM IR), you can specify one or more labels / basic blocks in the containing function which the assembly code might jump to. If you're also compiling with branch target enforcement via BTI, then previously listing a basic block as a possible jump destination of an asm goto would cause a BTI instruction to be placed at the start of the block, in case the assembly code used an indirect branch instruction (i.e. to a destination address read from a register) to jump to that location. Now it doesn't do that any more: branches to destination labels from the assembly code are assumed to be direct branches (to a relative offset encoded in the instruction), which don't require a BTI at their destination.

This change was proposed in https://discourse.llvm.org/t/85845 and there seemed to be no disagreement. The rationale is:

  1. it brings clang's handling of asm goto in Arm and AArch64 in line with gcc's, which didn't generate BTIs at the target labels in the first place.

  2. it improves performance in the Linux kernel, which uses a lot of 'asm goto' in which the assembly language just contains a NOP, and the label's address is saved elsewhere to let the kernel self-modify at run time to swap between the original NOP and a direct branch to the label. This allows hot code paths to be instrumented for debugging, at only the cost of a NOP when the instrumentation is turned off, instead of the larger cost of an indirect branch. In this situation a BTI is unnecessary (if the branch happens it's direct), and since the code paths are hot, also a noticeable performance hit.

Implementation:

SelectionDAGBuilder::visitCallBr is the place where 'asm goto' target labels are handled. It calls setIsInlineAsmBrIndirectTarget() on each target MachineBasicBlock. Previously it also called setMachineBlockAddressTaken(), which made hasAddressTaken() return true, which caused a BTI to be added in the Arm backends.

Now visitCallBr doesn't call setMachineBlockAddressTaken() any more on asm goto targets, but hasAddressTaken() also checks the flag set by setIsInlineAsmBrIndirectTarget(). So call sites that were using hasAddressTaken() don't need to be modified. But the Arm backends don't call hasAddressTaken() any more: instead they test two more specific query functions that cover all the reasons hasAddressTaken() might have returned true except being an asm goto target.

Testing:

The new test AArch64/callbr-asm-label-bti.ll is testing the actual change, where it expects not to see a bti instruction after [[LABEL]]. The rest of the test changes are all churn, due to the flags on basic blocks changing. Actual output code hasn't changed in any of the existing tests, only comments and diagnostics.

Further work:

RISCVIndirectBranchTracking.cpp and X86IndirectBranchTracking.cpp also call hasAddressTaken() in a way that might benefit from using the same more specific check I've put in ARMBranchTargets.cpp and AArch64BranchTargets.cpp. But I'm not sure of that, so in this commit I've only changed the Arm backends, and left those alone.

In 'asm goto' statements ('callbr' in LLVM IR), you can specify one or
more labels / basic blocks in the containing function which the
assembly code might jump to. If you're also compiling with branch
target enforcement via BTI, then previously listing a basic block as a
possible jump destination of an asm goto would cause a BTI instruction
to be placed at the start of the block, in case the assembly code used
an _indirect_ branch instruction (i.e. to a destination address read
from a register) to jump to that location. Now it doesn't do that any
more: branches to destination labels from the assembly code are
assumed to be direct branches (to a relative offset encoded in the
instruction), which don't require a BTI at their destination.

This change was proposed in https://discourse.llvm.org/t/85845 and
there seemed to be no disagreement. The rationale is:

1. it brings clang's handling of asm goto in Arm and AArch64 in line
with gcc's, which didn't generate BTIs at the target labels in the
first place.

2. it improves performance in the Linux kernel, which uses a lot of
'asm goto' in which the assembly language just contains a NOP, and the
label's address is saved elsewhere to let the kernel self-modify at
run time to swap between the original NOP and a direct branch to the
label. This allows hot code paths to be instrumented for debugging, at
only the cost of a NOP when the instrumentation is turned off, instead
of the larger cost of an indirect branch. In this situation a BTI is
unnecessary (if the branch happens it's direct), and since the code
paths are hot, also a noticeable performance hit.

Implementation:

`SelectionDAGBuilder::visitCallBr` is the place where 'asm goto'
target labels are handled. It calls `setIsInlineAsmBrIndirectTarget()`
on each target `MachineBasicBlock`. Previously it also called
`setMachineBlockAddressTaken()`, which made `hasAddressTaken()` return
true, which caused a BTI to be added in the Arm backends.

Now `visitCallBr` doesn't call `setMachineBlockAddressTaken()` any
more on asm goto targets, but `hasAddressTaken()` also checks the flag
set by `setIsInlineAsmBrIndirectTarget()`. So call sites that were
using `hasAddressTaken()` don't need to be modified. But the Arm
backends don't call `hasAddressTaken()` any more: instead they test
two more specific query functions that cover all the reasons
`hasAddressTaken()` might have returned true _except_ being an asm
goto target.

Testing:

The new test `AArch64/callbr-asm-label-bti.ll` is testing the actual
change, where it expects not to see a `bti` instruction after
`[[LABEL]]`. The rest of the test changes are all churn, due to the
flags on basic blocks changing. Actual output code hasn't changed in
any of the existing tests, only comments and diagnostics.

Further work:

`RISCVIndirectBranchTracking.cpp` and `X86IndirectBranchTracking.cpp`
also call `hasAddressTaken()` in a way that might benefit from using
the same more specific check I've put in `ARMBranchTargets.cpp` and
`AArch64BranchTargets.cpp`. But I'm not sure of that, so in this
commit I've only changed the Arm backends, and left those alone.
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:ARM backend:AArch64 backend:PowerPC backend:X86 llvm:SelectionDAG SelectionDAGISel as well llvm:ir labels May 27, 2025
@llvmbot
Copy link
Member

llvmbot commented May 27, 2025

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-backend-aarch64

Author: Simon Tatham (statham-arm)

Changes

In 'asm goto' statements ('callbr' in LLVM IR), you can specify one or more labels / basic blocks in the containing function which the assembly code might jump to. If you're also compiling with branch target enforcement via BTI, then previously listing a basic block as a possible jump destination of an asm goto would cause a BTI instruction to be placed at the start of the block, in case the assembly code used an indirect branch instruction (i.e. to a destination address read from a register) to jump to that location. Now it doesn't do that any more: branches to destination labels from the assembly code are assumed to be direct branches (to a relative offset encoded in the instruction), which don't require a BTI at their destination.

This change was proposed in https://discourse.llvm.org/t/85845 and there seemed to be no disagreement. The rationale is:

  1. it brings clang's handling of asm goto in Arm and AArch64 in line with gcc's, which didn't generate BTIs at the target labels in the first place.

  2. it improves performance in the Linux kernel, which uses a lot of 'asm goto' in which the assembly language just contains a NOP, and the label's address is saved elsewhere to let the kernel self-modify at run time to swap between the original NOP and a direct branch to the label. This allows hot code paths to be instrumented for debugging, at only the cost of a NOP when the instrumentation is turned off, instead of the larger cost of an indirect branch. In this situation a BTI is unnecessary (if the branch happens it's direct), and since the code paths are hot, also a noticeable performance hit.

Implementation:

SelectionDAGBuilder::visitCallBr is the place where 'asm goto' target labels are handled. It calls setIsInlineAsmBrIndirectTarget() on each target MachineBasicBlock. Previously it also called setMachineBlockAddressTaken(), which made hasAddressTaken() return true, which caused a BTI to be added in the Arm backends.

Now visitCallBr doesn't call setMachineBlockAddressTaken() any more on asm goto targets, but hasAddressTaken() also checks the flag set by setIsInlineAsmBrIndirectTarget(). So call sites that were using hasAddressTaken() don't need to be modified. But the Arm backends don't call hasAddressTaken() any more: instead they test two more specific query functions that cover all the reasons hasAddressTaken() might have returned true except being an asm goto target.

Testing:

The new test AArch64/callbr-asm-label-bti.ll is testing the actual change, where it expects not to see a bti instruction after [[LABEL]]. The rest of the test changes are all churn, due to the flags on basic blocks changing. Actual output code hasn't changed in any of the existing tests, only comments and diagnostics.

Further work:

RISCVIndirectBranchTracking.cpp and X86IndirectBranchTracking.cpp also call hasAddressTaken() in a way that might benefit from using the same more specific check I've put in ARMBranchTargets.cpp and AArch64BranchTargets.cpp. But I'm not sure of that, so in this commit I've only changed the Arm backends, and left those alone.


Patch is 70.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141562.diff

28 Files Affected:

  • (modified) clang/docs/LanguageExtensions.rst (+11)
  • (modified) llvm/docs/LangRef.rst (+8)
  • (modified) llvm/include/llvm/CodeGen/MachineBasicBlock.h (+16-1)
  • (modified) llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp (+2)
  • (modified) llvm/lib/CodeGen/BasicBlockPathCloning.cpp (+9-2)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+4-1)
  • (modified) llvm/lib/Target/AArch64/AArch64BranchTargets.cpp (+2-1)
  • (modified) llvm/lib/Target/ARM/ARMBranchTargets.cpp (+2-1)
  • (added) llvm/test/CodeGen/AArch64/callbr-asm-label-bti.ll (+40)
  • (modified) llvm/test/CodeGen/AArch64/callbr-asm-label.ll (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/callbr-asm-outputs-indirect-isel.ll (+15-15)
  • (modified) llvm/test/CodeGen/PowerPC/callbr-asm-outputs-indirect-isel.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/ppc64-inlineasm-clobber.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/inline-asm-mem-constraint.ll (+60-60)
  • (modified) llvm/test/CodeGen/X86/basic-block-sections-cloning-invalid.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-blockplacement.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-branch-folding.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-destinations.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-label-addr.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-outputs-indirect-isel-m32.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-outputs-indirect-isel.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-outputs-pred-succ.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-outputs.ll (+12-12)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-phi-placement.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/callbr-asm-sink.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/callbr-asm.ll (+7-7)
  • (modified) llvm/test/CodeGen/X86/shrinkwrap-callbr.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/tail-dup-asm-goto.ll (+2-2)
diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index a40dd4d1a1673..84b6c0c87a799 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -2042,6 +2042,17 @@ references can be used instead of numeric references.
       return -1;
   }
 
+ASM Goto versus Branch Target Enforcement
+=========================================
+
+Some target architectures implement branch target enforcement, by requiring
+indirect (register-controlled) branch instructions to jump only to locations
+marked by a special instruction (such as AArch64 ``bti``).
+
+The assembler code inside an ``asm goto`` statement is expected not to use a
+branch instruction of that kind to transfer control to any of its destination
+labels. Therefore, using a label in an ``asm goto`` statement does not cause
+clang to put a ``bti`` or equivalent instruction at the label.
 
 Constexpr strings in GNU ASM statements
 =======================================
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 8c0a046d3a7e9..6a4bf6e594d14 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -9596,6 +9596,14 @@ may not be equal to the address provided for the same block to this
 instruction's ``indirect labels`` operand. The assembly code may only transfer
 control to addresses provided via this instruction's ``indirect labels``.
 
+On target architectures that implement branch target enforcement by requiring
+indirect (register-controlled) branch instructions to jump only to locations
+marked by a special instruction (such as AArch64 ``bti``), the called code is
+expected not to use such an indirect branch to transfer control to the
+locations in ``indirect labels``. Therefore, including a label in the
+``indirect labels`` of a ``callbr`` does not require the compiler to put a
+``bti`` or equivalent instruction at the label.
+
 Arguments:
 """"""""""
 
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 201e35d30cee2..284756e0b8e30 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -275,18 +275,33 @@ class MachineBasicBlock
   /// of a terminator, exception-handling target, or jump table. This is
   /// either the result of an IR-level "blockaddress", or some form
   /// of target-specific branch lowering.
+  ///
+  /// The name of this function `hasAddressTaken` implies that the address of
+  /// the block is known and used in a general sense, but not necessarily that
+  /// the address is used by an indirect branch instruction. So branch target
+  /// enforcement need not put a BTI instruction (or equivalent) at the start
+  /// of a block just because this function returns true. The decision about
+  /// whether to add a BTI can be more subtle than that, and depends on the
+  /// more detailed checks that this function aggregates together.
   bool hasAddressTaken() const {
-    return MachineBlockAddressTaken || AddressTakenIRBlock;
+    return MachineBlockAddressTaken || AddressTakenIRBlock ||
+           IsInlineAsmBrIndirectTarget;
   }
 
   /// Test whether this block is used as something other than the target of a
   /// terminator, exception-handling target, jump table, or IR blockaddress.
   /// For example, its address might be loaded into a register, or
   /// stored in some branch table that isn't part of MachineJumpTableInfo.
+  ///
+  /// If this function returns true, it _does_ mean that branch target
+  /// enforcement needs to put a BTI or equivalent at the start of the block.
   bool isMachineBlockAddressTaken() const { return MachineBlockAddressTaken; }
 
   /// Test whether this block is the target of an IR BlockAddress.  (There can
   /// more than one MBB associated with an IR BB where the address is taken.)
+  ///
+  /// If this function returns true, it _does_ mean that branch target
+  /// enforcement needs to put a BTI or equivalent at the start of the block.
   bool isIRBlockAddressTaken() const { return AddressTakenIRBlock; }
 
   /// Retrieves the BasicBlock which corresponds to this MachineBasicBlock.
diff --git a/llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp b/llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
index 1322973cb92de..b431896203bab 100644
--- a/llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
+++ b/llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
@@ -4330,6 +4330,8 @@ void AsmPrinter::emitBasicBlockStart(const MachineBasicBlock &MBB) {
       OutStreamer->emitLabel(Sym);
   } else if (isVerbose() && MBB.isMachineBlockAddressTaken()) {
     OutStreamer->AddComment("Block address taken");
+  } else if (isVerbose() && MBB.isInlineAsmBrIndirectTarget()) {
+    OutStreamer->AddComment("Inline asm indirect target");
   }
 
   // Print some verbose block comments.
diff --git a/llvm/lib/CodeGen/BasicBlockPathCloning.cpp b/llvm/lib/CodeGen/BasicBlockPathCloning.cpp
index 19f824850607c..b58c60d1db0a9 100644
--- a/llvm/lib/CodeGen/BasicBlockPathCloning.cpp
+++ b/llvm/lib/CodeGen/BasicBlockPathCloning.cpp
@@ -121,14 +121,21 @@ bool IsValidCloning(const MachineFunction &MF,
       }
       if (PathBB->isMachineBlockAddressTaken()) {
         // Avoid cloning blocks which have their address taken since we can't
-        // rewire branches to those blocks as easily (e.g., branches within
-        // inline assembly).
+        // rewire branches to those blocks as easily.
         WithColor::warning()
             << "block #" << BBID
             << " has its machine block address taken in function "
             << MF.getName() << "\n";
         return false;
       }
+      if (PathBB->isInlineAsmBrIndirectTarget()) {
+        // Similarly for branches to the block within an asm goto.
+        WithColor::warning()
+            << "block #" << BBID
+            << " is a branch target of an 'asm goto' in function "
+            << MF.getName() << "\n";
+        return false;
+      }
     }
 
     if (I != ClonePath.size() - 1 && !PathBB->empty() &&
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index ca195cb37de8a..47f532fe034d9 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -3399,7 +3399,10 @@ void SelectionDAGBuilder::visitCallBr(const CallBrInst &I) {
     BasicBlock *Dest = I.getIndirectDest(i);
     MachineBasicBlock *Target = FuncInfo.getMBB(Dest);
     Target->setIsInlineAsmBrIndirectTarget();
-    Target->setMachineBlockAddressTaken();
+    // If there was a type of asm goto statement that was permitted to
+    // use an indirect call instruction to jump to its labels, then we
+    // would also have to call Target->setMachineBlockAddressTaken()
+    // here to mark the target block as requiring a BTI.
     Target->setLabelMustBeEmitted();
     // Don't add duplicate machine successors.
     if (Dests.insert(Dest).second)
diff --git a/llvm/lib/Target/AArch64/AArch64BranchTargets.cpp b/llvm/lib/Target/AArch64/AArch64BranchTargets.cpp
index 2a96f15c20f6b..3436dc9ef4521 100644
--- a/llvm/lib/Target/AArch64/AArch64BranchTargets.cpp
+++ b/llvm/lib/Target/AArch64/AArch64BranchTargets.cpp
@@ -99,7 +99,8 @@ bool AArch64BranchTargets::runOnMachineFunction(MachineFunction &MF) {
 
     // If the block itself is address-taken, it could be indirectly branched
     // to, but not called.
-    if (MBB.hasAddressTaken() || JumpTableTargets.count(&MBB))
+    if (MBB.isMachineBlockAddressTaken() || MBB.isIRBlockAddressTaken() ||
+        JumpTableTargets.count(&MBB))
       CouldJump = true;
 
     if (CouldCall || CouldJump) {
diff --git a/llvm/lib/Target/ARM/ARMBranchTargets.cpp b/llvm/lib/Target/ARM/ARMBranchTargets.cpp
index 17d0bdd875121..409482b9679d8 100644
--- a/llvm/lib/Target/ARM/ARMBranchTargets.cpp
+++ b/llvm/lib/Target/ARM/ARMBranchTargets.cpp
@@ -77,7 +77,8 @@ bool ARMBranchTargets::runOnMachineFunction(MachineFunction &MF) {
     // modes. These modes do not support PACBTI. As a result, BTI instructions
     // are not added in the destination blocks.
 
-    if (IsFirstBB || MBB.hasAddressTaken() || MBB.isEHPad()) {
+    if (IsFirstBB || MBB.isMachineBlockAddressTaken() ||
+        MBB.isIRBlockAddressTaken() || MBB.isEHPad()) {
       addBTI(TII, MBB, IsFirstBB);
       MadeChange = true;
     }
diff --git a/llvm/test/CodeGen/AArch64/callbr-asm-label-bti.ll b/llvm/test/CodeGen/AArch64/callbr-asm-label-bti.ll
new file mode 100644
index 0000000000000..657b5e304c7c0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/callbr-asm-label-bti.ll
@@ -0,0 +1,40 @@
+; RUN: llc < %s -mtriple=aarch64-linux-gnu | FileCheck %s
+
+; Test function which compares two integers and returns the value of
+; the overflow flag, by using an asm goto to make the asm block branch
+; based on that flag, and then a phi to set the return value based on
+; whether the branch was taken.
+define i32 @overflow(i64 %a, i64 %b) #0 {
+asm:
+  callbr void asm sideeffect "cmp $0, $1 \0A\09 b.vs ${2:l}",
+          "r,r,!i,~{cc}"(i64 %a, i64 %b)
+          to label %fallthrough [label %indirect]
+
+indirect:
+  br label %fallthrough
+
+fallthrough:
+  ; Return 1 if we came via the 'indirect' block (because the b.vs was
+  ; taken), and 0 if we came straight from the asm block (because it
+  ; was untaken).
+  %retval = phi i32 [0, %asm], [1, %indirect]
+  ret i32 %retval
+}
+
+; CHECK: overflow:
+; CHECK-NEXT: .cfi_startproc
+; CHECK-NEXT: // %bb.{{[0-9]+}}:
+; CHECK-NEXT: bti c
+; CHECK-NEXT: //APP
+; CHECK-NEXT: cmp x0, x1
+; CHECK-NEXT: b.vs [[LABEL:\.[A-Za-z0-9_]+]]
+; CHECK-NEXT: //NO_APP
+; CHECK-NEXT: // %bb.{{[0-9]+}}:
+; CHECK-NEXT: mov w0, wzr
+; CHECK-NEXT: ret
+; CHECK-NEXT: [[LABEL]]:
+; CHECK-NOT:  bti
+; CHECK:      mov w0, #1
+; CHECK-NEXT: ret
+
+attributes #0 = { "branch-target-enforcement" "target-features"="+bti" }
diff --git a/llvm/test/CodeGen/AArch64/callbr-asm-label.ll b/llvm/test/CodeGen/AArch64/callbr-asm-label.ll
index 1818f94a831b9..0a49cfa11afec 100644
--- a/llvm/test/CodeGen/AArch64/callbr-asm-label.ll
+++ b/llvm/test/CodeGen/AArch64/callbr-asm-label.ll
@@ -7,7 +7,7 @@ define i32 @test1() {
 ; CHECK:         .word b
 ; CHECK-NEXT:    .word .LBB0_2
 ; CHECK: // %bb.1:
-; CHECK: .LBB0_2: // Block address taken
+; CHECK: .LBB0_2: // Inline asm indirect target
 entry:
   callbr void asm sideeffect "1:\0A\09.word b, ${0:l}\0A\09", "!i"()
           to label %cleanup [label %indirect]
@@ -31,7 +31,7 @@ entry:
 if.then:
 ; CHECK:       .word b
 ; CHECK-NEXT:  .word .LBB1_3
-; CHECK:       .LBB1_3: // Block address taken
+; CHECK:       .LBB1_3: // Inline asm indirect target
   callbr void asm sideeffect "1:\0A\09.word b, ${0:l}\0A\09", "!i"()
           to label %if.then4 [label %if.end6]
 
@@ -46,7 +46,7 @@ if.end6:
   br i1 %phitmp, label %if.end10, label %if.then9
 
 if.then9:
-; CHECK: .LBB1_5: // Block address taken
+; CHECK: .LBB1_5: // Inline asm indirect target
   callbr void asm sideeffect "", "!i"()
           to label %if.end10 [label %l_yes]
 
diff --git a/llvm/test/CodeGen/AArch64/callbr-asm-outputs-indirect-isel.ll b/llvm/test/CodeGen/AArch64/callbr-asm-outputs-indirect-isel.ll
index fbe89e70e4d8e..00d5d20de8fe8 100644
--- a/llvm/test/CodeGen/AArch64/callbr-asm-outputs-indirect-isel.ll
+++ b/llvm/test/CodeGen/AArch64/callbr-asm-outputs-indirect-isel.ll
@@ -22,7 +22,7 @@ define i32 @test0() {
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gpr32all = COPY %5
   ; CHECK-NEXT:   B %bb.2
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.1.entry.indirect_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.1.entry.indirect_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.5(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY %5
@@ -35,7 +35,7 @@ define i32 @test0() {
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gpr32all = COPY %7
   ; CHECK-NEXT:   B %bb.4
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.3.direct.indirect_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.3.direct.indirect_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.5(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gpr32all = COPY %7
@@ -87,7 +87,7 @@ define i32 @dont_split0() {
   ; CHECK-NEXT:   $w0 = COPY [[MOVi32imm]]
   ; CHECK-NEXT:   RET_ReallyLR implicit $w0
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.y (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.2.y (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gpr32all = COPY $wzr
   ; CHECK-NEXT:   $w0 = COPY [[COPY]]
   ; CHECK-NEXT:   RET_ReallyLR implicit $w0
@@ -116,7 +116,7 @@ define i32 @dont_split1() {
   ; CHECK-NEXT:   $w0 = COPY [[MOVi32imm]]
   ; CHECK-NEXT:   RET_ReallyLR implicit $w0
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.y (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.2.y (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   $w0 = COPY %1
   ; CHECK-NEXT:   RET_ReallyLR implicit $w0
 entry:
@@ -147,7 +147,7 @@ define i32 @dont_split2() {
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY $wzr
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gpr32all = COPY [[COPY1]]
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.y (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.2.y (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   [[PHI:%[0-9]+]]:gpr32all = PHI [[COPY]], %bb.0, [[COPY2]], %bb.1
   ; CHECK-NEXT:   $w0 = COPY [[PHI]]
   ; CHECK-NEXT:   RET_ReallyLR implicit $w0
@@ -174,7 +174,7 @@ define i32 @dont_split3() {
   ; CHECK-NEXT: bb.1.x:
   ; CHECK-NEXT:   successors: %bb.2(0x80000000)
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.v (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.2.v (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   [[MOVi32imm:%[0-9]+]]:gpr32 = MOVi32imm 42
   ; CHECK-NEXT:   $w0 = COPY [[MOVi32imm]]
   ; CHECK-NEXT:   RET_ReallyLR implicit $w0
@@ -198,7 +198,7 @@ define i32 @split_me0() {
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gpr32all = COPY %3
   ; CHECK-NEXT:   B %bb.2
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.1.entry.y_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.1.entry.y_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.3(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY %3
@@ -248,7 +248,7 @@ define i32 @split_me1(i1 %z) {
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY %5
   ; CHECK-NEXT:   B %bb.3
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.w.v_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.2.w.v_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.4(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gpr32all = COPY %5
@@ -301,7 +301,7 @@ define i32 @split_me2(i1 %z) {
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gpr32all = COPY %6
   ; CHECK-NEXT:   B %bb.3
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.w.v_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.2.w.v_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.4(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gpr32all = COPY %6
@@ -349,7 +349,7 @@ define i32 @dont_split4() {
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   B %bb.3
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.y (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.2.y (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.3(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY %3
@@ -383,7 +383,7 @@ define i32 @dont_split5() {
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gpr32all = COPY %3
   ; CHECK-NEXT:   B %bb.2
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.1.y (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.1.y (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.2(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY %3
@@ -414,7 +414,7 @@ define i32 @split_me3() {
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gpr32all = COPY %3
   ; CHECK-NEXT:   B %bb.2
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.1.entry.out_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.1.entry.out_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.3(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY %3
@@ -460,7 +460,7 @@ define i32 @dont_split6(i32 %0) {
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gpr32all = COPY %4
   ; CHECK-NEXT:   B %bb.3
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.2.loop.loop_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.2.loop.loop_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.1(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gpr32all = COPY %4
@@ -495,7 +495,7 @@ define i32 @split_me4() {
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gpr32all = COPY %3
   ; CHECK-NEXT:   B %bb.2
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.1.entry.same_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.1.entry.same_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.2(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY %3
@@ -526,7 +526,7 @@ define i32 @split_me5() {
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gpr32all = COPY %3
   ; CHECK-NEXT:   B %bb.2
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.1.entry.same_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.1.entry.same_crit_edge (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.2(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gpr32all = COPY %3
diff --git a/llvm/test/CodeGen/PowerPC/callbr-asm-outputs-indirect-isel.ll b/llvm/test/CodeGen/PowerPC/callbr-asm-outputs-indirect-isel.ll
index 987b8da75ccf0..d52a9dbc577a8 100644
--- a/llvm/test/CodeGen/PowerPC/callbr-asm-outputs-indirect-isel.ll
+++ b/llvm/test/CodeGen/PowerPC/callbr-asm-outputs-indirect-isel.ll
@@ -22,7 +22,7 @@ define void @strncpy_from_kernel_nofault_count() {
   ; CHECK-NEXT: bb.2.Efault:
   ; CHECK-NEXT:   BLR8 implicit $lr8, implicit $rm
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT: bb.3.Efault.split (machine-block-address-taken, inlineasm-br-indirect-target):
+  ; CHECK-NEXT: bb.3.Efault.split (inlineasm-br-indirect-target):
   ; CHECK-NEXT:   successors: %bb.2(0x80000000)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   STB %1, 0, $zero8 :: (store (s8) into `ptr null`)
diff --git a/llvm/test/CodeGen/PowerPC/ppc64-inlineasm-clobber.ll b/llvm/test/CodeGen/PowerPC/ppc64-inlineasm-clobber.ll
index 57bc882f6046e..3316f1b0b87a3 100644
--- a/llvm/test/CodeGen/PowerPC/ppc64-inlineasm-clobber.ll
+++ b/llvm/test/CodeGen/PowerPC/ppc64-inlineasm-clobber.ll
@@ -86,7 +86,7 @@ define dso_local signext i32 @ClobberLR_BR(i32 signext %in) #0 {
 ; PPC64LE-NEXT:    ld r0, 16(r1)
 ; PPC64LE-NEXT:    mtlr r0
 ; PPC64LE-NEXT:    blr
-; PPC64LE-NEXT:  .LBB3_2: # Block address taken
+; PPC64LE-NEXT:  .LBB3_2: # Inline asm indirect target
 ; PPC64LE-NEXT:    # %return_early
 ; PPC64LE-NEXT:    # Label of block must be emitted
 ; PPC64LE-NEXT:    li r3, 0
@@ -105,7 +105,7 @@ define dso_local signext i32 @ClobberLR_BR(i32 signext %in) #0 {
 ; PPC64BE-NEXT:    ld r0, 16(r1)
 ; PPC64BE-NEXT:    mtlr r0
 ; PPC64BE-NEXT:    blr
-; PPC64BE-NEXT:  .LBB3_2: # Block address taken
+; PPC64BE-NEXT:  .LBB3_2: # Inline asm indirect target
 ; PPC64BE-NEXT:    # %return_early
 ; PPC64BE-NEXT:    # Label of block must be emitted
 ; PPC64BE-NEXT:    li r3, 0
@@ -130,7 +130,7 @@ define dso_local signext i32 @ClobberR5_BR(i32 signext %in) #0 {
 ; PPC64LE-NEXT:    #NO_APP
 ; PPC64LE-NEXT:  # %bb.1: # %return
 ; PPC64LE-NEXT:    blr
-; PPC64LE-NEXT:  .LBB4_2: # Block address taken
+; PPC64LE-NEXT:  .LBB4_2: # Inline asm indirect target
 ; PPC64LE-NEXT:    # %return_early
 ; PPC64LE-NEXT:    # Label of block must be emitted
 ; PPC64LE-NEXT:    li r3, 0
@@ -143,7 +143,7 @@ define dso_local signext i32 @ClobberR5_BR(i32 signex...
[truncated]

@mrutland-arm
Copy link

I've given this a test, building a v6.15 defconfig arm64 Linux kernel with LLVM built with and without this change, and the reduction is impressive:

before after change
BTI J instructions 23,323 4,207 -19,116 (-82%)
.text size (bytes) 20,418,121 20,343,417 -74,704 (-0.37%)
Image size (bytes) 39,195,136 39,129,600 -65,536 (-0.17%)
vmlinux size (bytes) 411,181,384 409,784,800 -1,396,584 (-0.34%)

I've also given that kernel a basic boot test on a model with BTI support, and all looks well so far.

I do not current have BTI-capable hardware, so I'm not able to get performance figures. Regardless, I think this is pretty clearly a win.

@statham-arm
Copy link
Collaborator Author

Wow, thanks for the proactive testing!

Copy link
Collaborator

@smithp35 smithp35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks in line with what was discussed in the Discourse thread.

I've got a small number of comment and documentation suggestions but nothing code related.

@@ -3399,7 +3399,10 @@ void SelectionDAGBuilder::visitCallBr(const CallBrInst &I) {
BasicBlock *Dest = I.getIndirectDest(i);
MachineBasicBlock *Target = FuncInfo.getMBB(Dest);
Target->setIsInlineAsmBrIndirectTarget();
Target->setMachineBlockAddressTaken();
// If there was a type of asm goto statement that was permitted to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment makes sense in the context of this review. I'm not sure if I would understand it without.

I'd be tempted to remove it, relying on someone implementing a future extension to look at this patch to work out what to do. Alternatively it could just be a small rewording. Using "was" implies past tense to me, but I think this comment is about what to do if another type of asm goto is added, which would be in the future.

// If there was a type of asm goto statement that was permitted to
// use an indirect call instruction to jump to its labels, then we
// would also have to call Target->setMachineBlockAddressTaken()
// here to mark the target block as requiring a BTI.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding a newline to imply that the comment isn't applying to the next line.

WithColor::warning()
<< "block #" << BBID
<< " is a branch target of an 'asm goto' in function "
<< MF.getName() << "\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't handle anonymous functions correctly, in this context probably should use the mangled name

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was surely true already of the "has its machine block address taken" clause just above, which I duplicated to make this version with the changed wording?

Are you asking me to do that unrelated fix as part of this same commit? It seems to me that if it needs fixing at all it should be done separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous warning was broken, and this one is too. We probably should remove or rename getName, it's usually not appropriate to use outside of debugger contexts. Can fix separately

Copy link
Collaborator

@smithp35 smithp35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I've resolved my comments.

I've asked internally to see if I can find anyone else that would like to comment. If I don't get any takers I'll set approved from my side later in the week.

Copy link
Member

@MaskRay MaskRay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@statham-arm statham-arm merged commit 56acb06 into llvm:main Jun 3, 2025
12 checks passed
@statham-arm statham-arm deleted the asm-goto-vs-bti branch June 3, 2025 07:44
sallto pushed a commit to sallto/llvm-project that referenced this pull request Jun 3, 2025
In 'asm goto' statements ('callbr' in LLVM IR), you can specify one or
more labels / basic blocks in the containing function which the assembly
code might jump to. If you're also compiling with branch target
enforcement via BTI, then previously listing a basic block as a possible
jump destination of an asm goto would cause a BTI instruction to be
placed at the start of the block, in case the assembly code used an
_indirect_ branch instruction (i.e. to a destination address read from a
register) to jump to that location. Now it doesn't do that any more:
branches to destination labels from the assembly code are assumed to be
direct branches (to a relative offset encoded in the instruction), which
don't require a BTI at their destination.

This change was proposed in https://discourse.llvm.org/t/85845 and there
seemed to be no disagreement. The rationale is:

1. it brings clang's handling of asm goto in Arm and AArch64 in line
with gcc's, which didn't generate BTIs at the target labels in the first
place.

2. it improves performance in the Linux kernel, which uses a lot of 'asm
goto' in which the assembly language just contains a NOP, and the
label's address is saved elsewhere to let the kernel self-modify at run
time to swap between the original NOP and a direct branch to the label.
This allows hot code paths to be instrumented for debugging, at only the
cost of a NOP when the instrumentation is turned off, instead of the
larger cost of an indirect branch. In this situation a BTI is
unnecessary (if the branch happens it's direct), and since the code
paths are hot, also a noticeable performance hit.

Implementation:

`SelectionDAGBuilder::visitCallBr` is the place where 'asm goto' target
labels are handled. It calls `setIsInlineAsmBrIndirectTarget()` on each
target `MachineBasicBlock`. Previously it also called
`setMachineBlockAddressTaken()`, which made `hasAddressTaken()` return
true, which caused a BTI to be added in the Arm backends.

Now `visitCallBr` doesn't call `setMachineBlockAddressTaken()` any more
on asm goto targets, but `hasAddressTaken()` also checks the flag set by
`setIsInlineAsmBrIndirectTarget()`. So call sites that were using
`hasAddressTaken()` don't need to be modified. But the Arm backends
don't call `hasAddressTaken()` any more: instead they test two more
specific query functions that cover all the reasons `hasAddressTaken()`
might have returned true _except_ being an asm goto target.

Testing:

The new test `AArch64/callbr-asm-label-bti.ll` is testing the actual
change, where it expects not to see a `bti` instruction after
`[[LABEL]]`. The rest of the test changes are all churn, due to the
flags on basic blocks changing. Actual output code hasn't changed in any
of the existing tests, only comments and diagnostics.

Further work:

`RISCVIndirectBranchTracking.cpp` and `X86IndirectBranchTracking.cpp`
also call `hasAddressTaken()` in a way that might benefit from using the
same more specific check I've put in `ARMBranchTargets.cpp` and
`AArch64BranchTargets.cpp`. But I'm not sure of that, so in this commit
I've only changed the Arm backends, and left those alone.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AArch64 backend:ARM backend:PowerPC backend:X86 clang Clang issues not falling into any other category llvm:ir llvm:SelectionDAG SelectionDAGISel as well
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants