
[TailDuplicator] Do not restrict the computed gotos #114990

Merged
merged 2 commits into llvm:main from dianqk:computed-goto on Mar 10, 2025

Conversation

@dianqk (Member) commented Nov 5, 2024

Fixes #106846.

This follows what I learned from GCC: GCC does not tail-duplicate blocks whose indirect jumps go through a jump table. I believe GCC has provided a clear explanation here:

> Duplicate the blocks containing computed gotos. This basically unfactors computed gotos that were factored early on in the compilation process to speed up edge based data flow. We used to not unfactor them again, which can seriously pessimize code with many computed jumps in the source code, such as interpreters.
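
To make "factoring" concrete, here is a minimal GNU C sketch (illustrative only; the opcodes and names are made up, not taken from this PR). Early compilation merges the computed goto at the end of every handler into one shared dispatch block, and tail duplication "unfactors" it by copying the jump back into each handler:

/* Factored form: one shared dispatch block (GNU C computed goto). */
int run(const unsigned char *pc) {
  static void *table[] = { &&op_add, &&op_sub, &&op_halt };
  int acc = 0;
  goto *table[*pc++]; /* initial dispatch */

dispatch: /* the single, factored dispatch block */
  goto *table[*pc++];

op_add:
  acc += 1;
  goto dispatch; /* every handler funnels through one indirect branch */
op_sub:
  acc -= 1;
  goto dispatch;
op_halt:
  return acc;
}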

@llvmbot (Member) commented Nov 5, 2024

@llvm/pr-subscribers-backend-x86

Author: DianQK (DianQK)

Changes

GCC's rationale (quoted in the description above): https://github.com/gcc-mirror/gcc/blob/7f67acf60c5429895d7c9e5df81796753e2913e0/gcc/bb-reorder.cc#L2757-L2761


Full diff: https://github.com/llvm/llvm-project/pull/114990.diff

3 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineInstr.h (+6-2)
  • (modified) llvm/lib/CodeGen/TailDuplicator.cpp (+12-10)
  • (added) llvm/test/CodeGen/X86/tail-dup-computed-goto.mir (+310)
diff --git a/llvm/include/llvm/CodeGen/MachineInstr.h b/llvm/include/llvm/CodeGen/MachineInstr.h
index ead6bbe1d5f641..6d268628ec14b4 100644
--- a/llvm/include/llvm/CodeGen/MachineInstr.h
+++ b/llvm/include/llvm/CodeGen/MachineInstr.h
@@ -986,8 +986,12 @@ class MachineInstr
 
   /// Return true if this is an indirect branch, such as a
   /// branch through a register.
-  bool isIndirectBranch(QueryType Type = AnyInBundle) const {
-    return hasProperty(MCID::IndirectBranch, Type);
+  bool isIndirectBranch(QueryType Type = AnyInBundle,
+                        bool IncludeJumpTable = true) const {
+    return hasProperty(MCID::IndirectBranch, Type) &&
+           (IncludeJumpTable || !llvm::any_of(operands(), [](const auto &Op) {
+              return Op.isJTI();
+            }));
   }
 
   /// Return true if this is a branch which may fall
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 3f2e1511d403a0..988c58beac20f6 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -603,17 +603,19 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
       TailBB.canFallThrough())
     return false;
 
-  // If the target has hardware branch prediction that can handle indirect
-  // branches, duplicating them can often make them predictable when there
-  // are common paths through the code.  The limit needs to be high enough
-  // to allow undoing the effects of tail merging and other optimizations
-  // that rearrange the predecessors of the indirect branch.
-
-  bool HasIndirectbr = false;
+  // Only duplicate the blocks containing computed gotos. This basically
+  // unfactors computed gotos that were factored early on in the compilation
+  // process to speed up edge based data flow. If we do not unfactor them again,
+  // it can seriously pessimize code with many computed jumps in the source
+  // code, such as interpreters.
+  bool HasComputedGoto = false;
   if (!TailBB.empty())
-    HasIndirectbr = TailBB.back().isIndirectBranch();
+    HasComputedGoto = TailBB.back().isIndirectBranch(
+        /*Type=*/MachineInstr::AnyInBundle,
+        // Jump tables are not considered computed gotos.
+        /*IncludeJumpTable=*/false);
 
-  if (HasIndirectbr && PreRegAlloc)
+  if (HasComputedGoto && PreRegAlloc)
     MaxDuplicateCount = TailDupIndirectBranchSize;
 
   // Check the instructions in the block to determine whether tail-duplication
@@ -685,7 +687,7 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
     }
   }
 
-  if (HasIndirectbr && PreRegAlloc)
+  if (HasComputedGoto && PreRegAlloc)
     return true;
 
   if (IsSimple)
diff --git a/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir b/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir
new file mode 100644
index 00000000000000..b1c699c11f4619
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir
@@ -0,0 +1,310 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=x86_64-unknown-linux-gnu -run-pass=early-tailduplication -tail-dup-size=0 %s -o - | FileCheck %s
+# Check that only the computed goto is duplicated.
+--- |
+  declare i64 @f0()
+  declare i64 @f1()
+  declare i64 @f2()
+  declare i64 @f3()
+  declare i64 @f4()
+  declare i64 @f5()
+  @computed_goto.dispatch = external global [5 x ptr]
+  define void @computed_goto() { ret void }
+  define void @jump_table() { ret void }
+...
+---
+name:            computed_goto
+alignment:       16
+tracksRegLiveness: true
+noPhis:          false
+isSSA:           true
+noVRegs:         false
+hasFakeUses:     false
+debugInstrRef:   true
+registers:
+  - { id: 0, class: gr64 }
+  - { id: 1, class: gr64 }
+  - { id: 2, class: gr64 }
+  - { id: 3, class: gr64 }
+  - { id: 4, class: gr64 }
+  - { id: 5, class: gr64_nosp }
+  - { id: 6, class: gr64 }
+  - { id: 7, class: gr64 }
+  - { id: 8, class: gr64 }
+  - { id: 9, class: gr64 }
+  - { id: 10, class: gr64 }
+frameInfo:
+  maxAlignment:    1
+  adjustsStack:    true
+  hasCalls:        true
+machineFunctionInfo:
+  amxProgModel:    None
+body:             |
+  ; CHECK-LABEL: name: computed_goto
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY]]
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY1]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:gr64_nosp = COPY [[COPY3]]
+  ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:gr64_nosp = COPY [[COPY4]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY4]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:gr64_nosp = COPY [[COPY6]]
+  ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:gr64_nosp = COPY [[COPY7]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY7]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY9:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY10:%[0-9]+]]:gr64_nosp = COPY [[COPY9]]
+  ; CHECK-NEXT:   [[COPY11:%[0-9]+]]:gr64_nosp = COPY [[COPY10]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY10]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY12:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY13:%[0-9]+]]:gr64_nosp = COPY [[COPY12]]
+  ; CHECK-NEXT:   [[COPY14:%[0-9]+]]:gr64_nosp = COPY [[COPY13]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY13]], @computed_goto.dispatch, $noreg
+  bb.0:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %6:gr64 = COPY $rax
+    %0:gr64 = COPY %6
+    JMP_1 %bb.5
+
+  bb.1:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %10:gr64 = COPY $rax
+    %1:gr64 = COPY %10
+    JMP_1 %bb.5
+
+  bb.2:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %9:gr64 = COPY $rax
+    %2:gr64 = COPY %9
+    JMP_1 %bb.5
+
+  bb.3:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %8:gr64 = COPY $rax
+    %3:gr64 = COPY %8
+    JMP_1 %bb.5
+
+  bb.4:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %7:gr64 = COPY $rax
+    %4:gr64 = COPY %7
+
+  bb.5:
+    successors: %bb.1, %bb.2, %bb.3, %bb.4
+
+    %5:gr64_nosp = PHI %0, %bb.0, %4, %bb.4, %3, %bb.3, %2, %bb.2, %1, %bb.1
+    JMP64m $noreg, 8, %5, @computed_goto.dispatch, $noreg
+
+...
+---
+name:            jump_table
+alignment:       16
+tracksRegLiveness: true
+noPhis:          false
+isSSA:           true
+noVRegs:         false
+hasFakeUses:     false
+debugInstrRef:   true
+registers:
+  - { id: 0, class: gr64 }
+  - { id: 1, class: gr64 }
+  - { id: 2, class: gr64 }
+  - { id: 3, class: gr64 }
+  - { id: 4, class: gr64 }
+  - { id: 5, class: gr64 }
+  - { id: 6, class: gr64 }
+  - { id: 7, class: gr64 }
+  - { id: 8, class: gr64_nosp }
+  - { id: 9, class: gr64 }
+  - { id: 10, class: gr64 }
+  - { id: 11, class: gr64 }
+  - { id: 12, class: gr64 }
+  - { id: 13, class: gr64 }
+frameInfo:
+  maxAlignment:    1
+  adjustsStack:    true
+  hasCalls:        true
+machineFunctionInfo:
+  amxProgModel:    None
+jumpTable:
+  kind:            block-address
+  entries:
+    - id:              0
+      blocks:          [ '%bb.2', '%bb.3', '%bb.4', '%bb.5', '%bb.6' ]
+body:             |
+  ; CHECK-LABEL: name: jump_table
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gr64 = COPY [[COPY]]
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.2(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:gr64 = PHI [[COPY1]], %bb.0, %6, %bb.7, %5, %bb.6, %4, %bb.5, %3, %bb.4, %2, %bb.3
+  ; CHECK-NEXT:   [[DEC64r:%[0-9]+]]:gr64_nosp = DEC64r [[PHI]], implicit-def dead $eflags
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.3(0x1999999a), %bb.4(0x1999999a), %bb.5(0x1999999a), %bb.6(0x1999999a), %bb.7(0x1999999a)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[DEC64r]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gr64 = COPY [[COPY2]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:gr64 = COPY [[COPY4]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.5:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:gr64 = COPY [[COPY6]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.6:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY9:%[0-9]+]]:gr64 = COPY [[COPY8]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.7:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f5, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY10:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY11:%[0-9]+]]:gr64 = COPY [[COPY10]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.8:
+  bb.0:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %7:gr64 = COPY $rax
+    %0:gr64 = COPY %7
+
+  bb.1:
+    %1:gr64 = PHI %0, %bb.0, %6, %bb.6, %5, %bb.5, %4, %bb.4, %3, %bb.3, %2, %bb.2
+    %8:gr64_nosp = DEC64r %1, implicit-def dead $eflags
+
+  bb.8:
+    successors: %bb.2(0x1999999a), %bb.3(0x1999999a), %bb.4(0x1999999a), %bb.5(0x1999999a), %bb.6(0x1999999a)
+
+    JMP64m $noreg, 8, %8, %jump-table.0, $noreg :: (load (s64) from jump-table)
+
+  bb.2:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %13:gr64 = COPY $rax
+    %2:gr64 = COPY %13
+    JMP_1 %bb.1
+
+  bb.3:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %12:gr64 = COPY $rax
+    %3:gr64 = COPY %12
+    JMP_1 %bb.1
+
+  bb.4:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %11:gr64 = COPY $rax
+    %4:gr64 = COPY %11
+    JMP_1 %bb.1
+
+  bb.5:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %10:gr64 = COPY $rax
+    %5:gr64 = COPY %10
+    JMP_1 %bb.1
+
+  bb.6:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f5, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %9:gr64 = COPY $rax
+    %6:gr64 = COPY %9
+    JMP_1 %bb.1
+
+  bb.7:
+
+...

@dianqk (Member Author) commented Nov 5, 2024

I will revert #78582 after this PR is merged.

@efriedma-quic (Collaborator) commented

Please don't link to gcc source code in commit messages.

I think it might make sense to try to pursue a more general solution here... can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something. But special-casing indirectbr seems okay short-term.

@dianqk (Member Author) commented Nov 6, 2024

> Please don't link to gcc source code in commit messages.

Removed.

> I think it might make sense to try to pursue a more general solution here... can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something. But special-casing indirectbr seems okay short-term.

I think the current IR already looks the way you want: https://llvm.godbolt.org/z/o5YT191P9.
There’s only a single merged indirectbr here.

@efriedma-quic (Collaborator) commented

> I think the current IR already looks the way you want: https://llvm.godbolt.org/z/o5YT191P9.
> There’s only a single merged indirectbr here.

Currently, taildup means we end up with O(N^2) edges, even in the indirectbr case. This patch is just deciding that we're willing to pay that cost in the indirectbr case, because it's a strong hint that the code is important for performance. The question is, can we rework the backend representation to allow taildup without the explosion in the number of edges?
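
For a concrete sense of scale (my arithmetic, not numbers from the thread): with one factored dispatch block, an interpreter with n handlers has n edges into the dispatch block and n edges out of it; once the dispatch jump is duplicated into every handler, each of the n handlers may branch to all n handlers.

/* Back-of-the-envelope CFG edge counts, assuming every handler may jump to
   every handler. For n = 200 handlers this is 400 edges factored vs. 40,000
   duplicated. */
unsigned edges_factored(unsigned n)   { return 2 * n; } /* n in + n out of dispatch */
unsigned edges_duplicated(unsigned n) { return n * n; } /* each handler -> all n */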

@dianqk (Member Author) commented Nov 8, 2024

> I think the current IR already looks the way you want: https://llvm.godbolt.org/z/o5YT191P9.
> There’s only a single merged indirectbr here.
>
> Currently, taildup means we end up with O(N^2) edges, even in the indirectbr case. This patch is just deciding that we're willing to pay that cost in the indirectbr case, because it's a strong hint that the code is important for performance. The question is, can we rework the backend representation to allow taildup without the explosion in the number of edges?

Sounds reasonable to me. I can try to address it, although progress may be slow.

https://www.ajla-lang.cz/tutorial.html points out several issues:

> It is recommended to use gcc - the compilation will take several minutes and it will consume about 4GB memory when compiling the files ipret.c and ipretc.c. Compilation with clang works, but it is very slow, it may take an hour or more to compile the files ipret.c and ipretc.c.

@dianqk requested a review from fhahn November 19, 2024 01:29
@dianqk marked this pull request as draft November 19, 2024 01:30
@dianqk marked this pull request as ready for review November 21, 2024 04:00
@dianqk (Member Author) commented Dec 3, 2024

ping :p

@bzEq (Collaborator) commented Dec 3, 2024

> can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something.

The idea sounds good. It could be outlined as:

bb.0:
  succ: bb.0, bb.1, bb.2
  indirectbr %0
bb.1:
  succ: bb.0, bb.1, bb.2
  indirectbr %1
bb.2:
  succ: bb.0, bb.1, bb.2
  indirectbr %2

=>

bb.0:
  succ: bb.3
  indirectbr %0
bb.1:
  succ: bb.3
  indirectbr %1
bb.2:
  succ: bb.3
  indirectbr %2
bb.3:
  succ: bb.0, bb.1, bb.2
  %3 = phi(%0, %1, %2)
  PSEUDO_INDIRECT_BR %3
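
If I read the sketch right (my interpretation, not stated above): every block still ends in its own indirect branch as far as execution and prediction are concerned, but its CFG successor list collapses to the single fake block bb.3, and only bb.3 carries the full target list, so the edge count stays linear:

/* Hypothetical edge count under the PSEUDO_INDIRECT_BR scheme: n blocks each
   contribute one edge into the fake block, which fans out to the n targets. */
unsigned edges_with_fake_block(unsigned n) { return n + n; }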

@dianqk (Member Author) commented Dec 17, 2024

> can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something.
>
> The idea sounds good. It could be outlined as [the PSEUDO_INDIRECT_BR sketch above].

I don't fully understand the details of this improvement yet, but it shouldn't block this PR.

So... ping

@fhahn (Contributor) commented Jan 22, 2025

Just getting back to this, sorry! Is this the latest version of the patch that should fix the regressions?

I tried the patch when building Python on macOS: it improves performance by ~0.5%, while #116072 increases performance by 2-3%.

@dianqk (Member Author) commented Jan 23, 2025

> Just getting back to this, sorry! Is this the latest version of the patch that should fix the regressions?
>
> I tried the patch when building Python on macOS: it improves performance by ~0.5%, while #116072 increases performance by 2-3%.

I don't have a CPU like the i7-2640M, but I can observe a significant decrease in instructions:u from the perf stat output.

perf stat -r 3 ./ajla --nosave loop.ajla 1000000000
    98,475,825,975      instructions:u                   #    2.95  insn per cycle
perf stat -r 3 ./ajla --nosave loop.ajla 1000000000
    77,379,026,755      instructions:u                   #    2.09  insn per cycle

This should help CPUs with weaker branch prediction capabilities.

I will rebase this PR after #116072 lands.
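
For intuition about the branch-prediction claim (an illustrative GNU C sketch mirroring the factored example earlier in the thread, not code from this PR): after tail duplication each handler ends in its own computed goto, so a predictor indexed by branch address can learn per-handler target patterns, e.g. which opcode typically follows which.

/* Unfactored form after tail duplication (illustrative; made-up opcodes).
   Each `goto *table[*pc++];` below is a distinct branch site with its own
   prediction history, instead of one site shared by all handlers. */
int run_unfactored(const unsigned char *pc) {
  static void *table[] = { &&op_add, &&op_sub, &&op_halt };
  int acc = 0;
  goto *table[*pc++];

op_add:
  acc += 1;
  goto *table[*pc++];
op_sub:
  acc -= 1;
  goto *table[*pc++];
op_halt:
  return acc;
}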

@dianqk changed the title from "[TailDuplicator] Only duplicate the blocks containing computed gotos" to "[TailDuplicator] Do not restrict the computed gotos" on Jan 24, 2025
@dianqk (Member Author) commented Jan 24, 2025

Rebased.
cc @fhahn

@dianqk (Member Author) commented Mar 6, 2025

Ping ~ (From #106846 (comment))

indygreg added a commit to indygreg/toolchain-tools that referenced this pull request Mar 8, 2025
The vendored patches were produced from the latest versions of the
following PRs:

* llvm/llvm-project#114990
* llvm/llvm-project#120267

The first improves codegen for computed gotos. There was a
regression in LLVM 19 causing a ~10% performance drop in CPython.

The second enables BOLT to work with computed gotos. This enables
BOLT to accomplish more on CPython.
@dianqk (Member Author) commented Mar 10, 2025

@dtcxzyw @nikic @fhahn

Ping, I believe this issue is quietly and broadly affecting interpreters implemented using computed gotos. Compared to Clang 18, this shouldn't introduce new regressions, and I think we can move a bit more quickly on this.

@nikic (Contributor) left a comment

LGTM

@dianqk merged commit dd21aac into llvm:main Mar 10, 2025
8 checks passed
@dianqk deleted the computed-goto branch March 10, 2025 11:34
swift-ci pushed a commit to swiftlang/llvm-project that referenced this pull request Mar 11, 2025
Fixes llvm#106846. [...]

(cherry picked from commit dd21aac)
@bgra8 (Contributor) commented Mar 20, 2025

We (at google) have bisected a huge increase in memory use during compilation (for some specific source files) at this revision.

The previous compiler version builds those files using less than 4GB of memory, while at this revision the compiler exceeds 16GB (we didn't try larger memory limits).

Specifically, the clang process eventually runs out of memory when executed in a shell with ulimit -Sd 16291456.

The crashing stack due to out of memory shows:

1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module '<module name redacted>'.
4.      Running pass 'Early Tail Duplication' on function '@<function name redacted>'

Working on a reproducer.

LATER EDIT: before this change the compilation used under 500MB of memory. So the memory blowout is considerably larger for these cases.

@dianqk (Member Author) commented Mar 21, 2025

> We (at google) have bisected a huge increase in memory use during compilation (for some specific source files) at this revision. [...]

Could you also try Clang 18?

@bgra8 (Contributor) commented Mar 21, 2025

> Could you also try Clang 18?

Tried at c416b2e; the non-reduced test case does not exceed 512MB of memory.

@dianqk (Member Author) commented Mar 21, 2025

> Could you also try Clang 18?
>
> Tried at c416b2e; the non-reduced test case does not exceed 512MB of memory.

Could you reproduce this with -mllvm -tail-dup-pred-size=100000 before this PR? Since this PR effectively just re-enables an existing option, I think you should bisect again. Thanks!

@bgra8 (Contributor) commented Mar 21, 2025

> Could you reproduce this with -mllvm -tail-dup-pred-size=100000 before this PR?

Yes, it reproduces with that setting. So the issue is most likely caused by this PR's bypassing of the tail-dup-pred-size check for computed gotos, and this PR is the correct culprit for what we're seeing.

It was very likely a known fact that tail-dup-pred-size can cause this kind of issue, which is why it was given a finite default (for the case I'm looking at, I confirmed that 128 is the value of this parameter at which the compiler's memory usage grows significantly beyond 512MB).

So bypassing the tail-dup-pred-size check opens the door to the same class of issues the option was meant to control.

Can you please add another flag to control this bypass for computed gotos?

@dianqk (Member Author) commented Mar 21, 2025

> It was very likely a known fact that tail-dup-pred-size can cause this kind of issue, which is why it was given a finite default.

The PR that landed in LLVM 19 was also sent by me; I overlooked the difference between jump tables and computed gotos in that PR. So I think this regression is caused by some changes between these two PRs.

> So bypassing the tail-dup-pred-size check opens the door to the same class of issues the option was meant to control.
>
> Can you please add another flag to control this bypass for computed gotos?

As I mentioned in the PR description, the key difference is that for computed gotos we want duplication, to preserve the code's original intent and performance.

@alexfh (Contributor) commented Mar 21, 2025

We've found another problem after this commit: the compile time for a number of translation units (protobuf-generated, which means a lot of large switch statements) grew from under 5 seconds to "I have yet to see it finish" (it's been running for more than an hour at this point). The issue seems to be rather widespread in our codebase. So even if the actual problem is elsewhere (which may well be the case, given the analysis here), this commit unfortunately triggers it without any good workaround AFAIU.

I'm working on a shareable reproducer now, but it looks like we need a revert or a workaround (different from disabling the optimization completely, which is not an option) soon. It could take the shape of keeping the old behavior under a command-line option, for example.

If it helps, stack traces captured during the never-ending compilation look mostly like this:

  * frame #0: 0x000055555c32fa0c clang`hasSameSuccessors(llvm::MachineBasicBlock&, llvm::SmallPtrSetImpl<llvm::MachineBasicBlock const*>&) [inlined] llvm::SmallPtrSetImplBase::contains_imp(this=0x00007fffffff3710, Ptr=<unavailable>) const at SmallPtrSet.h:230:24
    frame #1: 0x000055555c32f9d3 clang`hasSameSuccessors(llvm::MachineBasicBlock&, llvm::SmallPtrSetImpl<llvm::MachineBasicBlock const*>&) [inlined] llvm::SmallPtrSetImpl<llvm::MachineBasicBlock const*>::count(this=0x00007fffffff3710, Ptr=<unavailable>) const at SmallPtrSet.h:453:12
    frame #2: 0x000055555c32f9d3 clang`hasSameSuccessors(BB=0x0000505fbb0b3a20, Successors=0x00007fffffff3710) at MachineBlockPlacement.cpp:827:21
    frame #3: 0x000055555c32e70a clang`(anonymous namespace)::MachineBlockPlacement::canTailDuplicateUnplacedPreds(this=0x00007fffffff3f80, BB=0x0000505fbc019120, Succ=0x0000505fbcca0fc0, Chain=0x0000505f8b25e000, BlockFilter=<unavailable>) at MachineBlockPlacement.cpp:1220:36
    frame #4: 0x000055555c32cbb1 clang`(anonymous namespace)::MachineBlockPlacement::buildChain(llvm::MachineBasicBlock const*, (anonymous namespace)::BlockChain&, llvm::SmallSetVector<llvm::MachineBasicBlock const*, 16u>*) [inlined] (anonymous namespace)::MachineBlockPlacement::selectBestSuccessor(this=0x00007fffffff3f80, BB=0x0000505fbc019120, Chain=0x0000505f8b25e000, BlockFilter=0x0000000000000000) at MachineBlockPlacement.cpp:1729:9
    frame #5: 0x000055555c32cb32 clang`(anonymous namespace)::MachineBlockPlacement::buildChain(this=0x00007fffffff3f80, HeadBB=0x0000505fbbf2d5e8, Chain=0x0000505f8b25e000, BlockFilter=0x0000000000000000) at MachineBlockPlacement.cpp:1918:19
    frame #6: 0x000055555c327539 clang`(anonymous namespace)::MachineBlockPlacement::buildCFGChains(this=0x00007fffffff3f80) at MachineBlockPlacement.cpp:2826:3
    frame #7: 0x000055555c324328 clang`(anonymous namespace)::MachineBlockPlacement::run(this=0x00007fffffff3f80, MF=0x0000505fbd572d00) at MachineBlockPlacement.cpp:3606:5
    frame #8: 0x000055555c326ee7 clang`(anonymous namespace)::MachineBlockPlacementLegacy::runOnMachineFunction(this=0x0000505fbe193dc0, MF=0x0000505fbd572d00) at MachineBlockPlacement.cpp:657:10
    frame #9: 0x000055555c377fcc clang`llvm::MachineFunctionPass::runOnFunction(this=0x0000505fbe193dc0, F=0x0000505fbe929ef8) at MachineFunctionPass.cpp:108:10
    frame #10: 0x000055555d818135 clang`llvm::FPPassManager::runOnFunction(this=0x0000505fbfe08380, F=0x0000505fbe929ef8) at LegacyPassManager.cpp:1406:27
    frame #11: 0x000055555d81ef7d clang`llvm::FPPassManager::runOnModule(this=0x0000505fbfe08380, M=<unavailable>) at LegacyPassManager.cpp:1452:16

@bgra8 (Contributor) commented Mar 21, 2025

Here's the reproducer for the OOM issue: repro.cc

With clang built before this PR, run this to show that the compilation needs no more than 150MB of memory:

$ (ulimit -Sd 150000;  \
   clang.before -cc1 -triple x86_64-generic-linux-gnu -emit-obj -target-cpu x86-64 \
     -O1 -std=gnu++20 \
     -o /tmp/repro.o  \
     /tmp/repro.cc)

With clang at this PR, run this to show that the compilation exceeds 600MB of memory:

$ (ulimit -Sd 650000;  \
   clang -cc1 -triple x86_64-generic-linux-gnu -emit-obj -target-cpu x86-64 \
     -O1 -std=gnu++20 \
     -o /tmp/repro.o \
     /tmp/repro.cc)

Please revert or add a way to avoid this issue (basically what @alexfh requested too).

@alexfh (Contributor) commented Mar 21, 2025

test.tar.gz (or https://gcc.godbolt.org/z/rfrsc7GaE)

Here is the test case for the compilation time regression. For some reason it is only reproducible with -fPIE. It also looks like the time grows superlinearly with the input size, so a sufficiently large protobuf-generated .cc file can actually compile for hours.

$ ./clang-good -O3 -fPIE -c reduced.ll -o /dev/null -ftime-report
...
===-------------------------------------------------------------------------===
                               Clang time report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.8198 seconds (0.8198 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7694 ( 96.3%)   0.0204 ( 99.2%)   0.7898 ( 96.3%)   0.7898 ( 96.3%)  Machine code generation
   0.0181 (  2.3%)   0.0002 (  0.8%)   0.0183 (  2.2%)   0.0183 (  2.2%)  Optimizer
   0.0117 (  1.5%)   0.0000 (  0.0%)   0.0117 (  1.4%)   0.0117 (  1.4%)  Front end
   0.7992 (100.0%)   0.0206 (100.0%)   0.8198 (100.0%)   0.8198 (100.0%)  Total
$ ./clang-bad -O3 -fPIE -c reduced.ll -o /dev/null -ftime-report
...
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 18.7632 seconds (18.7641 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   6.3520 ( 34.0%)   0.0007 (  0.7%)   6.3527 ( 33.9%)   6.3529 ( 33.9%)  Machine Cycle Info Analysis
   3.5076 ( 18.8%)   0.0000 (  0.0%)   3.5076 ( 18.7%)   3.5078 ( 18.7%)  Control Flow Optimizer
   2.3329 ( 12.5%)   0.0000 (  0.0%)   2.3329 ( 12.4%)   2.3330 ( 12.4%)  Branch Probability Basic Block Placement
   1.7856 (  9.6%)   0.0000 (  0.0%)   1.7856 (  9.5%)   1.7857 (  9.5%)  Check CFA info and insert CFI instructions if needed
   0.9386 (  5.0%)   0.0000 (  0.0%)   0.9386 (  5.0%)   0.9386 (  5.0%)  Machine code sinking
   0.8447 (  4.5%)   0.0000 (  0.0%)   0.8447 (  4.5%)   0.8448 (  4.5%)  ReachingDefAnalysis
   0.3791 (  2.0%)   0.0106 ( 10.4%)   0.3897 (  2.1%)   0.3897 (  2.1%)  Early Tail Duplication
   0.2533 (  1.4%)   0.0000 (  0.0%)   0.2533 (  1.4%)   0.2534 (  1.4%)  Machine Block Frequency Analysis #5
   0.2513 (  1.3%)   0.0000 (  0.0%)   0.2513 (  1.3%)   0.2513 (  1.3%)  Machine Block Frequency Analysis #2
   0.2470 (  1.3%)   0.0000 (  0.0%)   0.2470 (  1.3%)   0.2470 (  1.3%)  Machine Block Frequency Analysis #3
   0.2465 (  1.3%)   0.0000 (  0.0%)   0.2465 (  1.3%)   0.2465 (  1.3%)  Machine Block Frequency Analysis #4
   0.2359 (  1.3%)   0.0079 (  7.7%)   0.2439 (  1.3%)   0.2439 (  1.3%)  Machine Block Frequency Analysis
   0.1544 (  0.8%)   0.0120 ( 11.7%)   0.1664 (  0.9%)   0.1664 (  0.9%)  MachinePostDominator Tree Construction #3
   0.1488 (  0.8%)   0.0120 ( 11.7%)   0.1608 (  0.9%)   0.1608 (  0.9%)  MachinePostDominator Tree Construction
   0.1513 (  0.8%)   0.0040 (  3.9%)   0.1553 (  0.8%)   0.1553 (  0.8%)  MachinePostDominator Tree Construction #2

...
===-------------------------------------------------------------------------===
                               Clang time report
===-------------------------------------------------------------------------===
  Total Execution Time: 18.7990 seconds (18.8000 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  18.6630 ( 99.8%)   0.1024 ( 99.7%)  18.7653 ( 99.8%)  18.7663 ( 99.8%)  Machine code generation
   0.0227 (  0.1%)   0.0003 (  0.3%)   0.0230 (  0.1%)   0.0230 (  0.1%)  Optimizer
   0.0106 (  0.1%)   0.0000 (  0.0%)   0.0106 (  0.1%)   0.0106 (  0.1%)  Front end
  18.6963 (100.0%)   0.1027 (100.0%)  18.7990 (100.0%)  18.8000 (100.0%)  Total

@alexfh (Contributor) commented Mar 21, 2025

I've sent #132431 to revert this, but if there's a good workaround (or one can be implemented quickly and reliably), that would also be an option.

@dianqk (Member Author) commented Mar 21, 2025

The current method for determining computed gotos is inaccurate, and I will send a PR.

@dianqk (Member Author) commented Mar 22, 2025

> I've sent #132431 to revert this, but if there's a good workaround (or one can be implemented quickly and reliably), that would also be an option.

#132536 should fix this.

Successfully merging this pull request may close these issues:

performance regression in clang-19 when using computed goto