
[TailDuplicator] Do not restrict the computed gotos #114990

Merged
merged 2 commits into llvm:main from dianqk:computed-goto on Mar 10, 2025

Conversation

@dianqk (Member) commented Nov 5, 2024

Fixes #106846.

This follows what I learned from GCC: GCC does not tail-duplicate blocks whose indirect jumps go through a jump table. I believe GCC has provided a clear explanation here:

> Duplicate the blocks containing computed gotos. This basically unfactors computed gotos that were factored early on in the compilation process to speed up edge based data flow. We used to not unfactor them again, which can seriously pessimize code with many computed jumps in the source code, such as interpreters.
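
To make "factoring" concrete, here is a minimal GNU C sketch (illustrative only; the opcodes and names are made up, not taken from this PR). Early compilation merges the computed goto at the end of every handler into one shared dispatch block, and tail duplication "unfactors" it by copying the jump back into each handler:

/* Factored form: one shared dispatch block (GNU C computed goto). */
int run(const unsigned char *pc) {
  static void *table[] = { &&op_add, &&op_sub, &&op_halt };
  int acc = 0;
  goto *table[*pc++]; /* initial dispatch */

dispatch: /* the single, factored dispatch block */
  goto *table[*pc++];

op_add:
  acc += 1;
  goto dispatch; /* every handler funnels through one indirect branch */
op_sub:
  acc -= 1;
  goto dispatch;
op_halt:
  return acc;
}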

@llvmbot (Member) commented Nov 5, 2024

@llvm/pr-subscribers-backend-x86

Author: DianQK (DianQK)

Changes

GCC's rationale (quoted in the description above): https://github.com/gcc-mirror/gcc/blob/7f67acf60c5429895d7c9e5df81796753e2913e0/gcc/bb-reorder.cc#L2757-L2761


Full diff: https://github.com/llvm/llvm-project/pull/114990.diff

3 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineInstr.h (+6-2)
  • (modified) llvm/lib/CodeGen/TailDuplicator.cpp (+12-10)
  • (added) llvm/test/CodeGen/X86/tail-dup-computed-goto.mir (+310)
diff --git a/llvm/include/llvm/CodeGen/MachineInstr.h b/llvm/include/llvm/CodeGen/MachineInstr.h
index ead6bbe1d5f641..6d268628ec14b4 100644
--- a/llvm/include/llvm/CodeGen/MachineInstr.h
+++ b/llvm/include/llvm/CodeGen/MachineInstr.h
@@ -986,8 +986,12 @@ class MachineInstr
 
   /// Return true if this is an indirect branch, such as a
   /// branch through a register.
-  bool isIndirectBranch(QueryType Type = AnyInBundle) const {
-    return hasProperty(MCID::IndirectBranch, Type);
+  bool isIndirectBranch(QueryType Type = AnyInBundle,
+                        bool IncludeJumpTable = true) const {
+    return hasProperty(MCID::IndirectBranch, Type) &&
+           (IncludeJumpTable || !llvm::any_of(operands(), [](const auto &Op) {
+              return Op.isJTI();
+            }));
   }
 
   /// Return true if this is a branch which may fall
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 3f2e1511d403a0..988c58beac20f6 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -603,17 +603,19 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
       TailBB.canFallThrough())
     return false;
 
-  // If the target has hardware branch prediction that can handle indirect
-  // branches, duplicating them can often make them predictable when there
-  // are common paths through the code.  The limit needs to be high enough
-  // to allow undoing the effects of tail merging and other optimizations
-  // that rearrange the predecessors of the indirect branch.
-
-  bool HasIndirectbr = false;
+  // Only duplicate the blocks containing computed gotos. This basically
+  // unfactors computed gotos that were factored early on in the compilation
+  // process to speed up edge based data flow. If we do not unfactor them again,
+  // it can seriously pessimize code with many computed jumps in the source
+  // code, such as interpreters.
+  bool HasComputedGoto = false;
   if (!TailBB.empty())
-    HasIndirectbr = TailBB.back().isIndirectBranch();
+    HasComputedGoto = TailBB.back().isIndirectBranch(
+        /*Type=*/MachineInstr::AnyInBundle,
+        // Jump tables are not considered computed gotos.
+        /*IncludeJumpTable=*/false);
 
-  if (HasIndirectbr && PreRegAlloc)
+  if (HasComputedGoto && PreRegAlloc)
     MaxDuplicateCount = TailDupIndirectBranchSize;
 
   // Check the instructions in the block to determine whether tail-duplication
@@ -685,7 +687,7 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
     }
   }
 
-  if (HasIndirectbr && PreRegAlloc)
+  if (HasComputedGoto && PreRegAlloc)
     return true;
 
   if (IsSimple)
diff --git a/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir b/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir
new file mode 100644
index 00000000000000..b1c699c11f4619
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir
@@ -0,0 +1,310 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=x86_64-unknown-linux-gnu -run-pass=early-tailduplication -tail-dup-size=0 %s -o - | FileCheck %s
+# Check that only the computed goto is duplicated.
+--- |
+  declare i64 @f0()
+  declare i64 @f1()
+  declare i64 @f2()
+  declare i64 @f3()
+  declare i64 @f4()
+  declare i64 @f5()
+  @computed_goto.dispatch = external global [5 x ptr]
+  define void @computed_goto() { ret void }
+  define void @jump_table() { ret void }
+...
+---
+name:            computed_goto
+alignment:       16
+tracksRegLiveness: true
+noPhis:          false
+isSSA:           true
+noVRegs:         false
+hasFakeUses:     false
+debugInstrRef:   true
+registers:
+  - { id: 0, class: gr64 }
+  - { id: 1, class: gr64 }
+  - { id: 2, class: gr64 }
+  - { id: 3, class: gr64 }
+  - { id: 4, class: gr64 }
+  - { id: 5, class: gr64_nosp }
+  - { id: 6, class: gr64 }
+  - { id: 7, class: gr64 }
+  - { id: 8, class: gr64 }
+  - { id: 9, class: gr64 }
+  - { id: 10, class: gr64 }
+frameInfo:
+  maxAlignment:    1
+  adjustsStack:    true
+  hasCalls:        true
+machineFunctionInfo:
+  amxProgModel:    None
+body:             |
+  ; CHECK-LABEL: name: computed_goto
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY]]
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY1]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:gr64_nosp = COPY [[COPY3]]
+  ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:gr64_nosp = COPY [[COPY4]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY4]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:gr64_nosp = COPY [[COPY6]]
+  ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:gr64_nosp = COPY [[COPY7]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY7]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY9:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY10:%[0-9]+]]:gr64_nosp = COPY [[COPY9]]
+  ; CHECK-NEXT:   [[COPY11:%[0-9]+]]:gr64_nosp = COPY [[COPY10]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY10]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY12:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY13:%[0-9]+]]:gr64_nosp = COPY [[COPY12]]
+  ; CHECK-NEXT:   [[COPY14:%[0-9]+]]:gr64_nosp = COPY [[COPY13]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY13]], @computed_goto.dispatch, $noreg
+  bb.0:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %6:gr64 = COPY $rax
+    %0:gr64 = COPY %6
+    JMP_1 %bb.5
+
+  bb.1:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %10:gr64 = COPY $rax
+    %1:gr64 = COPY %10
+    JMP_1 %bb.5
+
+  bb.2:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %9:gr64 = COPY $rax
+    %2:gr64 = COPY %9
+    JMP_1 %bb.5
+
+  bb.3:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %8:gr64 = COPY $rax
+    %3:gr64 = COPY %8
+    JMP_1 %bb.5
+
+  bb.4:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %7:gr64 = COPY $rax
+    %4:gr64 = COPY %7
+
+  bb.5:
+    successors: %bb.1, %bb.2, %bb.3, %bb.4
+
+    %5:gr64_nosp = PHI %0, %bb.0, %4, %bb.4, %3, %bb.3, %2, %bb.2, %1, %bb.1
+    JMP64m $noreg, 8, %5, @computed_goto.dispatch, $noreg
+
+...
+---
+name:            jump_table
+alignment:       16
+tracksRegLiveness: true
+noPhis:          false
+isSSA:           true
+noVRegs:         false
+hasFakeUses:     false
+debugInstrRef:   true
+registers:
+  - { id: 0, class: gr64 }
+  - { id: 1, class: gr64 }
+  - { id: 2, class: gr64 }
+  - { id: 3, class: gr64 }
+  - { id: 4, class: gr64 }
+  - { id: 5, class: gr64 }
+  - { id: 6, class: gr64 }
+  - { id: 7, class: gr64 }
+  - { id: 8, class: gr64_nosp }
+  - { id: 9, class: gr64 }
+  - { id: 10, class: gr64 }
+  - { id: 11, class: gr64 }
+  - { id: 12, class: gr64 }
+  - { id: 13, class: gr64 }
+frameInfo:
+  maxAlignment:    1
+  adjustsStack:    true
+  hasCalls:        true
+machineFunctionInfo:
+  amxProgModel:    None
+jumpTable:
+  kind:            block-address
+  entries:
+    - id:              0
+      blocks:          [ '%bb.2', '%bb.3', '%bb.4', '%bb.5', '%bb.6' ]
+body:             |
+  ; CHECK-LABEL: name: jump_table
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gr64 = COPY [[COPY]]
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.2(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:gr64 = PHI [[COPY1]], %bb.0, %6, %bb.7, %5, %bb.6, %4, %bb.5, %3, %bb.4, %2, %bb.3
+  ; CHECK-NEXT:   [[DEC64r:%[0-9]+]]:gr64_nosp = DEC64r [[PHI]], implicit-def dead $eflags
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.3(0x1999999a), %bb.4(0x1999999a), %bb.5(0x1999999a), %bb.6(0x1999999a), %bb.7(0x1999999a)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[DEC64r]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gr64 = COPY [[COPY2]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:gr64 = COPY [[COPY4]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.5:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:gr64 = COPY [[COPY6]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.6:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY9:%[0-9]+]]:gr64 = COPY [[COPY8]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.7:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f5, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY10:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY11:%[0-9]+]]:gr64 = COPY [[COPY10]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.8:
+  bb.0:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %7:gr64 = COPY $rax
+    %0:gr64 = COPY %7
+
+  bb.1:
+    %1:gr64 = PHI %0, %bb.0, %6, %bb.6, %5, %bb.5, %4, %bb.4, %3, %bb.3, %2, %bb.2
+    %8:gr64_nosp = DEC64r %1, implicit-def dead $eflags
+
+  bb.8:
+    successors: %bb.2(0x1999999a), %bb.3(0x1999999a), %bb.4(0x1999999a), %bb.5(0x1999999a), %bb.6(0x1999999a)
+
+    JMP64m $noreg, 8, %8, %jump-table.0, $noreg :: (load (s64) from jump-table)
+
+  bb.2:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %13:gr64 = COPY $rax
+    %2:gr64 = COPY %13
+    JMP_1 %bb.1
+
+  bb.3:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %12:gr64 = COPY $rax
+    %3:gr64 = COPY %12
+    JMP_1 %bb.1
+
+  bb.4:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %11:gr64 = COPY $rax
+    %4:gr64 = COPY %11
+    JMP_1 %bb.1
+
+  bb.5:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %10:gr64 = COPY $rax
+    %5:gr64 = COPY %10
+    JMP_1 %bb.1
+
+  bb.6:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f5, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %9:gr64 = COPY $rax
+    %6:gr64 = COPY %9
+    JMP_1 %bb.1
+
+  bb.7:
+
+...

@dianqk (Member Author) commented Nov 5, 2024

I will revert #78582 after this PR is merged.

@efriedma-quic (Collaborator) commented

Please don't link to gcc source code in commit messages.

I think it might make sense to try to pursue a more general solution here... can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something. But special-casing indirectbr seems okay short-term.

@dianqk (Member Author) commented Nov 6, 2024

> Please don't link to gcc source code in commit messages.

Removed.

> I think it might make sense to try to pursue a more general solution here... can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something. But special-casing indirectbr seems okay short-term.

I think the current IR already looks the way you want: https://llvm.godbolt.org/z/o5YT191P9.
There’s only a single merged indirectbr here.

@efriedma-quic (Collaborator) commented

> I think the current IR already looks the way you want: https://llvm.godbolt.org/z/o5YT191P9.
> There’s only a single merged indirectbr here.

Currently, taildup means we end up with O(N^2) edges, even in the indirectbr case. This patch is just deciding that we're willing to pay that cost in the indirectbr case, because it's a strong hint that the code is important for performance. The question is, can we rework the backend representation to allow taildup without the explosion in the number of edges?
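
For a concrete sense of scale (my arithmetic, not numbers from the thread): with one factored dispatch block, an interpreter with n handlers has n edges into the dispatch block and n edges out of it; once the dispatch jump is duplicated into every handler, each of the n handlers may branch to all n handlers.

/* Back-of-the-envelope CFG edge counts, assuming every handler may jump to
   every handler. For n = 200 handlers this is 400 edges factored vs. 40,000
   duplicated. */
unsigned edges_factored(unsigned n)   { return 2 * n; } /* n in + n out of dispatch */
unsigned edges_duplicated(unsigned n) { return n * n; } /* each handler -> all n */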

@dianqk (Member Author) commented Nov 8, 2024

> I think the current IR already looks the way you want: https://llvm.godbolt.org/z/o5YT191P9.
> There’s only a single merged indirectbr here.
>
> Currently, taildup means we end up with O(N^2) edges, even in the indirectbr case. This patch is just deciding that we're willing to pay that cost in the indirectbr case, because it's a strong hint that the code is important for performance. The question is, can we rework the backend representation to allow taildup without the explosion in the number of edges?

Sounds reasonable to me. I can try to address it, although progress may be slow.

https://www.ajla-lang.cz/tutorial.html points out several issues:

> It is recommended to use gcc - the compilation will take several minutes and it will consume about 4GB memory when compiling the files ipret.c and ipretc.c. Compilation with clang works, but it is very slow, it may take an hour or more to compile the files ipret.c and ipretc.c.

@dianqk requested a review from fhahn November 19, 2024 01:29
@dianqk marked this pull request as draft November 19, 2024 01:30
@dianqk marked this pull request as ready for review November 21, 2024 04:00
@dianqk (Member Author) commented Dec 3, 2024

ping :p

@bzEq (Collaborator) commented Dec 3, 2024

> can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something.

The idea sounds good. It could be outlined as:

bb.0:
  succ: bb.0, bb.1, bb.2
  indirectbr %0
bb.1:
  succ: bb.0, bb.1, bb.2
  indirectbr %1
bb.2:
  succ: bb.0, bb.1, bb.2
  indirectbr %2

=>

bb.0:
  succ: bb.3
  indirectbr %0
bb.1:
  succ: bb.3
  indirectbr %1
bb.2:
  succ: bb.3
  indirectbr %2
bb.3:
  succ: bb.0, bb.1, bb.2
  %3 = phi(%0, %1, %2)
  PSEUDO_INDIRECT_BR %3
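
If I read the sketch right (my interpretation, not stated above): every block still ends in its own indirect branch as far as execution and prediction are concerned, but its CFG successor list collapses to the single fake block bb.3, and only bb.3 carries the full target list, so the edge count stays linear:

/* Hypothetical edge count under the PSEUDO_INDIRECT_BR scheme: n blocks each
   contribute one edge into the fake block, which fans out to the n targets. */
unsigned edges_with_fake_block(unsigned n) { return n + n; }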

@dianqk (Member Author) commented Dec 17, 2024

> can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something.
>
> The idea sounds good. It could be outlined as [the PSEUDO_INDIRECT_BR sketch above].

I don't fully understand the details of this improvement yet, but it shouldn't block this PR.

So... ping

@fhahn (Contributor) commented Jan 22, 2025

Just getting back to this, sorry! Is this the latest version of the patch that should fix the regressions?

I tried the patch when building Python on macOS: it improves performance by ~0.5%, while #116072 increases performance by 2-3%.

@dianqk (Member Author) commented Jan 23, 2025

> Just getting back to this, sorry! Is this the latest version of the patch that should fix the regressions?
>
> I tried the patch when building Python on macOS: it improves performance by ~0.5%, while #116072 increases performance by 2-3%.

I don't have a CPU like the i7-2640M, but I can observe a significant decrease in instructions:u from the perf stat output.

perf stat -r 3 ./ajla --nosave loop.ajla 1000000000
    98,475,825,975      instructions:u                   #    2.95  insn per cycle
perf stat -r 3 ./ajla --nosave loop.ajla 1000000000
    77,379,026,755      instructions:u                   #    2.09  insn per cycle

This should help CPUs with weaker branch prediction capabilities.

I will rebase this PR after #116072 lands.
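
For intuition about the branch-prediction claim (an illustrative GNU C sketch mirroring the factored example earlier in the thread, not code from this PR): after tail duplication each handler ends in its own computed goto, so a predictor indexed by branch address can learn per-handler target patterns, e.g. which opcode typically follows which.

/* Unfactored form after tail duplication (illustrative; made-up opcodes).
   Each `goto *table[*pc++];` below is a distinct branch site with its own
   prediction history, instead of one site shared by all handlers. */
int run_unfactored(const unsigned char *pc) {
  static void *table[] = { &&op_add, &&op_sub, &&op_halt };
  int acc = 0;
  goto *table[*pc++];

op_add:
  acc += 1;
  goto *table[*pc++];
op_sub:
  acc -= 1;
  goto *table[*pc++];
op_halt:
  return acc;
}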

@dianqk changed the title from "[TailDuplicator] Only duplicate the blocks containing computed gotos" to "[TailDuplicator] Do not restrict the computed gotos" on Jan 24, 2025
@dianqk (Member Author) commented Jan 24, 2025

Rebased.
cc @fhahn

@dianqk (Member Author) commented Mar 6, 2025

Ping ~ (From #106846 (comment))

indygreg added a commit to indygreg/toolchain-tools that referenced this pull request Mar 8, 2025
The vendored patches were produced from the latest versions of the
following PRs:

* llvm/llvm-project#114990
* llvm/llvm-project#120267

The first improves codegen for computed gotos. There was a
regression in LLVM 19 causing a ~10% performance drop in CPython.

The second enables BOLT to work with computed gotos. This enables
BOLT to accomplish more on CPython.
@dianqk (Member Author) commented Mar 10, 2025

@dtcxzyw @nikic @fhahn

Ping, I believe this issue is quietly and broadly affecting interpreters implemented using computed gotos. Compared to Clang 18, this shouldn't introduce new regressions, and I think we can move a bit more quickly on this.

@nikic (Contributor) left a comment

LGTM

@dianqk merged commit dd21aac into llvm:main Mar 10, 2025
8 checks passed
@dianqk deleted the computed-goto branch March 10, 2025 11:34
swift-ci pushed a commit to swiftlang/llvm-project that referenced this pull request Mar 11, 2025
Fixes llvm#106846. [...]

(cherry picked from commit dd21aac)
@bgra8 (Contributor) commented Mar 20, 2025

We (at google) have bisected a huge increase in memory use during compilation (for some specific source files) at this revision.

The previous compiler version builds those files using less than 4GB of memory, while at this revision the compiler exceeds 16GB (we didn't try larger memory limits).

Specifically, the clang process eventually runs out of memory when executed in a shell with ulimit -Sd 16291456.

The crashing stack due to out of memory shows:

1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module '<module name redacted>'.
4.      Running pass 'Early Tail Duplication' on function '@<function name redacted>'

Working on a reproducer.

LATER EDIT: before this change the compilation used under 500MB of memory. So the memory blowout is considerably larger for these cases.

@dianqk (Member Author) commented Mar 21, 2025

> We (at google) have bisected a huge increase in memory use during compilation (for some specific source files) at this revision. [...]

Could you also try Clang 18?

@bgra8 (Contributor) commented Mar 21, 2025

> Could you also try Clang 18?

Tried at c416b2e; the non-reduced test case does not exceed 512MB of memory.

@dianqk (Member Author) commented Mar 21, 2025

> Could you also try Clang 18?
>
> Tried at c416b2e; the non-reduced test case does not exceed 512MB of memory.

Could you reproduce this with -mllvm -tail-dup-pred-size=100000 before this PR? Since this PR effectively just re-enables an existing option, I think you should bisect again. Thanks!

@bgra8 (Contributor) commented Mar 21, 2025

> Could you reproduce this with -mllvm -tail-dup-pred-size=100000 before this PR?

Yes, it reproduces with that setting. So the issue is most likely caused by this PR's bypassing of the tail-dup-pred-size check for computed gotos, and this PR is the correct culprit for what we're seeing.

It was very likely a known fact that tail-dup-pred-size can cause this kind of issue, which is why it was given a finite default (for the case I'm looking at, I confirmed that 128 is the value of this parameter at which the compiler's memory usage grows significantly beyond 512MB).

So bypassing the tail-dup-pred-size check opens the door to the same class of issues the option was meant to control.

Can you please add another flag to control this bypass for computed gotos?

@dianqk (Member Author) commented Mar 21, 2025

> It was very likely a known fact that tail-dup-pred-size can cause this kind of issue, which is why it was given a finite default.

The PR that landed in LLVM 19 was also sent by me; I overlooked the difference between jump tables and computed gotos in that PR. So I think this regression is caused by some changes between these two PRs.

> So bypassing the tail-dup-pred-size check opens the door to the same class of issues the option was meant to control.
>
> Can you please add another flag to control this bypass for computed gotos?

As I mentioned in the PR description, the key difference is that for computed gotos we want duplication, to preserve the code's original intent and performance.

@alexfh (Contributor) commented Mar 21, 2025

We've found another problem after this commit: the compile time for a number of translation units (protobuf-generated, which means a lot of large switch statements) grew from under 5 seconds to "I have yet to see it finish" (it's been running for more than an hour at this point). The issue seems to be rather widespread in our codebase. So even if the actual problem is elsewhere (which may well be the case, given the analysis here), this commit unfortunately triggers it without any good workaround AFAIU.

I'm working on a shareable reproducer now, but it looks like we need a revert or a workaround (different from disabling the optimization completely, which is not an option) soon. It could take the shape of keeping the old behavior under a command-line option, for example.

If it helps, stack traces captured during the never-ending compilation look mostly like this:

  * frame #0: 0x000055555c32fa0c clang`hasSameSuccessors(llvm::MachineBasicBlock&, llvm::SmallPtrSetImpl<llvm::MachineBasicBlock const*>&) [inlined] llvm::SmallPtrSetImplBase::contains_imp(this=0x00007fffffff3710, Ptr=<unavailable>) const at SmallPtrSet.h:230:24
    frame #1: 0x000055555c32f9d3 clang`hasSameSuccessors(llvm::MachineBasicBlock&, llvm::SmallPtrSetImpl<llvm::MachineBasicBlock const*>&) [inlined] llvm::SmallPtrSetImpl<llvm::MachineBasicBlock const*>::count(this=0x00007fffffff3710, Ptr=<unavailable>) const at SmallPtrSet.h:453:12
    frame #2: 0x000055555c32f9d3 clang`hasSameSuccessors(BB=0x0000505fbb0b3a20, Successors=0x00007fffffff3710) at MachineBlockPlacement.cpp:827:21
    frame #3: 0x000055555c32e70a clang`(anonymous namespace)::MachineBlockPlacement::canTailDuplicateUnplacedPreds(this=0x00007fffffff3f80, BB=0x0000505fbc019120, Succ=0x0000505fbcca0fc0, Chain=0x0000505f8b25e000, BlockFilter=<unavailable>) at MachineBlockPlacement.cpp:1220:36
    frame #4: 0x000055555c32cbb1 clang`(anonymous namespace)::MachineBlockPlacement::buildChain(llvm::MachineBasicBlock const*, (anonymous namespace)::BlockChain&, llvm::SmallSetVector<llvm::MachineBasicBlock const*, 16u>*) [inlined] (anonymous namespace)::MachineBlockPlacement::selectBestSuccessor(this=0x00007fffffff3f80, BB=0x0000505fbc019120, Chain=0x0000505f8b25e000, BlockFilter=0x0000000000000000) at MachineBlockPlacement.cpp:1729:9
    frame #5: 0x000055555c32cb32 clang`(anonymous namespace)::MachineBlockPlacement::buildChain(this=0x00007fffffff3f80, HeadBB=0x0000505fbbf2d5e8, Chain=0x0000505f8b25e000, BlockFilter=0x0000000000000000) at MachineBlockPlacement.cpp:1918:19
    frame #6: 0x000055555c327539 clang`(anonymous namespace)::MachineBlockPlacement::buildCFGChains(this=0x00007fffffff3f80) at MachineBlockPlacement.cpp:2826:3
    frame #7: 0x000055555c324328 clang`(anonymous namespace)::MachineBlockPlacement::run(this=0x00007fffffff3f80, MF=0x0000505fbd572d00) at MachineBlockPlacement.cpp:3606:5
    frame #8: 0x000055555c326ee7 clang`(anonymous namespace)::MachineBlockPlacementLegacy::runOnMachineFunction(this=0x0000505fbe193dc0, MF=0x0000505fbd572d00) at MachineBlockPlacement.cpp:657:10
    frame #9: 0x000055555c377fcc clang`llvm::MachineFunctionPass::runOnFunction(this=0x0000505fbe193dc0, F=0x0000505fbe929ef8) at MachineFunctionPass.cpp:108:10
    frame #10: 0x000055555d818135 clang`llvm::FPPassManager::runOnFunction(this=0x0000505fbfe08380, F=0x0000505fbe929ef8) at LegacyPassManager.cpp:1406:27
    frame #11: 0x000055555d81ef7d clang`llvm::FPPassManager::runOnModule(this=0x0000505fbfe08380, M=<unavailable>) at LegacyPassManager.cpp:1452:16

@bgra8 (Contributor) commented Mar 21, 2025

Here's the reproducer for the OOM issue: repro.cc

With clang built before this PR, run this to show that the compilation needs no more than 150MB of memory:

$ (ulimit -Sd 150000;  \
   clang.before -cc1 -triple x86_64-generic-linux-gnu -emit-obj -target-cpu x86-64 \
     -O1 -std=gnu++20 \
     -o /tmp/repro.o  \
     /tmp/repro.cc)

With clang at this PR, run this to show that the compilation exceeds 600MB of memory:

$ (ulimit -Sd 650000;  \
   clang -cc1 -triple x86_64-generic-linux-gnu -emit-obj -target-cpu x86-64 \
     -O1 -std=gnu++20 \
     -o /tmp/repro.o \
     /tmp/repro.cc)

Please revert or add a way to avoid this issue (basically what @alexfh requested too).

@alexfh (Contributor) commented Mar 21, 2025

test.tar.gz (or https://gcc.godbolt.org/z/rfrsc7GaE)

Here is the test case for the compilation time regression. For some reason it is only reproducible with -fPIE. It also looks like the time grows superlinearly with the input size, so a sufficiently large protobuf-generated .cc file can actually compile for hours.

$ ./clang-good -O3 -fPIE -c reduced.ll -o /dev/null -ftime-report
...
===-------------------------------------------------------------------------===
                               Clang time report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.8198 seconds (0.8198 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7694 ( 96.3%)   0.0204 ( 99.2%)   0.7898 ( 96.3%)   0.7898 ( 96.3%)  Machine code generation
   0.0181 (  2.3%)   0.0002 (  0.8%)   0.0183 (  2.2%)   0.0183 (  2.2%)  Optimizer
   0.0117 (  1.5%)   0.0000 (  0.0%)   0.0117 (  1.4%)   0.0117 (  1.4%)  Front end
   0.7992 (100.0%)   0.0206 (100.0%)   0.8198 (100.0%)   0.8198 (100.0%)  Total
$ ./clang-bad -O3 -fPIE -c reduced.ll -o /dev/null -ftime-report
...
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 18.7632 seconds (18.7641 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   6.3520 ( 34.0%)   0.0007 (  0.7%)   6.3527 ( 33.9%)   6.3529 ( 33.9%)  Machine Cycle Info Analysis
   3.5076 ( 18.8%)   0.0000 (  0.0%)   3.5076 ( 18.7%)   3.5078 ( 18.7%)  Control Flow Optimizer
   2.3329 ( 12.5%)   0.0000 (  0.0%)   2.3329 ( 12.4%)   2.3330 ( 12.4%)  Branch Probability Basic Block Placement
   1.7856 (  9.6%)   0.0000 (  0.0%)   1.7856 (  9.5%)   1.7857 (  9.5%)  Check CFA info and insert CFI instructions if needed
   0.9386 (  5.0%)   0.0000 (  0.0%)   0.9386 (  5.0%)   0.9386 (  5.0%)  Machine code sinking
   0.8447 (  4.5%)   0.0000 (  0.0%)   0.8447 (  4.5%)   0.8448 (  4.5%)  ReachingDefAnalysis
   0.3791 (  2.0%)   0.0106 ( 10.4%)   0.3897 (  2.1%)   0.3897 (  2.1%)  Early Tail Duplication
   0.2533 (  1.4%)   0.0000 (  0.0%)   0.2533 (  1.4%)   0.2534 (  1.4%)  Machine Block Frequency Analysis #5
   0.2513 (  1.3%)   0.0000 (  0.0%)   0.2513 (  1.3%)   0.2513 (  1.3%)  Machine Block Frequency Analysis #2
   0.2470 (  1.3%)   0.0000 (  0.0%)   0.2470 (  1.3%)   0.2470 (  1.3%)  Machine Block Frequency Analysis #3
   0.2465 (  1.3%)   0.0000 (  0.0%)   0.2465 (  1.3%)   0.2465 (  1.3%)  Machine Block Frequency Analysis #4
   0.2359 (  1.3%)   0.0079 (  7.7%)   0.2439 (  1.3%)   0.2439 (  1.3%)  Machine Block Frequency Analysis
   0.1544 (  0.8%)   0.0120 ( 11.7%)   0.1664 (  0.9%)   0.1664 (  0.9%)  MachinePostDominator Tree Construction #3
   0.1488 (  0.8%)   0.0120 ( 11.7%)   0.1608 (  0.9%)   0.1608 (  0.9%)  MachinePostDominator Tree Construction
   0.1513 (  0.8%)   0.0040 (  3.9%)   0.1553 (  0.8%)   0.1553 (  0.8%)  MachinePostDominator Tree Construction #2

...
===-------------------------------------------------------------------------===
                               Clang time report
===-------------------------------------------------------------------------===
  Total Execution Time: 18.7990 seconds (18.8000 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  18.6630 ( 99.8%)   0.1024 ( 99.7%)  18.7653 ( 99.8%)  18.7663 ( 99.8%)  Machine code generation
   0.0227 (  0.1%)   0.0003 (  0.3%)   0.0230 (  0.1%)   0.0230 (  0.1%)  Optimizer
   0.0106 (  0.1%)   0.0000 (  0.0%)   0.0106 (  0.1%)   0.0106 (  0.1%)  Front end
  18.6963 (100.0%)   0.1027 (100.0%)  18.7990 (100.0%)  18.8000 (100.0%)  Total

@alexfh (Contributor) commented Mar 21, 2025

I've sent #132431 to revert this, but if there's a good workaround (or one can be implemented quickly and reliably), that would also be an option.

@dianqk (Member Author) commented Mar 21, 2025

The current method for determining computed gotos is inaccurate, and I will send a PR.

@dianqk (Member Author) commented Mar 22, 2025

> I've sent #132431 to revert this, but if there's a good workaround (or one can be implemented quickly and reliably), that would also be an option.

#132536 should fix this.

Successfully merging this pull request may close these issues:

performance regression in clang-19 when using computed goto