[TailDuplicator] Do not restrict the computed gotos #114990
Conversation
@llvm/pr-subscribers-backend-x86
Author: DianQK (DianQK)
Changes:
Fixes #106846. This is what I learned from GCC: GCC does not duplicate a BB whose indirect jump goes through a jump table. I believe GCC has provided a clear explanation here:
> Duplicate the blocks containing computed gotos. This basically unfactors computed gotos that were factored early on in the compilation process to speed up edge based data flow. We used to not unfactor them again, which can seriously pessimize code with many computed jumps in the source code, such as interpreters.
Full diff: https://github.com/llvm/llvm-project/pull/114990.diff
3 Files Affected:
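For context, a computed goto is the GNU `&&label` extension, most commonly used for interpreter dispatch loops. A minimal sketch of the pattern this PR targets (hypothetical opcodes, not taken from the PR; compiles with GCC/Clang):

// Minimal threaded-interpreter sketch using the GNU computed-goto
// extension (&&label). Each handler ends with its own indirect jump
// through the dispatch table; this per-handler branch is what early
// tail duplication wants to copy into every handler, instead of
// funneling all handlers through one shared dispatch block.
#include <cstdio>

static int run(const unsigned char *code) {
  static void *dispatch[] = {&&op_push, &&op_double, &&op_halt};
  int acc = 0;
  goto *dispatch[*code++];
op_push:                         // hypothetical opcode 0: add immediate
  acc += *code++;
  goto *dispatch[*code++];
op_double:                       // hypothetical opcode 1: double acc
  acc *= 2;
  goto *dispatch[*code++];
op_halt:                         // hypothetical opcode 2: stop
  return acc;
}

int main() {
  const unsigned char prog[] = {0, 5, 1, 2}; // push 5; double; halt
  std::printf("%d\n", run(prog));            // prints 10
}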
diff --git a/llvm/include/llvm/CodeGen/MachineInstr.h b/llvm/include/llvm/CodeGen/MachineInstr.h
index ead6bbe1d5f641..6d268628ec14b4 100644
--- a/llvm/include/llvm/CodeGen/MachineInstr.h
+++ b/llvm/include/llvm/CodeGen/MachineInstr.h
@@ -986,8 +986,12 @@ class MachineInstr
/// Return true if this is an indirect branch, such as a
/// branch through a register.
- bool isIndirectBranch(QueryType Type = AnyInBundle) const {
- return hasProperty(MCID::IndirectBranch, Type);
+ bool isIndirectBranch(QueryType Type = AnyInBundle,
+ bool IncludeJumpTable = true) const {
+ return hasProperty(MCID::IndirectBranch, Type) &&
+ (IncludeJumpTable || !llvm::any_of(operands(), [](const auto &Op) {
+ return Op.isJTI();
+ }));
}
/// Return true if this is a branch which may fall
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 3f2e1511d403a0..988c58beac20f6 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -603,17 +603,19 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
TailBB.canFallThrough())
return false;
- // If the target has hardware branch prediction that can handle indirect
- // branches, duplicating them can often make them predictable when there
- // are common paths through the code. The limit needs to be high enough
- // to allow undoing the effects of tail merging and other optimizations
- // that rearrange the predecessors of the indirect branch.
-
- bool HasIndirectbr = false;
+ // Only duplicate the blocks containing computed gotos. This basically
+ // unfactors computed gotos that were factored early on in the compilation
+ // process to speed up edge based data flow. If we do not unfactor them again,
+ // it can seriously pessimize code with many computed jumps in the source
+ // code, such as interpreters.
+ bool HasComputedGoto = false;
if (!TailBB.empty())
- HasIndirectbr = TailBB.back().isIndirectBranch();
+ HasComputedGoto = TailBB.back().isIndirectBranch(
+ /*Type=*/MachineInstr::AnyInBundle,
+ // Jump tables are not considered computed gotos.
+ /*IncludeJumpTable=*/false);
- if (HasIndirectbr && PreRegAlloc)
+ if (HasComputedGoto && PreRegAlloc)
MaxDuplicateCount = TailDupIndirectBranchSize;
// Check the instructions in the block to determine whether tail-duplication
@@ -685,7 +687,7 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
}
}
- if (HasIndirectbr && PreRegAlloc)
+ if (HasComputedGoto && PreRegAlloc)
return true;
if (IsSimple)
diff --git a/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir b/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir
new file mode 100644
index 00000000000000..b1c699c11f4619
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir
@@ -0,0 +1,310 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=x86_64-unknown-linux-gnu -run-pass=early-tailduplication -tail-dup-size=0 %s -o - | FileCheck %s
+# Check that only the computed goto is duplicated.
+--- |
+ declare i64 @f0()
+ declare i64 @f1()
+ declare i64 @f2()
+ declare i64 @f3()
+ declare i64 @f4()
+ declare i64 @f5()
+ @computed_goto.dispatch = external global [5 x ptr]
+ define void @computed_goto() { ret void }
+ define void @jump_table() { ret void }
+...
+---
+name: computed_goto
+alignment: 16
+tracksRegLiveness: true
+noPhis: false
+isSSA: true
+noVRegs: false
+hasFakeUses: false
+debugInstrRef: true
+registers:
+ - { id: 0, class: gr64 }
+ - { id: 1, class: gr64 }
+ - { id: 2, class: gr64 }
+ - { id: 3, class: gr64 }
+ - { id: 4, class: gr64 }
+ - { id: 5, class: gr64_nosp }
+ - { id: 6, class: gr64 }
+ - { id: 7, class: gr64 }
+ - { id: 8, class: gr64 }
+ - { id: 9, class: gr64 }
+ - { id: 10, class: gr64 }
+frameInfo:
+ maxAlignment: 1
+ adjustsStack: true
+ hasCalls: true
+machineFunctionInfo:
+ amxProgModel: None
+body: |
+ ; CHECK-LABEL: name: computed_goto
+ ; CHECK: bb.0:
+ ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY]]
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]]
+ ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY1]], @computed_goto.dispatch, $noreg
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.1:
+ ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gr64_nosp = COPY [[COPY3]]
+ ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gr64_nosp = COPY [[COPY4]]
+ ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY4]], @computed_goto.dispatch, $noreg
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.2:
+ ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY6:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY7:%[0-9]+]]:gr64_nosp = COPY [[COPY6]]
+ ; CHECK-NEXT: [[COPY8:%[0-9]+]]:gr64_nosp = COPY [[COPY7]]
+ ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY7]], @computed_goto.dispatch, $noreg
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.3:
+ ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY9:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY10:%[0-9]+]]:gr64_nosp = COPY [[COPY9]]
+ ; CHECK-NEXT: [[COPY11:%[0-9]+]]:gr64_nosp = COPY [[COPY10]]
+ ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY10]], @computed_goto.dispatch, $noreg
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.4:
+ ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY12:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY13:%[0-9]+]]:gr64_nosp = COPY [[COPY12]]
+ ; CHECK-NEXT: [[COPY14:%[0-9]+]]:gr64_nosp = COPY [[COPY13]]
+ ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY13]], @computed_goto.dispatch, $noreg
+ bb.0:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %6:gr64 = COPY $rax
+ %0:gr64 = COPY %6
+ JMP_1 %bb.5
+
+ bb.1:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %10:gr64 = COPY $rax
+ %1:gr64 = COPY %10
+ JMP_1 %bb.5
+
+ bb.2:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %9:gr64 = COPY $rax
+ %2:gr64 = COPY %9
+ JMP_1 %bb.5
+
+ bb.3:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %8:gr64 = COPY $rax
+ %3:gr64 = COPY %8
+ JMP_1 %bb.5
+
+ bb.4:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %7:gr64 = COPY $rax
+ %4:gr64 = COPY %7
+
+ bb.5:
+ successors: %bb.1, %bb.2, %bb.3, %bb.4
+
+ %5:gr64_nosp = PHI %0, %bb.0, %4, %bb.4, %3, %bb.3, %2, %bb.2, %1, %bb.1
+ JMP64m $noreg, 8, %5, @computed_goto.dispatch, $noreg
+
+...
+---
+name: jump_table
+alignment: 16
+tracksRegLiveness: true
+noPhis: false
+isSSA: true
+noVRegs: false
+hasFakeUses: false
+debugInstrRef: true
+registers:
+ - { id: 0, class: gr64 }
+ - { id: 1, class: gr64 }
+ - { id: 2, class: gr64 }
+ - { id: 3, class: gr64 }
+ - { id: 4, class: gr64 }
+ - { id: 5, class: gr64 }
+ - { id: 6, class: gr64 }
+ - { id: 7, class: gr64 }
+ - { id: 8, class: gr64_nosp }
+ - { id: 9, class: gr64 }
+ - { id: 10, class: gr64 }
+ - { id: 11, class: gr64 }
+ - { id: 12, class: gr64 }
+ - { id: 13, class: gr64 }
+frameInfo:
+ maxAlignment: 1
+ adjustsStack: true
+ hasCalls: true
+machineFunctionInfo:
+ amxProgModel: None
+jumpTable:
+ kind: block-address
+ entries:
+ - id: 0
+ blocks: [ '%bb.2', '%bb.3', '%bb.4', '%bb.5', '%bb.6' ]
+body: |
+ ; CHECK-LABEL: name: jump_table
+ ; CHECK: bb.0:
+ ; CHECK-NEXT: successors: %bb.1(0x80000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64 = COPY [[COPY]]
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.1:
+ ; CHECK-NEXT: successors: %bb.2(0x80000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[PHI:%[0-9]+]]:gr64 = PHI [[COPY1]], %bb.0, %6, %bb.7, %5, %bb.6, %4, %bb.5, %3, %bb.4, %2, %bb.3
+ ; CHECK-NEXT: [[DEC64r:%[0-9]+]]:gr64_nosp = DEC64r [[PHI]], implicit-def dead $eflags
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.2:
+ ; CHECK-NEXT: successors: %bb.3(0x1999999a), %bb.4(0x1999999a), %bb.5(0x1999999a), %bb.6(0x1999999a), %bb.7(0x1999999a)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: JMP64m $noreg, 8, [[DEC64r]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.3:
+ ; CHECK-NEXT: successors: %bb.1(0x80000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gr64 = COPY [[COPY2]]
+ ; CHECK-NEXT: JMP_1 %bb.1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.4:
+ ; CHECK-NEXT: successors: %bb.1(0x80000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gr64 = COPY [[COPY4]]
+ ; CHECK-NEXT: JMP_1 %bb.1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.5:
+ ; CHECK-NEXT: successors: %bb.1(0x80000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY6:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY7:%[0-9]+]]:gr64 = COPY [[COPY6]]
+ ; CHECK-NEXT: JMP_1 %bb.1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.6:
+ ; CHECK-NEXT: successors: %bb.1(0x80000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY8:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY9:%[0-9]+]]:gr64 = COPY [[COPY8]]
+ ; CHECK-NEXT: JMP_1 %bb.1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.7:
+ ; CHECK-NEXT: successors: %bb.1(0x80000000)
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f5, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ ; CHECK-NEXT: [[COPY10:%[0-9]+]]:gr64 = COPY $rax
+ ; CHECK-NEXT: [[COPY11:%[0-9]+]]:gr64 = COPY [[COPY10]]
+ ; CHECK-NEXT: JMP_1 %bb.1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: bb.8:
+ bb.0:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %7:gr64 = COPY $rax
+ %0:gr64 = COPY %7
+
+ bb.1:
+ %1:gr64 = PHI %0, %bb.0, %6, %bb.6, %5, %bb.5, %4, %bb.4, %3, %bb.3, %2, %bb.2
+ %8:gr64_nosp = DEC64r %1, implicit-def dead $eflags
+
+ bb.8:
+ successors: %bb.2(0x1999999a), %bb.3(0x1999999a), %bb.4(0x1999999a), %bb.5(0x1999999a), %bb.6(0x1999999a)
+
+ JMP64m $noreg, 8, %8, %jump-table.0, $noreg :: (load (s64) from jump-table)
+
+ bb.2:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %13:gr64 = COPY $rax
+ %2:gr64 = COPY %13
+ JMP_1 %bb.1
+
+ bb.3:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %12:gr64 = COPY $rax
+ %3:gr64 = COPY %12
+ JMP_1 %bb.1
+
+ bb.4:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %11:gr64 = COPY $rax
+ %4:gr64 = COPY %11
+ JMP_1 %bb.1
+
+ bb.5:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %10:gr64 = COPY $rax
+ %5:gr64 = COPY %10
+ JMP_1 %bb.1
+
+ bb.6:
+ ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ CALL64pcrel32 target-flags(x86-plt) @f5, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+ ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+ %9:gr64 = COPY $rax
+ %6:gr64 = COPY %9
+ JMP_1 %bb.1
+
+ bb.7:
+
+...
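Distilled from the diff above, the new classification is equivalent to the following standalone predicate (helper name hypothetical): an indirect branch counts as a computed goto only when none of its operands refers to a jump-table index.

// Sketch of the IncludeJumpTable=false path of the patched
// MachineInstr::isIndirectBranch: a jump through a jump table
// (an operand for which isJTI() is true) is deliberately not
// treated as a computed goto, so it keeps the ordinary
// tail-duplication size limit.
#include "llvm/ADT/STLExtras.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

static bool isComputedGoto(const llvm::MachineInstr &MI) {
  return MI.isIndirectBranch() &&
         llvm::none_of(MI.operands(), [](const llvm::MachineOperand &Op) {
           return Op.isJTI();
         });
}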
I will revert #78582 after this PR is merged.
Please don't link to gcc source code in commit messages. I think it might make sense to try to pursue a more general solution here... can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something. But special-casing indirectbr seems okay short-term.
Removed.
I think the current IR is already what you want: https://llvm.godbolt.org/z/o5YT191P9.
Currently, taildup means we end up with O(N^2) edges, even in the indirectbr case. This patch is just deciding that we're willing to pay that cost in the indirectbr case, because it's a strong hint that the code is important for performance. The question is, can we rework the backend representation to allow taildup without the explosion in the number of edges?
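(Concretely: in the computed_goto test above, each of the duplicated handler blocks gets its own copy of the JMP64m, and every copy lists all N handlers as successors, so N handlers produce on the order of N*N CFG edges.)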
Sounds good; makes sense to me. I can try to address it, although the process may be slow. https://www.ajla-lang.cz/tutorial.html pointed out several issues:
(Force-pushed from 58b5757 to 5e4c0f4.)
ping :p
The idea sounds good. Maybe outlined as
=>
I don't fully understand the details of this improvement yet, but it shouldn't block this PR. So... ping
Just getting back to this, sorry! Is this the latest version of the patch that should fix the regressions? I tried the patch when building Python on macOS and it improves performance by ~0.5%, while with #116072 it increases performance by 2-3%.
I don't have a CPU like the i7-2640M, but I can observe a significant decrease in
This should help CPUs with weaker branch prediction capabilities. I will rebase this PR after #116072 lands.
Rebased.
Ping ~ (From #106846 (comment))
The vendored patches were produced from the latest versions of the following PRs:
* llvm/llvm-project#114990
* llvm/llvm-project#120267
The first improves codegen for computed gotos. There was a regression in LLVM 19 causing a ~10% performance drop in CPython. The second enables BOLT to work with computed gotos. This enables BOLT to accomplish more on CPython.
LGTM
Fixes llvm#106846. This is what I learned from GCC. I found that GCC does not duplicate the BB that has indirect jumps with the jump table. I believe GCC has provided a clear explanation here: > Duplicate the blocks containing computed gotos. This basically unfactors computed gotos that were factored early on in the compilation process to speed up edge based data flow. We used to not unfactor them again, which can seriously pessimize code with many computed jumps in the source code, such as interpreters. (cherry picked from commit dd21aac)
We (at Google) have bisected a huge increase in memory use during compilation (for some specific source files) to this revision. The previous version compiles those files using less than 4GB of memory, while at this revision the compiler exceeds 16GB (we didn't try larger memory limits). The crashing stack due to out-of-memory shows:
Working on a reproducer. LATER EDIT: before this change, the compilation used under 500MB of memory, so the memory blowout is considerably larger for these cases.
Could you also try Clang 18?
Tried at c416b2e and the non-reduced test case does not exceed 512MB of memory.
Could you reproduce
Yes, it reproduces with that setting. So now we know the issue is most likely caused by circumventing the check; it is very likely that the risk this check guards against was a known fact. Can you please add another feature to allow controlling this circumventing for computed gotos?
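For illustration only, such a control could be a cl::opt guarding the new behavior. A hypothetical sketch using LLVM's command-line machinery (the flag name and default are assumptions, not an option that exists in the tree):

// Hypothetical escape hatch: when disabled, computed-goto blocks
// would fall back to the ordinary tail-duplication size checks
// instead of bypassing them.
#include "llvm/Support/CommandLine.h"

static llvm::cl::opt<bool> TailDupUnlimitComputedGoto(
    "tail-dup-unlimit-computed-goto",
    llvm::cl::desc("Allow computed-goto blocks to bypass the usual "
                   "tail-duplication size limits"),
    llvm::cl::init(true), llvm::cl::Hidden);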
That PR, which landed in LLVM 19, was also sent by me.
As I mentioned in the PR description, the key difference is that with computed gotos we expect to preserve the source's original intent and performance.
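For contrast with the &&label sketch earlier, a hedged example (not from the PR) of the case the patch leaves restricted: a dense switch that typically lowers to a single jump-table dispatch.

// Mirrors the jump_table MIR test above: with distinct calls per case,
// a dense switch usually lowers to one jump-table indirect branch
// (the JMP64m through %jump-table.0 shape). That lone dispatch is the
// compiler's own construction, not a factored source-level goto, so
// the patch keeps the ordinary duplication limit for it.
long f1(), f2(), f3(), f4(), f5();

static long dispatch(int op) {
  switch (op) {
  case 0: return f1();
  case 1: return f2();
  case 2: return f3();
  case 3: return f4();
  case 4: return f5();
  default: return -1;
  }
}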
We've found another problem after this commit: the compile time for a number of translation units (protobuf-generated, which means a lot of large switch statements) grew from under 5 seconds to "I have yet to see it finish" (it's been running for more than an hour at this point). And the issue seems to be rather widespread in our codebase. So even if the actual problem is elsewhere (which may well be the case, given the analysis here), this commit unfortunately triggers it without any good workaround AFAIU. I'm working on a shareable reproducer now, but it looks like we need a revert or a workaround (different from disabling the optimization completely, which is not an option) soon. It could take the form of keeping the old behavior under a command-line option, for example. If it helps, stack traces captured during the never-ending compilation look mostly like this:
Here's the reproducer for the OOM issue: repro.cc With
With
Please revert or add a way to avoid this issue (basically what @alexfh requested too).
test.tar.gz (or https://gcc.godbolt.org/z/rfrsc7GaE): the test case for the compilation-time regression. This is only reproducible with
I've sent #132431 to revert this, but if there's a good workaround (or one can be implemented quickly and reliably), that would also be an option.
The current method for determining computed gotos is inaccurate, and I will send a PR. |