[AMDGPU] Support bottom-up postRA scheduling. #135295
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed 93b6595 to 8574e1f.
Force-pushed 8574e1f to 84e6845.
@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

Solely relying on top-down scheduling can underutilize hardware, since long-latency instructions often end up scheduled too late and their latency isn't well hidden. Adding bottom-up post-RA scheduling lets us move those instructions earlier, which improves latency hiding and yields roughly a 2% performance gain on key benchmarks.

Full diff: https://github.com/llvm/llvm-project/pull/135295.diff

3 Files Affected:
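(For reference, the new mode is exercised through the post-RA machine scheduler's direction flag, as in the RUN lines added in the diff below; `test.mir` here is only a placeholder input.)

```
llc -mtriple=amdgcn -mcpu=gfx908 -run-pass=postmisched \
    -misched-postra-direction=bottomup -verify-misched -o - test.mir
```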
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index aaefe27b1324f..205cb912126e7 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -284,6 +284,33 @@ void GCNHazardRecognizer::processBundle() {
CurrCycleInstr = nullptr;
}
+void GCNHazardRecognizer::processBundleBottomUp() {
+ // Step through each instruction in the bundle in bottom-up order.
+ MachineBasicBlock::instr_iterator MI =
+ std::next(CurrCycleInstr->getIterator());
+ MachineBasicBlock::instr_iterator E =
+ CurrCycleInstr->getParent()->instr_end();
+
+ // Evict stale entries to maintain a fixed lookahead window.
+ // TODO: Hazard detection is not yet implemented. This scheduling
+ // is intended for GFX11 and newer.
+ for (; MI != E && MI->isInsideBundle(); ++MI) {
+ CurrCycleInstr = &*MI;
+
+ // Remove up to (MaxLookAhead - 1) oldest entries.
+ for (unsigned I = 0, E = MaxLookAhead - 1; I < E && !EmittedInstrs.empty();
+ ++I)
+ EmittedInstrs.pop_back();
+
+ EmittedInstrs.push_back(CurrCycleInstr);
+
+ // Keep only the most recent MaxLookAhead entries
+ EmittedInstrs.resize(MaxLookAhead);
+ }
+
+ CurrCycleInstr = nullptr;
+}
+
void GCNHazardRecognizer::runOnInstruction(MachineInstr *MI) {
assert(IsHazardRecognizerMode);
@@ -423,7 +450,41 @@ void GCNHazardRecognizer::AdvanceCycle() {
}
void GCNHazardRecognizer::RecedeCycle() {
- llvm_unreachable("hazard recognizer does not support bottom-up scheduling.");
+ // If no instruction was issued this cycle, pop the oldest placeholder.
+ if (!CurrCycleInstr) {
+ if (!EmittedInstrs.empty())
+ EmittedInstrs.pop_back();
+ return;
+ }
+
+ // If this is a bundle header, handle the entire bundle here.
+ if (CurrCycleInstr->isBundle()) {
+ processBundleBottomUp();
+ return;
+ }
+
+ unsigned NumWaitStates = TII.getNumWaitStates(*CurrCycleInstr);
+ if (!NumWaitStates) {
+ CurrCycleInstr = nullptr;
+ return;
+ }
+
+ // Add current instruction to the emitted list.
+ EmittedInstrs.push_back(CurrCycleInstr);
+
+ // Model remaining wait states by removing older placeholders.
+ for (unsigned I = 1, E = std::min(NumWaitStates, getMaxLookAhead()); I < E;
+ ++I) {
+ if (!EmittedInstrs.empty())
+ EmittedInstrs.pop_back();
+ }
+
+ // getMaxLookahead() is the largest number of wait states we will ever need
+ // to insert, so there is no point in keeping track of more than that many
+ // wait states.
+ EmittedInstrs.resize(getMaxLookAhead());
+
+ CurrCycleInstr = nullptr;
}
//===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
index bbc55851bf967..88c7426be552d 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
@@ -69,6 +69,10 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
// Advance over a MachineInstr bundle. Look for hazards in the bundled
// instructions.
void processBundle();
+ // Recede over a MachineInstr bundle. Adds bundled instructions to the
+ // EmittedInstrs queue in bottom-up scheduling mode.
+ // TODO: Hazard detection is not yet implemented.
+ void processBundleBottomUp();
// Run on an individual instruction in hazard recognizer mode. This can be
// used on a newly inserted instruction before returning from PreEmitNoops.
diff --git a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
index 7bdb8f5b35ec5..02ebffca84bda 100644
--- a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
+++ b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
@@ -1,5 +1,6 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
-# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -verify-misched -o - %s | FileCheck %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -verify-misched -o - %s | FileCheck -check-prefix=CHECK %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx908 -misched-cluster=false -run-pass=postmisched -misched-postra-direction=bottomup -verify-misched -o - %s | FileCheck -check-prefix=CHECK-BOTTOMUP %s
--- |
define amdgpu_kernel void @no_sched_barrier(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
@@ -29,6 +30,21 @@ body: |
; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
; CHECK-NEXT: }
; CHECK-NEXT: S_ENDPGM 0
+ ;
+ ; CHECK-BOTTOMUP-LABEL: name: no_sched_barrier
+ ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
renamable $sgpr0_sgpr1 = IMPLICIT_DEF
renamable $vgpr0 = IMPLICIT_DEF
BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
@@ -66,6 +82,22 @@ body: |
; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
; CHECK-NEXT: }
; CHECK-NEXT: S_ENDPGM 0
+ ;
+ ; CHECK-BOTTOMUP-LABEL: name: sched_barrier_mask_0
+ ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: SCHED_BARRIER 0
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
renamable $sgpr0_sgpr1 = IMPLICIT_DEF
renamable $vgpr0 = IMPLICIT_DEF
BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
@@ -105,6 +137,22 @@ body: |
; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
; CHECK-NEXT: }
; CHECK-NEXT: S_ENDPGM 0
+ ;
+ ; CHECK-BOTTOMUP-LABEL: name: sched_barrier_mask_1
+ ; CHECK-BOTTOMUP: renamable $vgpr0 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+ ; CHECK-BOTTOMUP-NEXT: SCHED_BARRIER 1
+ ; CHECK-BOTTOMUP-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr0_sgpr1, implicit $exec, implicit killed $vgpr2 {
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+ ; CHECK-BOTTOMUP-NEXT: }
+ ; CHECK-BOTTOMUP-NEXT: S_ENDPGM 0
renamable $sgpr0_sgpr1 = IMPLICIT_DEF
renamable $vgpr0 = IMPLICIT_DEF
BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
Force-pushed 4323b2b to 33c5df4.
I don't think this change is actually doing what you expect, because none of the additional code has any impact if hazard recognition using `EmittedInstrs` is never used.

If the goal is to allow enabling of bottom-up scheduling post-RA on GFX11+, then the only required change is:

```cpp
void GCNHazardRecognizer::RecedeCycle() {
  if (IsHazardRecognizerMode || ST.getGeneration() < GFX11)
    llvm_unreachable("hazard recognizer does not support bottom-up scheduling.");
}
```

I expect if you do only the above, you'll find the generated ISA is exactly the same as with this change.

To support architectures that rely on the `EmittedInstrs` vector, its state needs to be properly maintained, and tests covering targets which rely on that state need to be provided.
```cpp
  // Add current instruction to the emitted list.
  EmittedInstrs.push_back(CurrCycleInstr);
```
This sequence of code doesn't make sense.

- It pushes `CurrCycleInstr` onto the end of `EmittedInstrs`.
- It then removes values from the end of `EmittedInstrs`, which starts by disposing of `CurrCycleInstr`, completely negating the previous operation.
- It resizes `EmittedInstrs` to `getMaxLookAhead()`. In most cases `EmittedInstrs` will be 0 or 1 elements in size by this point, so this will expand it with empty elements.

A consistent update would instead keep all bookkeeping on one end of the list, as sketched below.
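For illustration only, a minimal sketch of bookkeeping that would stay consistent, assuming it simply mirrors `AdvanceCycle` with the front of `EmittedInstrs` holding the most recently receded instruction (an editor's sketch, not code from the patch or review):

```cpp
void GCNHazardRecognizer::RecedeCycle() {
  // Stall cycle: record a placeholder, as AdvanceCycle does for stalls.
  if (!CurrCycleInstr) {
    EmittedInstrs.push_front(nullptr);
    return;
  }

  unsigned NumWaitStates = TII.getNumWaitStates(*CurrCycleInstr);
  if (!NumWaitStates) {
    CurrCycleInstr = nullptr;
    return;
  }

  // Record the instruction, then one placeholder per additional wait state,
  // all on the same end of the list...
  EmittedInstrs.push_front(CurrCycleInstr);
  for (unsigned I = 1, E = std::min(NumWaitStates, getMaxLookAhead()); I != E;
       ++I)
    EmittedInstrs.push_front(nullptr);

  // ...so that resize() trims only the oldest entries off the back.
  EmittedInstrs.resize(getMaxLookAhead());
  CurrCycleInstr = nullptr;
}
```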
Thanks a lot, I really appreciate the feedback! :-)

I understand your concerns and agree with your point of view. I was comparing it with `AdvanceCycle()` and misunderstood the difference, but I've now updated the patch: bottom-up scheduling on GFX11+ skips the hazard recognizer, so maintaining `EmittedInstrs` is no longer necessary.

Thanks again for pointing this out!
Force-pushed ee44249 to ebeedaf.
```cpp
  llvm_unreachable(
      "hazard recognizer does not support bottom-up scheduling.");
```
This shouldn't be unreachable; it's a fatal error.
Thanks, I updated it.
Force-pushed fae1c91 to 9471187.
```diff
@@ -417,7 +417,9 @@ void GCNHazardRecognizer::AdvanceCycle() {
 }

 void GCNHazardRecognizer::RecedeCycle() {
-  llvm_unreachable("hazard recognizer does not support bottom-up scheduling.");
+  if (IsHazardRecognizerMode || ST.getGeneration() < AMDGPUSubtarget::GFX11)
```
This should still be `assert(!IsHazardRecognizerMode)`, because the standalone hazard recognizer always runs top-down, right?
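That is, something along these lines (a sketch of the suggested shape, not the reviewer's literal code):

```cpp
void GCNHazardRecognizer::RecedeCycle() {
  // The standalone hazard recognizer only steps forward via AdvanceCycle(),
  // so it should never recede; only the post-RA scheduler's bottom-up mode
  // is expected to call this.
  assert(!IsHazardRecognizerMode);
}
```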
Yes, the hazard recognizer always runs top-down. I have updated it; what do you think?
```cpp
  if (ST.getGeneration() < AMDGPUSubtarget::GFX11)
    report_fatal_error("Hazard recognizer does not support bottom-up "
                       "scheduling on pre-GFX11.");
```
I'm not sure whether this should be an error or not. The compiler will still generate correct code, even on GFX10 or earlier, right? The only problem is that the post-RA scheduler will not be aware of some hazards, so there will be more work for the standalone hazard recognizer pass to do.
Yes, it generates correct code on GFX10, but I think we should enable it first, because I don't believe a single approach can handle all instruction scheduling cases.
I thought the standalone hazard recognizer was only used at -O0, and that the hazard recognition during the post-RA scheduler was used exclusively otherwise, i.e. only one or the other runs, never both.
I believe hazard recognition does not affect bottom-up scheduling, because the `Post-RA Hazard Recognizer` pass runs after the `Post-RA Machine Instruction Scheduler`.

It sets `IsHazardRecognizerMode` to true only when `PreEmitNoops` is invoked, as shown here (llvm-project/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp, lines 297 to 304 at adbbb90):
```cpp
unsigned GCNHazardRecognizer::PreEmitNoops(MachineInstr *MI) {
  IsHazardRecognizerMode = true;
  CurrCycleInstr = MI;
  unsigned W = PreEmitNoopsCommon(MI);
  fixHazards(MI);
  CurrCycleInstr = nullptr;
  return W;
}
```
Furthermore, this hazard recognizer only uses `AdvanceCycle`, not `RecedeCycle` (llvm-project/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp, lines 283 to 294 at adbbb90):
```cpp
void GCNHazardRecognizer::runOnInstruction(MachineInstr *MI) {
  assert(IsHazardRecognizerMode);
  unsigned NumPreNoops = PreEmitNoops(MI);
  EmitNoops(NumPreNoops);
  if (MI->isInsideBundle())
    insertNoopsInBundle(MI, TII, NumPreNoops);
  else
    TII.insertNoops(*MI->getParent(), MachineBasicBlock::iterator(MI),
                    NumPreNoops);
  EmitInstruction(MI);
  AdvanceCycle();
}
```
I do believe this feature is worth investigating further, as it could help hide long-latency instructions and potentially improve GPU performance.
Do you think I should add an assert or report a fatal error instead? I’d appreciate any suggestions you might have.
I am trying to understand why `llc -mcpu=gfx1010 -misched-postra-direction=bottomup` should be an error. The code will still work correctly, right?
Hi Jay, I understand your point. I think adding an assert is enough to support this feature, and I have updated it accordingly.
```diff
@@ -0,0 +1,158 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
```
What does this test show? If it is showing an improvement with bottom-up, can you make that more obvious? Or at least provide a comment in the test explaining what it is testing.
Thanks, I have updated the comments. What do you think?
Force-pushed 19e9b1b to 19f42d0.
LGTM
```
# RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -run-pass=postmisched -verify-misched -o - %s | FileCheck -check-prefix=TopDown %s
# RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -run-pass=postmisched -misched-postra-direction=bottomup -verify-misched -o - %s | FileCheck -check-prefix=BottomUp %s
```
Nit: it is more common for check prefixes to be all capitals:

```diff
-# RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -run-pass=postmisched -verify-misched -o - %s | FileCheck -check-prefix=TopDown %s
-# RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -run-pass=postmisched -misched-postra-direction=bottomup -verify-misched -o - %s | FileCheck -check-prefix=BottomUp %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -run-pass=postmisched -verify-misched -o - %s | FileCheck -check-prefix=TOPDOWN %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -run-pass=postmisched -misched-postra-direction=bottomup -verify-misched -o - %s | FileCheck -check-prefix=BOTTOMUP %s
```
Thanks, I have updated it.