Commit c5c5225

[AMDGPU] Handle direct loads to LDS in memory model
Add additional waitcnt insertion to ensure proper ordering between LDS operations and direct loads from global memory to LDS on pre-GFX10 hardware. Direct LDS loads perform both a global memory load and an LDS store, which can be reordered with respect to other LDS operations without explicit synchronization. This can cause ordering violations even within a single thread. The change conservatively inserts vmcnt(0) waits for all sync scopes when the LDS address space is involved. Future optimizations in SIInsertWaitcnts can relax this to only wait for outstanding direct LDS loads rather than all vmcnt events. This change only affects LDS address space synchronization and preserves existing cross-address space ordering behavior.
1 parent 551d49c commit c5c5225

5 files changed: +103 -1 lines changed

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp

Lines changed: 17 additions & 0 deletions
@@ -1084,6 +1084,7 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
 
   bool VMCnt = false;
   bool LGKMCnt = false;
+  bool DirectLDSWait = false;
 
   if ((AddrSpace & (SIAtomicAddrSpace::GLOBAL | SIAtomicAddrSpace::SCRATCH)) !=
       SIAtomicAddrSpace::NONE) {
@@ -1104,6 +1105,10 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
   }
 
   if ((AddrSpace & SIAtomicAddrSpace::LDS) != SIAtomicAddrSpace::NONE) {
+    // Wait for direct loads to LDS from global memory to ensure that
+    // LDS operations cannot be reordered with respect to global memory
+    // operations.
+    DirectLDSWait = true;
     switch (Scope) {
     case SIAtomicScope::SYSTEM:
     case SIAtomicScope::AGENT:
@@ -1149,6 +1154,18 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
+  // Conservatively wait for vmcnt(0) to ensure that LDS operations and direct
+  // LDS loads from global memory cannot be reordered with respect to each other.
+  // This waitcnt can be safely optimized to wait for a higher vmcnt based on
+  // the number of outstanding direct LDS loads.
+  if (DirectLDSWait) {
+    unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+        IV, 0, getExpcntBitMask(IV), getLgkmcntBitMask(IV));
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_DIRECT_LDS_LOAD_soft))
+        .addImm(WaitCntImmediate);
+    Changed = true;
+  }
+
   if (VMCnt || LGKMCnt) {
     unsigned WaitCntImmediate =
         AMDGPU::encodeWaitcnt(IV,
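
For readers unfamiliar with the waitcnt immediate, the logic added above can be modelled outside of LLVM. The following is a minimal standalone sketch, not the code from this commit: the `AddrSpace` enum and its bit values, `needsDirectLDSWait`, `encodeWaitcntSketch`, and `directLDSWaitImmediate` are hypothetical names, and the field layout (vmcnt in bits [3:0], expcnt in [6:4], lgkmcnt in [11:8]) is assumed to be the GFX6 to GFX8 `s_waitcnt` encoding. The real `AMDGPU::encodeWaitcnt` takes an ISA version and handles other layouts.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for SIAtomicAddrSpace. The bit values are illustrative
// assumptions, not copied from the LLVM sources.
enum class AddrSpace : uint32_t {
  NONE = 0,
  GLOBAL = 1u << 0,
  LDS = 1u << 1,
  SCRATCH = 1u << 2,
};

constexpr AddrSpace operator|(AddrSpace A, AddrSpace B) {
  return static_cast<AddrSpace>(static_cast<uint32_t>(A) |
                                static_cast<uint32_t>(B));
}

constexpr AddrSpace operator&(AddrSpace A, AddrSpace B) {
  return static_cast<AddrSpace>(static_cast<uint32_t>(A) &
                                static_cast<uint32_t>(B));
}

// Assumed GFX6-GFX8 field widths: a counter set to its maximum value
// means "do not wait on this counter".
const unsigned VmcntMax = 0xF;
const unsigned ExpcntMax = 0x7;
const unsigned LgkmcntMax = 0xF;

// Hypothetical simplified encoder for the assumed layout.
inline unsigned encodeWaitcntSketch(unsigned Vmcnt, unsigned Expcnt,
                                    unsigned Lgkmcnt) {
  assert(Vmcnt <= VmcntMax && Expcnt <= ExpcntMax && Lgkmcnt <= LgkmcntMax);
  return Vmcnt | (Expcnt << 4) | (Lgkmcnt << 8);
}

// Mirrors the condition in the patch: any request that involves the LDS
// address space triggers the extra wait, regardless of sync scope.
inline bool needsDirectLDSWait(AddrSpace AS) {
  return (AS & AddrSpace::LDS) != AddrSpace::NONE;
}

// vmcnt(0) with expcnt and lgkmcnt left at their "no wait" maximum.
inline unsigned directLDSWaitImmediate() {
  return encodeWaitcntSketch(0, ExpcntMax, LgkmcntMax); // 0xF70
}
```

Under these assumptions, only the vmcnt field is forced to zero, so the inserted instruction waits for all outstanding vector memory operations (including direct LDS loads) without adding any wait on the other two counters.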
