[AMDGPU][SDAG] Only fold flat offsets if they are inbounds #132353

ritter-x2a · 2025-03-21T07:41:03Z

For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, ISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.

This patch only selects flat instructions with immediate offsets from address
computations with the inbounds flag: If the address computation does not leave
the bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.

It also adds the inbounds flag to DAG nodes resulting from transformations:

Address computations resulting from getObjectPtrOffset. As far as I can tell,
this function is only used to compute addresses within accessed memory ranges,
e.g., for loads and stores that are split during legalization.
Reassociated inbounds adds. If both involved operations are inbounds, then so
are operations after the transformation.
Address computations in the SelectionDAG lowering of the memcpy/move/set
intrinsics. Base and result of the address arithmetic there are accessed, so
the operation must be inbounds.

It might make sense to separate these changes into their own PR, but I don't
see a way to test them without adding a use of the inbounds SDAG flag.

Affected tests:

CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded,
added new positive tests where we still do fold them.
Transforms/InferAddressSpaces/AMDGPU/flat_atomic.ll: Offset folding doesn't
seem integral to this test, so the test is not changed to make offset folding
still happen.
CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce prefers to base addresses
on the potentially OOB addresses used for prefetching for memory accesses,
that might be a separate issue to look into.
Added memset tests to CodeGen/AMDGPU/memintrinsic-unroll.ll to make sure that
offsets in the memset DAG lowering are still folded properly.

A similar patch for GlobalISel will follow.

Fixes SWDEV-516125.

ritter-x2a · 2025-03-21T07:41:16Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

[AMDGPU][SDAG] Only fold flat offsets if they are inbounds #132353 👈 (View in Graphite)
[AMDGPU][NFC] Mark GEPs in flat offset folding tests as inbounds #131994
[SDAG] Introduce inbounds flag for pointer arithmetic #131862 : 1 other dependent PR (#131863 )
[SeparateConstOffsetFromGEP] Preserve inbounds flag based on ValueTracking and NUW #130617
[SeparateConstOffsetFromGEP] Don't set unsound inbounds flag #130616
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

ritter-x2a · 2025-03-21T07:44:09Z

This is PR #131863 minus the "add inbounds to GEPs in tests" changes, which are in #131994. Resubmitted because of a Graphite issue.

llvmbot · 2025-03-21T07:45:22Z

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, ISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.

This patch only selects flat instructions with immediate offsets from address
computations with the inbounds flag: If the address computation does not leave
the bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.

It also adds the inbounds flag to DAG nodes resulting from transformations:

Address computations resulting from getObjectPtrOffset. As far as I can tell,
this function is only used to compute addresses within accessed memory ranges,
e.g., for loads and stores that are split during legalization.
Reassociated inbounds adds. If both involved operations are inbounds, then so
are operations after the transformation.
Address computations in the SelectionDAG lowering of the memcpy/move/set
intrinsics. Base and result of the address arithmetic there are accessed, so
the operation must be inbounds.

It might make sense to separate these changes into their own PR, but I don't
see a way to test them without adding a use of the inbounds SDAG flag.

Affected tests:

CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded,
added new positive tests where we still do fold them.
Transforms/InferAddressSpaces/AMDGPU/flat_atomic.ll: Offset folding doesn't
seem integral to this test, so the test is not changed to make offset folding
still happen.
CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce prefers to base addresses
on the potentially OOB addresses used for prefetching for memory accesses,
that might be a separate issue to look into.
Added memset tests to CodeGen/AMDGPU/memintrinsic-unroll.ll to make sure that
offsets in the memset DAG lowering are still folded properly.

A similar patch for GlobalISel will follow.

Fixes SWDEV-516125.

Patch is 49.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/132353.diff

8 Files Affected:

(modified) llvm/include/llvm/CodeGen/SelectionDAG.h (+8-4)
(modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+6-3)
(modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+6-6)
(modified) llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp (+75-65)
(modified) llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll (+369-8)
(modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+12-5)
(modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+241)
(modified) llvm/test/Transforms/InferAddressSpaces/AMDGPU/flat_atomic.ll (+3-3)

diff --git a/llvm/include/llvm/CodeGen/SelectionDAG.h b/llvm/include/llvm/CodeGen/SelectionDAG.h
index 15a2370e5d8b8..aa3668d3e9aae 100644
--- a/llvm/include/llvm/CodeGen/SelectionDAG.h
+++ b/llvm/include/llvm/CodeGen/SelectionDAG.h
@@ -1069,7 +1069,8 @@ class SelectionDAG {
                              SDValue EVL);
 
   /// Returns sum of the base pointer and offset.
-  /// Unlike getObjectPtrOffset this does not set NoUnsignedWrap by default.
+  /// Unlike getObjectPtrOffset this does not set NoUnsignedWrap and InBounds by
+  /// default.
   SDValue getMemBasePlusOffset(SDValue Base, TypeSize Offset, const SDLoc &DL,
                                const SDNodeFlags Flags = SDNodeFlags());
   SDValue getMemBasePlusOffset(SDValue Base, SDValue Offset, const SDLoc &DL,
@@ -1077,15 +1078,18 @@ class SelectionDAG {
 
   /// Create an add instruction with appropriate flags when used for
   /// addressing some offset of an object. i.e. if a load is split into multiple
-  /// components, create an add nuw from the base pointer to the offset.
+  /// components, create an add nuw inbounds from the base pointer to the
+  /// offset.
   SDValue getObjectPtrOffset(const SDLoc &SL, SDValue Ptr, TypeSize Offset) {
-    return getMemBasePlusOffset(Ptr, Offset, SL, SDNodeFlags::NoUnsignedWrap);
+    return getMemBasePlusOffset(
+        Ptr, Offset, SL, SDNodeFlags::NoUnsignedWrap | SDNodeFlags::InBounds);
   }
 
   SDValue getObjectPtrOffset(const SDLoc &SL, SDValue Ptr, SDValue Offset) {
     // The object itself can't wrap around the address space, so it shouldn't be
     // possible for the adds of the offsets to the split parts to overflow.
-    return getMemBasePlusOffset(Ptr, Offset, SL, SDNodeFlags::NoUnsignedWrap);
+    return getMemBasePlusOffset(
+        Ptr, Offset, SL, SDNodeFlags::NoUnsignedWrap | SDNodeFlags::InBounds);
   }
 
   /// Return a new CALLSEQ_START node, that starts new call frame, in which
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index a54857e1037e2..63e012c04fb59 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -1201,9 +1201,12 @@ SDValue DAGCombiner::reassociateOpsCommutative(unsigned Opc, const SDLoc &DL,
 
   if (DAG.isConstantIntBuildVectorOrConstantInt(N01)) {
     SDNodeFlags NewFlags;
-    if (N0.getOpcode() == ISD::ADD && N0->getFlags().hasNoUnsignedWrap() &&
-        Flags.hasNoUnsignedWrap())
-      NewFlags |= SDNodeFlags::NoUnsignedWrap;
+    if (N0.getOpcode() == ISD::ADD) {
+      if (N0->getFlags().hasNoUnsignedWrap() && Flags.hasNoUnsignedWrap())
+        NewFlags |= SDNodeFlags::NoUnsignedWrap;
+      if (N0->getFlags().hasInBounds() && Flags.hasInBounds())
+        NewFlags |= SDNodeFlags::InBounds;
+    }
 
     if (DAG.isConstantIntBuildVectorOrConstantInt(N1)) {
       // Reassociate: (op (op x, c1), c2) -> (op x, (op c1, c2))
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index d1f92c9ef00e9..a86ad2acecd2c 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -8196,7 +8196,7 @@ static SDValue getMemcpyLoadsAndStores(
       if (Value.getNode()) {
         Store = DAG.getStore(
             Chain, dl, Value,
-            DAG.getMemBasePlusOffset(Dst, TypeSize::getFixed(DstOff), dl),
+            DAG.getObjectPtrOffset(dl, Dst, TypeSize::getFixed(DstOff)),
             DstPtrInfo.getWithOffset(DstOff), Alignment, MMOFlags, NewAAInfo);
         OutChains.push_back(Store);
       }
@@ -8221,14 +8221,14 @@ static SDValue getMemcpyLoadsAndStores(
 
       Value = DAG.getExtLoad(
           ISD::EXTLOAD, dl, NVT, Chain,
-          DAG.getMemBasePlusOffset(Src, TypeSize::getFixed(SrcOff), dl),
+          DAG.getObjectPtrOffset(dl, Src, TypeSize::getFixed(SrcOff)),
           SrcPtrInfo.getWithOffset(SrcOff), VT,
           commonAlignment(*SrcAlign, SrcOff), SrcMMOFlags, NewAAInfo);
       OutLoadChains.push_back(Value.getValue(1));
 
       Store = DAG.getTruncStore(
           Chain, dl, Value,
-          DAG.getMemBasePlusOffset(Dst, TypeSize::getFixed(DstOff), dl),
+          DAG.getObjectPtrOffset(dl, Dst, TypeSize::getFixed(DstOff)),
           DstPtrInfo.getWithOffset(DstOff), VT, Alignment, MMOFlags, NewAAInfo);
       OutStoreChains.push_back(Store);
     }
@@ -8365,7 +8365,7 @@ static SDValue getMemmoveLoadsAndStores(SelectionDAG &DAG, const SDLoc &dl,
 
     Value = DAG.getLoad(
         VT, dl, Chain,
-        DAG.getMemBasePlusOffset(Src, TypeSize::getFixed(SrcOff), dl),
+        DAG.getObjectPtrOffset(dl, Src, TypeSize::getFixed(SrcOff)),
         SrcPtrInfo.getWithOffset(SrcOff), *SrcAlign, SrcMMOFlags, NewAAInfo);
     LoadValues.push_back(Value);
     LoadChains.push_back(Value.getValue(1));
@@ -8380,7 +8380,7 @@ static SDValue getMemmoveLoadsAndStores(SelectionDAG &DAG, const SDLoc &dl,
 
     Store = DAG.getStore(
         Chain, dl, LoadValues[i],
-        DAG.getMemBasePlusOffset(Dst, TypeSize::getFixed(DstOff), dl),
+        DAG.getObjectPtrOffset(dl, Dst, TypeSize::getFixed(DstOff)),
         DstPtrInfo.getWithOffset(DstOff), Alignment, MMOFlags, NewAAInfo);
     OutChains.push_back(Store);
     DstOff += VTSize;
@@ -8512,7 +8512,7 @@ static SDValue getMemsetStores(SelectionDAG &DAG, const SDLoc &dl,
     assert(Value.getValueType() == VT && "Value with wrong type.");
     SDValue Store = DAG.getStore(
         Chain, dl, Value,
-        DAG.getMemBasePlusOffset(Dst, TypeSize::getFixed(DstOff), dl),
+        DAG.getObjectPtrOffset(dl, Dst, TypeSize::getFixed(DstOff)),
         DstPtrInfo.getWithOffset(DstOff), Alignment,
         isVol ? MachineMemOperand::MOVolatile : MachineMemOperand::MONone,
         NewAAInfo);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 536bf0c208752..62c009d06a4de 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -1744,72 +1744,82 @@ bool AMDGPUDAGToDAGISel::SelectFlatOffsetImpl(SDNode *N, SDValue Addr,
          isFlatScratchBaseLegal(Addr))) {
       int64_t COffsetVal = cast<ConstantSDNode>(N1)->getSExtValue();
 
-      const SIInstrInfo *TII = Subtarget->getInstrInfo();
-      if (TII->isLegalFLATOffset(COffsetVal, AS, FlatVariant)) {
-        Addr = N0;
-        OffsetVal = COffsetVal;
-      } else {
-        // If the offset doesn't fit, put the low bits into the offset field and
-        // add the rest.
-        //
-        // For a FLAT instruction the hardware decides whether to access
-        // global/scratch/shared memory based on the high bits of vaddr,
-        // ignoring the offset field, so we have to ensure that when we add
-        // remainder to vaddr it still points into the same underlying object.
-        // The easiest way to do that is to make sure that we split the offset
-        // into two pieces that are both >= 0 or both <= 0.
-
-        SDLoc DL(N);
-        uint64_t RemainderOffset;
-
-        std::tie(OffsetVal, RemainderOffset) =
-            TII->splitFlatOffset(COffsetVal, AS, FlatVariant);
-
-        SDValue AddOffsetLo =
-            getMaterializedScalarImm32(Lo_32(RemainderOffset), DL);
-        SDValue Clamp = CurDAG->getTargetConstant(0, DL, MVT::i1);
-
-        if (Addr.getValueType().getSizeInBits() == 32) {
-          SmallVector<SDValue, 3> Opnds;
-          Opnds.push_back(N0);
-          Opnds.push_back(AddOffsetLo);
-          unsigned AddOp = AMDGPU::V_ADD_CO_U32_e32;
-          if (Subtarget->hasAddNoCarry()) {
-            AddOp = AMDGPU::V_ADD_U32_e64;
-            Opnds.push_back(Clamp);
-          }
-          Addr = SDValue(CurDAG->getMachineNode(AddOp, DL, MVT::i32, Opnds), 0);
+      // Adding the offset to the base address in a FLAT instruction must not
+      // change the memory aperture in which the address falls. Therefore we can
+      // only fold offsets from inbounds GEPs into FLAT instructions.
+      bool IsInBounds = Addr->getFlags().hasInBounds();
+      if (COffsetVal == 0 || FlatVariant != SIInstrFlags::FLAT || IsInBounds) {
+        const SIInstrInfo *TII = Subtarget->getInstrInfo();
+        if (TII->isLegalFLATOffset(COffsetVal, AS, FlatVariant)) {
+          Addr = N0;
+          OffsetVal = COffsetVal;
         } else {
-          // TODO: Should this try to use a scalar add pseudo if the base address
-          // is uniform and saddr is usable?
-          SDValue Sub0 = CurDAG->getTargetConstant(AMDGPU::sub0, DL, MVT::i32);
-          SDValue Sub1 = CurDAG->getTargetConstant(AMDGPU::sub1, DL, MVT::i32);
-
-          SDNode *N0Lo = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
-                                                DL, MVT::i32, N0, Sub0);
-          SDNode *N0Hi = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
-                                                DL, MVT::i32, N0, Sub1);
-
-          SDValue AddOffsetHi =
-              getMaterializedScalarImm32(Hi_32(RemainderOffset), DL);
-
-          SDVTList VTs = CurDAG->getVTList(MVT::i32, MVT::i1);
-
-          SDNode *Add =
-              CurDAG->getMachineNode(AMDGPU::V_ADD_CO_U32_e64, DL, VTs,
-                                     {AddOffsetLo, SDValue(N0Lo, 0), Clamp});
-
-          SDNode *Addc = CurDAG->getMachineNode(
-              AMDGPU::V_ADDC_U32_e64, DL, VTs,
-              {AddOffsetHi, SDValue(N0Hi, 0), SDValue(Add, 1), Clamp});
-
-          SDValue RegSequenceArgs[] = {
-              CurDAG->getTargetConstant(AMDGPU::VReg_64RegClassID, DL, MVT::i32),
-              SDValue(Add, 0), Sub0, SDValue(Addc, 0), Sub1};
-
-          Addr = SDValue(CurDAG->getMachineNode(AMDGPU::REG_SEQUENCE, DL,
-                                                MVT::i64, RegSequenceArgs),
-                         0);
+          // If the offset doesn't fit, put the low bits into the offset field
+          // and add the rest.
+          //
+          // For a FLAT instruction the hardware decides whether to access
+          // global/scratch/shared memory based on the high bits of vaddr,
+          // ignoring the offset field, so we have to ensure that when we add
+          // remainder to vaddr it still points into the same underlying object.
+          // The easiest way to do that is to make sure that we split the offset
+          // into two pieces that are both >= 0 or both <= 0.
+
+          SDLoc DL(N);
+          uint64_t RemainderOffset;
+
+          std::tie(OffsetVal, RemainderOffset) =
+              TII->splitFlatOffset(COffsetVal, AS, FlatVariant);
+
+          SDValue AddOffsetLo =
+              getMaterializedScalarImm32(Lo_32(RemainderOffset), DL);
+          SDValue Clamp = CurDAG->getTargetConstant(0, DL, MVT::i1);
+
+          if (Addr.getValueType().getSizeInBits() == 32) {
+            SmallVector<SDValue, 3> Opnds;
+            Opnds.push_back(N0);
+            Opnds.push_back(AddOffsetLo);
+            unsigned AddOp = AMDGPU::V_ADD_CO_U32_e32;
+            if (Subtarget->hasAddNoCarry()) {
+              AddOp = AMDGPU::V_ADD_U32_e64;
+              Opnds.push_back(Clamp);
+            }
+            Addr =
+                SDValue(CurDAG->getMachineNode(AddOp, DL, MVT::i32, Opnds), 0);
+          } else {
+            // TODO: Should this try to use a scalar add pseudo if the base
+            // address is uniform and saddr is usable?
+            SDValue Sub0 =
+                CurDAG->getTargetConstant(AMDGPU::sub0, DL, MVT::i32);
+            SDValue Sub1 =
+                CurDAG->getTargetConstant(AMDGPU::sub1, DL, MVT::i32);
+
+            SDNode *N0Lo = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
+                                                  DL, MVT::i32, N0, Sub0);
+            SDNode *N0Hi = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
+                                                  DL, MVT::i32, N0, Sub1);
+
+            SDValue AddOffsetHi =
+                getMaterializedScalarImm32(Hi_32(RemainderOffset), DL);
+
+            SDVTList VTs = CurDAG->getVTList(MVT::i32, MVT::i1);
+
+            SDNode *Add =
+                CurDAG->getMachineNode(AMDGPU::V_ADD_CO_U32_e64, DL, VTs,
+                                       {AddOffsetLo, SDValue(N0Lo, 0), Clamp});
+
+            SDNode *Addc = CurDAG->getMachineNode(
+                AMDGPU::V_ADDC_U32_e64, DL, VTs,
+                {AddOffsetHi, SDValue(N0Hi, 0), SDValue(Add, 1), Clamp});
+
+            SDValue RegSequenceArgs[] = {
+                CurDAG->getTargetConstant(AMDGPU::VReg_64RegClassID, DL,
+                                          MVT::i32),
+                SDValue(Add, 0), Sub0, SDValue(Addc, 0), Sub1};
+
+            Addr = SDValue(CurDAG->getMachineNode(AMDGPU::REG_SEQUENCE, DL,
+                                                  MVT::i64, RegSequenceArgs),
+                           0);
+          }
         }
       }
     }
diff --git a/llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll b/llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll
index 88cc4b1c96b4a..512dc21ade325 100644
--- a/llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll
+++ b/llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll
@@ -16,8 +16,8 @@
 ;     gep[inbounds](p, i + 3) -> gep(gep(p, i), 3)
 
 
-; FIXME the offset here should not be folded: if %p points to the beginning of
-; scratch or LDS and %i is -1, a folded offset crashes the program.
+; The offset here cannot be folded: if %p points to the beginning of scratch or
+; LDS and %i is -1, a folded offset crashes the program.
 define i32 @flat_offset_maybe_oob(ptr %p, i32 %i) {
 ; GFX90A-LABEL: flat_offset_maybe_oob:
 ; GFX90A:       ; %bb.0:
@@ -26,7 +26,9 @@ define i32 @flat_offset_maybe_oob(ptr %p, i32 %i) {
 ; GFX90A-NEXT:    v_lshlrev_b64 v[2:3], 2, v[2:3]
 ; GFX90A-NEXT:    v_add_co_u32_e32 v0, vcc, v0, v2
 ; GFX90A-NEXT:    v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
-; GFX90A-NEXT:    flat_load_dword v0, v[0:1] offset:12
+; GFX90A-NEXT:    v_add_co_u32_e32 v0, vcc, 12, v0
+; GFX90A-NEXT:    v_addc_co_u32_e32 v1, vcc, 0, v1, vcc
+; GFX90A-NEXT:    flat_load_dword v0, v[0:1]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -37,7 +39,9 @@ define i32 @flat_offset_maybe_oob(ptr %p, i32 %i) {
 ; GFX10-NEXT:    v_lshlrev_b64 v[2:3], 2, v[2:3]
 ; GFX10-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
 ; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
-; GFX10-NEXT:    flat_load_dword v0, v[0:1] offset:12
+; GFX10-NEXT:    v_add_co_u32 v0, vcc_lo, v0, 12
+; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, 0, v1, vcc_lo
+; GFX10-NEXT:    flat_load_dword v0, v[0:1]
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -46,7 +50,8 @@ define i32 @flat_offset_maybe_oob(ptr %p, i32 %i) {
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_ashrrev_i32_e32 v3, 31, v2
 ; GFX942-NEXT:    v_lshl_add_u64 v[0:1], v[2:3], 2, v[0:1]
-; GFX942-NEXT:    flat_load_dword v0, v[0:1] offset:12
+; GFX942-NEXT:    v_lshl_add_u64 v[0:1], v[0:1], 0, 12
+; GFX942-NEXT:    flat_load_dword v0, v[0:1]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -57,9 +62,12 @@ define i32 @flat_offset_maybe_oob(ptr %p, i32 %i) {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX11-NEXT:    v_lshlrev_b64 v[2:3], 2, v[2:3]
 ; GFX11-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
 ; GFX11-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
-; GFX11-NEXT:    flat_load_b32 v0, v[0:1] offset:12
+; GFX11-NEXT:    v_add_co_u32 v0, vcc_lo, v0, 12
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GFX11-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, 0, v1, vcc_lo
+; GFX11-NEXT:    flat_load_b32 v0, v[0:1]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -76,7 +84,10 @@ define i32 @flat_offset_maybe_oob(ptr %p, i32 %i) {
 ; GFX12-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
 ; GFX12-NEXT:    s_wait_alu 0xfffd
 ; GFX12-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
-; GFX12-NEXT:    flat_load_b32 v0, v[0:1] offset:12
+; GFX12-NEXT:    v_add_co_u32 v0, vcc_lo, v0, 12
+; GFX12-NEXT:    s_wait_alu 0xfffd
+; GFX12-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, 0, v1, vcc_lo
+; GFX12-NEXT:    flat_load_b32 v0, v[0:1]
 ; GFX12-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX12-NEXT:    s_wait_alu 0xfffd
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
@@ -157,3 +168,353 @@ define i32 @private_offset_maybe_oob(ptr addrspace(5) %p, i32 %i) {
   %l = load i32, ptr addrspace(5) %arrayidx
   ret i32 %l
 }
+
+; If the GEP that adds the offset is inbounds, folding the offset is legal.
+define i32 @flat_offset_inbounds(ptr %p, i32 %i) {
+; GFX90A-LABEL: flat_offset_inbounds:
+; GFX90A:       ; %bb.0:
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    v_ashrrev_i32_e32 v3, 31, v2
+; GFX90A-NEXT:    v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX90A-NEXT:    v_add_co_u32_e32 v0, vcc, v0, v2
+; GFX90A-NEXT:    v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
+; GFX90A-NEXT:    flat_load_dword v0, v[0:1] offset:12
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: flat_offset_inbounds:
+; GFX10:       ; %bb.0:
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_ashrrev_i32_e32 v3, 31, v2
+; GFX10-NEXT:    v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX10-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
+; GFX10-NEXT:    flat_load_dword v0, v[0:1] offset:12
+; GFX10-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX942-LABEL: flat_offset_inbounds:
+; GFX942:       ; %bb.0:
+; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    v_ashrrev_i32_e32 v3, 31, v2
+; GFX942-NEXT:    v_lshl_add_u64 v[0:1], v[2:3], 2, v[0:1]
+; GFX942-NEXT:    flat_load_dword v0, v[0:1] offset:12
+; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: flat_offset_inbounds:
+; GFX11:       ; %bb.0:
+; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    v_ashrrev_i32_e32 v3, 31, v2
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT:    v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX11-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2)
+; GFX11-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
+; GFX11-NEXT:    flat_load_b32 v0, v[0:1] offset:12
+; GFX11-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-LABEL: flat_offset_inbounds:
+; GFX12:       ; %bb.0:
+; GFX12-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-NEXT:    s_wait_expcnt 0x0
+; GFX12-NEXT:    s_wait_samplecnt 0x0
+; GFX12-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    v_ashrrev_i32_e32 v3, 31, v2
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-NEXT:    v_lshlrev_b64_e32 v[2:3], 2, v[2:3]
+; GFX12-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX12-NEXT:    s_wait_alu 0xfffd
+; GFX12-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
+; GFX12-NEXT:    flat_load_b32 v0, v[0:1] offset:12
+; GFX12-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-NEXT:    s_wait_alu 0xfffd
+; GFX12-NEXT:    s_setpc_b64 s[30:31]
+  %p.1 = getelementptr inbounds i32, ptr %p, i32 %i
+  %arrayidx = getelementptr inbounds i32, ptr %p.1, i32 3
+  %l = load i32, ptr %arrayidx
+  ret i32 %l
+}
+
+define void @flat_offset_inbounds_wide(ptr %p, ptr %pout, i32 %i) {
+; GFX90A-LABEL: flat_offset_inbounds_wide:
+; GFX90A:       ; %bb.0:
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    v_ashrrev_i32_e32 v5, 31, v4
+; GFX90A-NEXT:    v_lshlrev_b64 v[4:5], 2, v[4:5]
+; GFX90A-NEXT:    v_add_co_u32_e32 v0, vcc, v0, v4
+; GFX90A-NEXT:    v_addc_co_u32_e32 v1, vcc, v1, v5, vcc
+; GFX90A-NEXT:    flat_load_dword v8, v[0:1] offset:28
+; GFX90A-NEXT:    flat_load_dwordx4 v[4:7], v[0:1] offset:12
+; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    flat_store_dwo...
[truncated]

ritter-x2a · 2025-03-21T08:03:08Z

llvm/test/Transforms/InferAddressSpaces/AMDGPU/flat_atomic.ll

+; CHECK-NEXT:    s_add_u32 s0, s0, -8
+; CHECK-NEXT:    s_addc_u32 s1, s1, -1
+; CHECK-NEXT:    v_pk_mov_b32 v[2:3], s[0:1], s[0:1] op_sel:[0,1]


@arsenm regarding your question here from the old PR:

Not sure what happened here (and why do we have a codegen test in test/Transforms?)

Before the patch, the negative constant offset is matched as part of the flat instruction, but since negative offsets are illegal in gfx908, the constant offset is split into a part that fits into the immediate (0) and one that doesn't fit (-8).
After the patch, the constant offset is not matched as part of the flat instruction since the address computation resulting from SeparateConstOffsetFromGEP is not marked inbounds, so it is matched separately (apparently resulting in different code).
We could get the same behavior as before by manually applying SeparateConstOffsetFromGEP to the test input and adding inbounds flags to the resulting GEPs, but that doesn't seem very central to the test.

We can also move this test to Codegen/AMDGPU, but I think that would be something for a separate PR.

arsenm · 2025-04-13T08:57:51Z

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp

          SrcPtrInfo.getWithOffset(SrcOff), VT,
          commonAlignment(*SrcAlign, SrcOff), SrcMMOFlags, NewAAInfo);
      OutLoadChains.push_back(Value.getValue(1));

      Store = DAG.getTruncStore(
          Chain, dl, Value,
-          DAG.getMemBasePlusOffset(Dst, TypeSize::getFixed(DstOff), dl),
+          DAG.getObjectPtrOffset(dl, Dst, TypeSize::getFixed(DstOff)),


These should be split into a separate change

For flat memory instructions where the address is supplied as a base address register with an immediate offset, the memory aperture test ignores the immediate offset. Currently, ISel does not respect that, which leads to miscompilations where valid input programs crash when the address computation relies on the immediate offset to get the base address in the proper memory aperture. Global or scratch instructions are not affected. This patch only selects flat instructions with immediate offsets from address computations with the inbounds flag: If the address computation does not leave the bounds of the allocated object, it cannot leave the bounds of the memory aperture and is therefore safe to handle with an immediate offset. It also adds the inbounds flag to DAG nodes resulting from transformations: - Address computations resulting from getObjectPtrOffset. As far as I can tell, this function is only used to compute addresses within accessed memory ranges, e.g., for loads and stores that are split during legalization. - Reassociated inbounds adds. If both involved operations are inbounds, then so are operations after the transformation. - Address computations in the SelectionDAG lowering of the memcpy/move/set intrinsics. Base and result of the address arithmetic there are accessed, so the operation must be inbounds. It might make sense to separate these changes into their own PR, but I don't see a way to test them without adding a use of the inbounds SDAG flag. Affected tests: - CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added new positive tests where we still do fold them. - Transforms/InferAddressSpaces/AMDGPU/flat_atomic.ll: Offset folding doesn't seem integral to this test, so the test is not changed to make offset folding still happen. - CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce prefers to base addresses on the potentially OOB addresses used for prefetching for memory accesses, that might be a separate issue to look into. - Added memset tests to CodeGen/AMDGPU/memintrinsic-unroll.ll to make sure that offsets in the memset DAG lowering are still folded properly. A similar patch for GlobalISel will follow. Fixes SWDEV-516125.

ritter-x2a mentioned this pull request Mar 21, 2025

[AMDGPU][NFC] Mark GEPs in flat offset folding tests as inbounds #131994

Open

ritter-x2a added backend:AMDGPU llvm:SelectionDAG SelectionDAGISel as well labels Mar 21, 2025 — with Graphite App

ritter-x2a requested review from arichardson, arsenm, mariusz-sikora-at-amd and shiltian March 21, 2025 07:45

ritter-x2a marked this pull request as ready for review March 21, 2025 07:47

llvmbot added the llvm:transforms label Mar 21, 2025

ritter-x2a mentioned this pull request Mar 21, 2025

[AMDGPU][SDAG] Only fold flat offsets if they are inbounds #131863

Closed

ritter-x2a commented Mar 21, 2025

View reviewed changes

ritter-x2a force-pushed the users/ritter-x2a/03-21-_amdgpu_sdag_only_fold_flat_offsets_if_they_are_inbounds branch from 2e599c8 to dd12dc2 Compare March 21, 2025 13:35

ritter-x2a force-pushed the users/ritter-x2a/03-19-_amdgpu_nfc_mark_geps_in_flat_offset_folding_tests_as_inbounds branch 2 times, most recently from 173036a to 5e4da87 Compare March 27, 2025 09:02

ritter-x2a force-pushed the users/ritter-x2a/03-21-_amdgpu_sdag_only_fold_flat_offsets_if_they_are_inbounds branch from dd12dc2 to 9294d4f Compare March 27, 2025 09:02

ritter-x2a force-pushed the users/ritter-x2a/03-19-_amdgpu_nfc_mark_geps_in_flat_offset_folding_tests_as_inbounds branch from 5e4da87 to de34b4c Compare March 31, 2025 11:37

ritter-x2a force-pushed the users/ritter-x2a/03-21-_amdgpu_sdag_only_fold_flat_offsets_if_they_are_inbounds branch from 9294d4f to a8155cf Compare March 31, 2025 11:37

ritter-x2a force-pushed the users/ritter-x2a/03-21-_amdgpu_sdag_only_fold_flat_offsets_if_they_are_inbounds branch from a8155cf to 11282b1 Compare April 10, 2025 12:27

ritter-x2a force-pushed the users/ritter-x2a/03-19-_amdgpu_nfc_mark_geps_in_flat_offset_folding_tests_as_inbounds branch 2 times, most recently from 39089ea to 0614081 Compare April 10, 2025 14:08

ritter-x2a force-pushed the users/ritter-x2a/03-21-_amdgpu_sdag_only_fold_flat_offsets_if_they_are_inbounds branch from 11282b1 to b3a2dc9 Compare April 10, 2025 14:08

arsenm reviewed Apr 13, 2025

View reviewed changes

ritter-x2a force-pushed the users/ritter-x2a/03-19-_amdgpu_nfc_mark_geps_in_flat_offset_folding_tests_as_inbounds branch from 0614081 to ab60c5a Compare April 14, 2025 09:39

ritter-x2a force-pushed the users/ritter-x2a/03-21-_amdgpu_sdag_only_fold_flat_offsets_if_they_are_inbounds branch from b3a2dc9 to 164ae85 Compare April 14, 2025 09:39

ritter-x2a force-pushed the users/ritter-x2a/03-19-_amdgpu_nfc_mark_geps_in_flat_offset_folding_tests_as_inbounds branch from ab60c5a to cb87bed Compare April 23, 2025 10:41

ritter-x2a force-pushed the users/ritter-x2a/03-21-_amdgpu_sdag_only_fold_flat_offsets_if_they_are_inbounds branch from 164ae85 to 4adc54a Compare April 23, 2025 10:42

ritter-x2a mentioned this pull request Jun 12, 2025

[Issue]: Flat address space operations can miscompile ROCm/ROCm#4389

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AMDGPU][SDAG] Only fold flat offsets if they are inbounds #132353

[AMDGPU][SDAG] Only fold flat offsets if they are inbounds #132353

Uh oh!

ritter-x2a commented Mar 21, 2025

Uh oh!

ritter-x2a commented Mar 21, 2025 •

edited

Loading

Uh oh!

ritter-x2a commented Mar 21, 2025

Uh oh!

llvmbot commented Mar 21, 2025 •

edited

Loading

Uh oh!

ritter-x2a Mar 21, 2025

Uh oh!

arsenm Apr 13, 2025

Uh oh!

Uh oh!

[AMDGPU][SDAG] Only fold flat offsets if they are inbounds #132353

Are you sure you want to change the base?

[AMDGPU][SDAG] Only fold flat offsets if they are inbounds #132353

Uh oh!

Conversation

ritter-x2a commented Mar 21, 2025

Uh oh!

ritter-x2a commented Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ritter-x2a commented Mar 21, 2025

Uh oh!

llvmbot commented Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ritter-x2a Mar 21, 2025

Choose a reason for hiding this comment

Uh oh!

arsenm Apr 13, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ritter-x2a commented Mar 21, 2025 •

edited

Loading

llvmbot commented Mar 21, 2025 •

edited

Loading