Skip to content

Conversation

@michaelselehov
Copy link

AMDGPU: reflect lack of complex addressing in LSR; fix regressions after SROA copy-load promotion

Background

Recent SROA changes promote copy-load style IR more aggressively. As a side effect, LoopStrengthReduce (LSR) now triggers on loops where it previously stayed quiet. On AMDGPU (no rich base+index addressing modes), generic LSR heuristics could pick plans that looked cheaper but expanded into extra address formation work in the loop body, causing performance regressions.

Problem

  • A scaled index with a base register is not free on AMDGPU; it becomes an extra per-iteration add.
  • LSR’s Insns metric did not count this address work, so plans could “win” by 1 instruction in theory but actually add a v_add in the loop.
  • On GFX9+ we care more about per-iteration work than about small, one-time preheader setup.

Solution

  • Implement getScalingFactorCost in GCNTTIImpl to model scaled addressing:
    If HasBaseReg && Scale != 0, return cost 1 (one extra per-iteration instruction).
  • Use Effective Insns as the top metric in isLSRCostLess (GFX9+):
    Effective Insns = Insns + ScaleCost
    This makes scaled address formation count as real per-iteration work.

Priority order on GFX9+

  1. Effective Insns (Insns + ScaleCost)
  2. NumIVMuls (penalize mul/mul_hi/addc IV chains)
  3. AddRecCost (prefer fewer IV updates)
  4. NumBaseAdds
  5. SetupCost
  6. ImmCost
  7. NumRegs
  • Pre-GFX9 behavior is unchanged (fall back to base implementation).

Rationale

  • AMDGPU lacks hardware complex addressing. Base + (iv * scale) requires an extra instruction; modeling that as ScaleCost = 1 and folding into Effective Insns fixes false “wins”.
  • Keeping NumIVMuls high in the order preserves the important penalty for wide IV mul chains (problematic on gfx942).
  • Preheader work remains a lower priority; we accept modest setup if per-iteration work is not worse (often better).

Validation

  • Micro cases that previously regressed now choose pointer IV or a single IV with honest accounting of address work; per-iteration VALU pressure is reduced or unchanged.
  • A GlobalISel case that relies on fewer IV updates still selects the intended plan (no scale involved, so Effective Insns ties and AddRecCost wins).
  • AMDGPU lit tests with autogenerated checks were reviewed and selectively regenerated where code shape improved or shifted benignly. Intent-based tests (no crashes, correct sinking, correct divergence handling) remain valid.
  • Bench on gfx942 remains stable; no increase in actual loop work.

Notes

  • No LSR algorithms were changed; this is TTI cost modeling and plan ordering for AMDGPU.
  • Scope is limited to GFX9+.
  • Further tuning can be added later if we find unrolled-shape outliers, but current data shows this approach addresses the original regressions without harming other cases.

P.S. The commented-out debug messages will stay there till the final form (waiting for CQE extensive perf testing), then I'll remove them.

LSR on AMDGPU was selecting plans that reduce Setup/AddRec cost while
increasing per-iteration base additions (`NumBaseAdds`) in the loop
body.With minimal addressing modes, each extra `add i32` becomes a
separate VALU op and hurts throughput (observed on gfx942).

Teach GCNTTIImpl to compare LSR plans with `NumBaseAdds` and then
`Insns` as dominant criteria, ahead of preheader costs (`SetupCost`,
`AddRecCost`). Also return false from isNumRegsMajorCostOfLSR()
and true from shouldDropLSRSolutionIfLessProfitable(). This steers
LSR away from “-setup +more loop adds” trades; the reduced repro
now keeps baseline and a large kernel regains performance.
- Fixed a few lit-tests by limiting to GFX9 and above
- Regenerated no-crash tests
- Reorder GCNTTIImpl::isLSRCostLess() for GFX9+ to prioritize per-iteration work:
  1) Insns (lower is better)
  2) NumBaseAdds (tie-breaker to Insns)
  3) NumIVMuls (penalize mul/mul_hi/addc IV chains)
  4) AddRecCost, then SetupCost (preheader costs only if per-iter ties)
  5) Minor keys: ScaleCost, then NumRegs
- Keep pre-GFX9 behavior unchanged (fall back to BaseT::isLSRCostLess).
- Rationale: AMDGPU lacks rich addressing modes; we must avoid LSR “wins” that
  reduce setup but add per-iteration base-adds / wide IV mul chains. This ordering
  consistently favors plans with less per-iteration work and reduces VALU pressure
  on gfx9+ (e.g., gfx942). In practice LSR now often drops its “improvement” in
  favor of the baseline plan (as seen in -debug-only=lsr logs).

Tests:
- Regenerate `llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll` checks
  (kernel `introduced_copy_to_sgpr`). Assembly changed in a beneficial way:
  fewer VALU ops in the hot loops and lower register usage:
    num_vgpr: 28 → 25
    numbered_sgpr: 26 → 20
  No functional change to the test’s intent (no RA asserts; copy/RA behavior intact).
  Checks updated via `utils/update_llc_test_checks.py`.
… base-add count

Adjust the LSR cost comparison order for GFX9+ to prioritize per-iteration
IV efficiency over minimizing base-adds.

Previous order had NumBaseAdds checked early, causing the cost model to
prefer baseline solutions with multiple IVs (NumBaseAdds=0) over optimized
LSR solutions with fewer IVs but one base-add (NumBaseAdds=1). This resulted
in extra register moves between IVs in the loop body, increasing VALU pressure.

Changes to priority order for GFX9+:
1. Insns (total per-iteration instructions)
2. NumIVMuls (IV multiplication chains) <- moved above NumBaseAdds
3. AddRecCost (per-iteration IV updates) <- moved above NumBaseAdds
4. NumBaseAdds (base address adds) <- moved down
5. SetupCost (preheader costs)
6. ScaleCost, NumRegs (tie-breakers)

Rationale: AddRecCost directly measures per-iteration IV update cost
(fewer IVs = lower cost), which better reflects actual loop body work
than NumBaseAdds alone. This fixes cases where baseline with 2 IVs
generated more VALU instructions than LSR solution with 1 IV + base-add.
This patch refines the AMDGPU LSR cost model to better reflect the lack
of rich addressing modes on the architecture.

Key changes:

1. Implement getScalingFactorCost() to return 1 for base+scale*index
   addressing modes. While AMDGPU's isLegalAddressingMode() reports such
   modes as "legal", they require a separate ADD instruction unlike
   architectures with hardware-supported complex addressing.

2. Update isLSRCostLess() to use EffInsns = Insns + ScaleCost for the
   primary comparison. This ensures that LSR solutions using scaled
   addressing are properly penalized for the actual per-iteration work
   on AMDGPU.

New priority order for GFX9+:
  1. EffInsns (Insns + ScaleCost) - true per-iteration cost
  2. NumIVMuls - IV multiplication chains
  3. AddRecCost - IV update cost
  4. NumBaseAdds - base address additions
  5. SetupCost - preheader costs
  6. ImmCost, NumRegs - tie-breakers

Lit tests were carefully analyzed before regeneration to ensure:
- No functional regressions (all test objectives preserved)
- No incorrect machine sinking or divergence handling issues
- Acceptable trade-offs (preheader overhead vs per-iteration efficiency)
- No register spills or unnecessary barriers introduced
@z1-cciauto
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants