[AMDGPU] TTI: Prioritize per-iter base adds in LSR plan comparison #510

michaelselehov · 2025-11-06T15:28:47Z

AMDGPU: reflect lack of complex addressing in LSR; fix regressions after SROA copy-load promotion

Background

Recent SROA changes promote copy-load style IR more aggressively. As a side effect, LoopStrengthReduce (LSR) now triggers on loops where it previously stayed quiet. On AMDGPU (no rich base+index addressing modes), generic LSR heuristics could pick plans that looked cheaper but expanded into extra address formation work in the loop body, causing performance regressions.

Problem

A scaled index with a base register is not free on AMDGPU; it becomes an extra per-iteration add.
LSR’s Insns metric did not count this address work, so plans could “win” by 1 instruction in theory but actually add a v_add in the loop.
On GFX9+ we care more about per-iteration work than about small, one-time preheader setup.

Solution

Implement getScalingFactorCost in GCNTTIImpl to model scaled addressing:
If HasBaseReg && Scale != 0, return cost 1 (one extra per-iteration instruction).
Use Effective Insns as the top metric in isLSRCostLess (GFX9+):
Effective Insns = Insns + ScaleCost
This makes scaled address formation count as real per-iteration work.

Priority order on GFX9+

Effective Insns (Insns + ScaleCost)
NumIVMuls (penalize mul/mul_hi/addc IV chains)
AddRecCost (prefer fewer IV updates)
NumBaseAdds
SetupCost
ImmCost
NumRegs

Pre-GFX9 behavior is unchanged (fall back to base implementation).

Rationale

AMDGPU lacks hardware complex addressing. Base + (iv * scale) requires an extra instruction; modeling that as ScaleCost = 1 and folding into Effective Insns fixes false “wins”.
Keeping NumIVMuls high in the order preserves the important penalty for wide IV mul chains (problematic on gfx942).
Preheader work remains a lower priority; we accept modest setup if per-iteration work is not worse (often better).

Validation

Micro cases that previously regressed now choose pointer IV or a single IV with honest accounting of address work; per-iteration VALU pressure is reduced or unchanged.
A GlobalISel case that relies on fewer IV updates still selects the intended plan (no scale involved, so Effective Insns ties and AddRecCost wins).
AMDGPU lit tests with autogenerated checks were reviewed and selectively regenerated where code shape improved or shifted benignly. Intent-based tests (no crashes, correct sinking, correct divergence handling) remain valid.
Bench on gfx942 remains stable; no increase in actual loop work.

Notes

No LSR algorithms were changed; this is TTI cost modeling and plan ordering for AMDGPU.
Scope is limited to GFX9+.
Further tuning can be added later if we find unrolled-shape outliers, but current data shows this approach addresses the original regressions without harming other cases.

P.S. The commented-out debug messages will stay there till the final form (waiting for CQE extensive perf testing), then I'll remove them.

LSR on AMDGPU was selecting plans that reduce Setup/AddRec cost while increasing per-iteration base additions (`NumBaseAdds`) in the loop body.With minimal addressing modes, each extra `add i32` becomes a separate VALU op and hurts throughput (observed on gfx942). Teach GCNTTIImpl to compare LSR plans with `NumBaseAdds` and then `Insns` as dominant criteria, ahead of preheader costs (`SetupCost`, `AddRecCost`). Also return false from isNumRegsMajorCostOfLSR() and true from shouldDropLSRSolutionIfLessProfitable(). This steers LSR away from “-setup +more loop adds” trades; the reduced repro now keeps baseline and a large kernel regains performance.

- Fixed a few lit-tests by limiting to GFX9 and above - Regenerated no-crash tests

- Reorder GCNTTIImpl::isLSRCostLess() for GFX9+ to prioritize per-iteration work: 1) Insns (lower is better) 2) NumBaseAdds (tie-breaker to Insns) 3) NumIVMuls (penalize mul/mul_hi/addc IV chains) 4) AddRecCost, then SetupCost (preheader costs only if per-iter ties) 5) Minor keys: ScaleCost, then NumRegs - Keep pre-GFX9 behavior unchanged (fall back to BaseT::isLSRCostLess). - Rationale: AMDGPU lacks rich addressing modes; we must avoid LSR “wins” that reduce setup but add per-iteration base-adds / wide IV mul chains. This ordering consistently favors plans with less per-iteration work and reduces VALU pressure on gfx9+ (e.g., gfx942). In practice LSR now often drops its “improvement” in favor of the baseline plan (as seen in -debug-only=lsr logs). Tests: - Regenerate `llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll` checks (kernel `introduced_copy_to_sgpr`). Assembly changed in a beneficial way: fewer VALU ops in the hot loops and lower register usage: num_vgpr: 28 → 25 numbered_sgpr: 26 → 20 No functional change to the test’s intent (no RA asserts; copy/RA behavior intact). Checks updated via `utils/update_llc_test_checks.py`.

… base-add count Adjust the LSR cost comparison order for GFX9+ to prioritize per-iteration IV efficiency over minimizing base-adds. Previous order had NumBaseAdds checked early, causing the cost model to prefer baseline solutions with multiple IVs (NumBaseAdds=0) over optimized LSR solutions with fewer IVs but one base-add (NumBaseAdds=1). This resulted in extra register moves between IVs in the loop body, increasing VALU pressure. Changes to priority order for GFX9+: 1. Insns (total per-iteration instructions) 2. NumIVMuls (IV multiplication chains) <- moved above NumBaseAdds 3. AddRecCost (per-iteration IV updates) <- moved above NumBaseAdds 4. NumBaseAdds (base address adds) <- moved down 5. SetupCost (preheader costs) 6. ScaleCost, NumRegs (tie-breakers) Rationale: AddRecCost directly measures per-iteration IV update cost (fewer IVs = lower cost), which better reflects actual loop body work than NumBaseAdds alone. This fixes cases where baseline with 2 IVs generated more VALU instructions than LSR solution with 1 IV + base-add.

This patch refines the AMDGPU LSR cost model to better reflect the lack of rich addressing modes on the architecture. Key changes: 1. Implement getScalingFactorCost() to return 1 for base+scale*index addressing modes. While AMDGPU's isLegalAddressingMode() reports such modes as "legal", they require a separate ADD instruction unlike architectures with hardware-supported complex addressing. 2. Update isLSRCostLess() to use EffInsns = Insns + ScaleCost for the primary comparison. This ensures that LSR solutions using scaled addressing are properly penalized for the actual per-iteration work on AMDGPU. New priority order for GFX9+: 1. EffInsns (Insns + ScaleCost) - true per-iteration cost 2. NumIVMuls - IV multiplication chains 3. AddRecCost - IV update cost 4. NumBaseAdds - base address additions 5. SetupCost - preheader costs 6. ImmCost, NumRegs - tie-breakers Lit tests were carefully analyzed before regeneration to ensure: - No functional regressions (all test objectives preserved) - No incorrect machine sinking or divergence handling issues - Acceptable trade-offs (preheader overhead vs per-iteration efficiency) - No register spills or unnecessary barriers introduced

z1-cciauto · 2025-11-06T15:29:49Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/2692

michaelselehov added 6 commits November 6, 2025 09:27

some lit fixes

55bf791

- Fixed a few lit-tests by limiting to GFX9 and above - Regenerated no-crash tests

regen test

d997a52

michaelselehov added the testing only label Nov 6, 2025

michaelselehov requested review from akadutta, dhruvachak and hidekisaito November 6, 2025 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AMDGPU] TTI: Prioritize per-iter base adds in LSR plan comparison #510

[AMDGPU] TTI: Prioritize per-iter base adds in LSR plan comparison #510

Uh oh!

michaelselehov commented Nov 6, 2025

Uh oh!

z1-cciauto commented Nov 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[AMDGPU] TTI: Prioritize per-iter base adds in LSR plan comparison #510

Are you sure you want to change the base?

[AMDGPU] TTI: Prioritize per-iter base adds in LSR plan comparison #510

Uh oh!

Conversation

michaelselehov commented Nov 6, 2025

AMDGPU: reflect lack of complex addressing in LSR; fix regressions after SROA copy-load promotion

Background

Problem

Solution

Priority order on GFX9+

Rationale

Validation

Notes

Uh oh!

z1-cciauto commented Nov 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants