[AMDGPU] TTI: Prioritize per-iter base adds in LSR plan comparison #510
Open
michaelselehov wants to merge 6 commits into amd-staging from amd/dev/mselehov/amdgpu-lsr-cost-model
Conversation
LSR on AMDGPU was selecting plans that reduce Setup/AddRec cost while increasing per-iteration base additions (`NumBaseAdds`) in the loop body. With minimal addressing modes, each extra `add i32` becomes a separate VALU op and hurts throughput (observed on gfx942). Teach GCNTTIImpl to compare LSR plans with `NumBaseAdds` and then `Insns` as the dominant criteria, ahead of preheader costs (`SetupCost`, `AddRecCost`). Also return false from `isNumRegsMajorCostOfLSR()` and true from `shouldDropLSRSolutionIfLessProfitable()`. This steers LSR away from "less setup, more loop adds" trades; the reduced reproducer now keeps the baseline plan and a large kernel regains performance.
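A minimal sketch of the two hook overrides mentioned above, assuming they live in GCNTTIImpl (AMDGPUTargetTransformInfo.h) and simply return fixed answers; illustrative only, not the literal patch text:

```cpp
// In GCNTTIImpl (AMDGPUTargetTransformInfo.h) -- hypothetical sketch of the
// two overrides described above.

// Register count is not the dominant LSR cost on AMDGPU; per-iteration VALU
// work is, so don't let NumRegs dominate the plan comparison.
bool isNumRegsMajorCostOfLSR() const { return false; }

// Allow LSR to discard its "improved" solution when the comparison ranks the
// baseline plan as no worse.
bool shouldDropLSRSolutionIfLessProfitable() const { return true; }
```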
- Fixed a few lit tests by limiting them to GFX9 and above
- Regenerated no-crash tests
- Reorder GCNTTIImpl::isLSRCostLess() for GFX9+ to prioritize per-iteration work:
1) Insns (lower is better)
2) NumBaseAdds (tie-breaker after Insns)
3) NumIVMuls (penalize mul/mul_hi/addc IV chains)
4) AddRecCost, then SetupCost (preheader costs matter only when per-iteration work ties)
5) Minor keys: ScaleCost, then NumRegs
- Keep pre-GFX9 behavior unchanged (fall back to BaseT::isLSRCostLess).
- Rationale: AMDGPU lacks rich addressing modes; we must avoid LSR “wins” that
reduce setup but add per-iteration base-adds / wide IV mul chains. This ordering
consistently favors plans with less per-iteration work and reduces VALU pressure
on GFX9+ (e.g., gfx942). In practice LSR now often drops its "improvement" in
favor of the baseline plan (as seen in `-debug-only=lsr` logs); the comparison is sketched below.
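The ordering above might read roughly as follows in GCNTTIImpl::isLSRCostLess(), comparing the upstream TTI::LSRCost fields with std::tie. The GFX9 gate and exact shape are assumptions for illustration, not the patch text:

```cpp
bool GCNTTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
                               const TargetTransformInfo::LSRCost &C2) const {
  // Keep pre-GFX9 behavior unchanged.
  if (ST->getGeneration() < AMDGPUSubtarget::GFX9)
    return BaseT::isLSRCostLess(C1, C2);

  // Per-iteration work first (Insns, NumBaseAdds, NumIVMuls, AddRecCost),
  // preheader cost (SetupCost) next, minor keys (ScaleCost, NumRegs) last.
  return std::tie(C1.Insns, C1.NumBaseAdds, C1.NumIVMuls, C1.AddRecCost,
                  C1.SetupCost, C1.ScaleCost, C1.NumRegs) <
         std::tie(C2.Insns, C2.NumBaseAdds, C2.NumIVMuls, C2.AddRecCost,
                  C2.SetupCost, C2.ScaleCost, C2.NumRegs);
}
```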
Tests:
- Regenerate `llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll` checks
(kernel `introduced_copy_to_sgpr`). Assembly changed in a beneficial way:
fewer VALU ops in the hot loops and lower register usage:
num_vgpr: 28 → 25
numbered_sgpr: 26 → 20
No functional change to the test’s intent (no RA asserts; copy/RA behavior intact).
Checks updated via `utils/update_llc_test_checks.py`.
… base-add count

Adjust the LSR cost comparison order for GFX9+ to prioritize per-iteration IV efficiency over minimizing base-adds. The previous order checked NumBaseAdds early, causing the cost model to prefer baseline solutions with multiple IVs (NumBaseAdds=0) over optimized LSR solutions with fewer IVs but one base-add (NumBaseAdds=1). This resulted in extra register moves between IVs in the loop body, increasing VALU pressure.

Changes to the priority order for GFX9+:
1. Insns (total per-iteration instructions)
2. NumIVMuls (IV multiplication chains) <- moved above NumBaseAdds
3. AddRecCost (per-iteration IV updates) <- moved above NumBaseAdds
4. NumBaseAdds (base address adds) <- moved down
5. SetupCost (preheader costs)
6. ScaleCost, NumRegs (tie-breakers)

Rationale: AddRecCost directly measures per-iteration IV update cost (fewer IVs = lower cost), which better reflects actual loop-body work than NumBaseAdds alone. This fixes cases where a baseline with 2 IVs generated more VALU instructions than an LSR solution with 1 IV + base-add; a standalone worked comparison of the two orders follows below.
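A standalone (non-LLVM) illustration of why moving AddRecCost above NumBaseAdds flips the decision. The struct mirrors a subset of TTI::LSRCost, and the two cost tallies are hypothetical numbers chosen to match the scenario described above:

```cpp
#include <cstdio>
#include <tuple>

// Subset of the TTI::LSRCost fields that matter for this example.
struct Cost {
  unsigned Insns, NumIVMuls, AddRecCost, NumBaseAdds, SetupCost;
};

// Old order: NumBaseAdds is compared right after Insns.
static bool lessOld(const Cost &A, const Cost &B) {
  return std::tie(A.Insns, A.NumBaseAdds, A.NumIVMuls, A.AddRecCost, A.SetupCost) <
         std::tie(B.Insns, B.NumBaseAdds, B.NumIVMuls, B.AddRecCost, B.SetupCost);
}

// New order: NumIVMuls and AddRecCost outrank NumBaseAdds.
static bool lessNew(const Cost &A, const Cost &B) {
  return std::tie(A.Insns, A.NumIVMuls, A.AddRecCost, A.NumBaseAdds, A.SetupCost) <
         std::tie(B.Insns, B.NumIVMuls, B.AddRecCost, B.NumBaseAdds, B.SetupCost);
}

int main() {
  // Hypothetical tallies: two-IV baseline vs one IV plus one base-add.
  Cost Baseline{/*Insns=*/4, /*NumIVMuls=*/0, /*AddRecCost=*/2, /*NumBaseAdds=*/0, /*SetupCost=*/0};
  Cost LsrPlan {/*Insns=*/4, /*NumIVMuls=*/0, /*AddRecCost=*/1, /*NumBaseAdds=*/1, /*SetupCost=*/1};
  std::printf("old order prefers LSR plan: %d\n", lessOld(LsrPlan, Baseline)); // 0: baseline wins
  std::printf("new order prefers LSR plan: %d\n", lessNew(LsrPlan, Baseline)); // 1: single-IV plan wins
}
```

With equal Insns, the old order falls through to NumBaseAdds and keeps the two-IV baseline; the new order lets the lower AddRecCost of the single-IV plan decide.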
This patch refines the AMDGPU LSR cost model to better reflect the lack of rich addressing modes on the architecture.

Key changes:
1. Implement getScalingFactorCost() to return 1 for base+scale*index addressing modes (a sketch follows below). While AMDGPU's isLegalAddressingMode() reports such modes as "legal", they require a separate ADD instruction, unlike architectures with hardware-supported complex addressing.
2. Update isLSRCostLess() to use EffInsns = Insns + ScaleCost for the primary comparison. This ensures that LSR solutions using scaled addressing are properly penalized for the actual per-iteration work on AMDGPU.

New priority order for GFX9+:
1. EffInsns (Insns + ScaleCost) - true per-iteration cost
2. NumIVMuls - IV multiplication chains
3. AddRecCost - IV update cost
4. NumBaseAdds - base address additions
5. SetupCost - preheader costs
6. ImmCost, NumRegs - tie-breakers

Lit tests were carefully analyzed before regeneration to ensure:
- No functional regressions (all test objectives preserved)
- No incorrect machine sinking or divergence-handling issues
- Acceptable trade-offs (preheader overhead vs per-iteration efficiency)
- No register spills or unnecessary barriers introduced
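A sketch of the getScalingFactorCost() change from point 1. The parameter list mirrors the long-standing shape of that TTI hook and may differ from the signature in this tree; treat it as illustrative:

```cpp
InstructionCost GCNTTIImpl::getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
                                                 int64_t BaseOffset,
                                                 bool HasBaseReg, int64_t Scale,
                                                 unsigned AddrSpace) const {
  // isLegalAddressingMode() may call base + scale*index "legal", but AMDGPU
  // has no hardware scaled addressing: it materializes as a separate ADD each
  // iteration, so charge one instruction for it.
  if (HasBaseReg && Scale != 0)
    return 1;
  return 0;
}
```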
AMDGPU: reflect lack of complex addressing in LSR; fix regressions after SROA copy-load promotion
Background
Recent SROA changes promote copy-load style IR more aggressively. As a side effect, LoopStrengthReduce (LSR) now triggers on loops where it previously stayed quiet. On AMDGPU (no rich base+index addressing modes), generic LSR heuristics could pick plans that looked cheaper but expanded into extra address formation work in the loop body, causing performance regressions.
Problem
Solution
- `getScalingFactorCost` in `GCNTTIImpl` to model scaled addressing: if `HasBaseReg && Scale != 0`, return cost 1 (one extra per-iteration instruction).
- `isLSRCostLess` (GFX9+): effective Insns = `Insns + ScaleCost`. This makes scaled address formation count as real per-iteration work; a sketch combining both is below.
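Putting the two bullets together, the GFX9+ comparison could look roughly like this, assuming the upstream TTI::LSRCost field names; a sketch rather than the exact patch:

```cpp
bool GCNTTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
                               const TargetTransformInfo::LSRCost &C2) const {
  if (ST->getGeneration() < AMDGPUSubtarget::GFX9)
    return BaseT::isLSRCostLess(C1, C2);

  // Scaled addressing expands to a real per-iteration ADD on AMDGPU, so fold
  // ScaleCost into the instruction count before comparing.
  unsigned EffInsns1 = C1.Insns + C1.ScaleCost;
  unsigned EffInsns2 = C2.Insns + C2.ScaleCost;
  return std::tie(EffInsns1, C1.NumIVMuls, C1.AddRecCost, C1.NumBaseAdds,
                  C1.SetupCost, C1.ImmCost, C1.NumRegs) <
         std::tie(EffInsns2, C2.NumIVMuls, C2.AddRecCost, C2.NumBaseAdds,
                  C2.SetupCost, C2.ImmCost, C2.NumRegs);
}
```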
Priority order on GFX9+
Rationale
Validation
Notes
P.S. The commented-out debug messages will stay in until the final form (waiting on the extensive CQE perf testing), then I'll remove them.