[AMDGPU][DAGCombiner][GlobalISel] Prevent FMA contraction when multiply cannot be eliminated #693
The purpose of this PR is for testing and internal discussion only. Upstream PR will be opened soon.
Summary
This patch fixes an optimization issue where multiply operations were being contracted into FMA (Fused Multiply-Add) instructions even when the multiply had multiple non-contractable uses, resulting in code duplication without reducing total operation count.
Problem: The DAGCombiner and GlobalISel would create FMAs from patterns like `fadd(fmul(x, y), z)` even when `fmul(x, y)` had other uses that prevented it from being removed. This caused the multiply to be duplicated (once fused into the FMA, once for the other uses) without any net performance benefit.

Solution: Before contracting into an FMA, verify that ALL uses of the multiply can be contracted. The implementation checks each use of the multiply against known contractable patterns. One assumption is made when a use is itself an FMA/FMAD: the conditions under which a multiply gets contracted into an FMA/FMAD are very complex, and tracking them all would be complicated. Instead, a more conservative approach was taken, consistent with the existing behavior: an FMA/FMAD use is always assumed to be contractable. This may leave edge cases where we still generate an FMA from a multiply+add even though the other use of the multiply is an FMA that the multiply will not actually be contracted into; but that is no different from the existing behavior. The full list of patterns checked:
- `FMUL → FADD/FSUB` (when fusion is allowed globally or via contract flags)
- `FMUL → FNEG → FSUB` (FNEG folds into FMA with a negated operand)
- `FMUL → FPEXT → {FADD, FSUB, FMA, FMAD}` (when the target supports it via `isFPExtFoldable`)
- `FMUL → FNEG → FPEXT → FSUB` and `FMUL → FPEXT → FNEG → FSUB`
- `FMUL → {FMA, FMAD}` (assumed contractable to maintain existing behavior)

Only if all uses match these patterns does contraction proceed, ensuring the multiply will be eliminated rather than duplicated.
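The use-walk can be sketched as follows. This is an illustrative stand-in, not the actual patch: `Node`, `Opcode`, `isContractableUse`, and `allUsesContractable` are hypothetical names, and the real implementation operates on `SDNode`/`MachineInstr` uses and additionally consults fast-math flags and target hooks such as `isFPExtFoldable`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for a DAG node.
enum class Opcode { FMul, FAdd, FSub, FNeg, FPExt, FMA, FMAD, Store };

struct Node {
  Opcode Op;
  std::vector<const Node *> Uses; // nodes that consume this node's result
};

// Does `Use` consume the multiply in a way that lets it fold into an FMA?
// Simplified mirror of the pattern list above: FADD/FSUB directly,
// FNEG/FPEXT chains ending in a foldable consumer, and FMA/FMAD
// (conservatively assumed contractable).
static bool isContractableUse(const Node *Use) {
  switch (Use->Op) {
  case Opcode::FAdd:
  case Opcode::FSub:
  case Opcode::FMA:  // conservative assumption, as described above
  case Opcode::FMAD:
    return true;
  case Opcode::FNeg:
  case Opcode::FPExt:
    // The FNEG/FPEXT chains (in either order) reduce to: every transitive
    // use of the chain must itself be contractable.
    for (const Node *U : Use->Uses)
      if (!isContractableUse(U))
        return false;
    return !Use->Uses.empty();
  default:
    return false; // e.g. a store, or any other opcode, keeps the FMUL alive
  }
}

// Contract only when every use of the multiply disappears with it.
static bool allUsesContractable(const Node *Mul) {
  for (const Node *U : Mul->Uses)
    if (!isContractableUse(U))
      return false;
  return true;
}
```

A multiply whose only use is an `FADD` passes the check; adding a second, non-contractable use (say, a store) makes it fail, so contraction is skipped and the multiply stays shared.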
Changes
Modified Files
- `llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp`: check multiply uses before generating an FMA when lowering via SDAG.
- `llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp`: check multiply uses before generating an FMA when lowering via GlobalISel.
- `llvm/lib/Target/AMDGPU/SIISelLowering.cpp`: check multiply uses before generating an FMA when falling back to AMDGPU target-specific lowering.

Added a new lit test:

- `llvm/test/CodeGen/AMDGPU/fma-multiple-uses-contraction.ll`: new comprehensive test file with multiple test cases validating contraction behavior.

Updated existing lit tests:

- `llvm/test/CodeGen/AMDGPU/fma.f16.ll`
- `llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll`
- `llvm/test/CodeGen/AMDGPU/fpext-free.ll`
- `llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll`
- `llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll`

Side effect: several existing tests show improved codegen with reduced instruction counts.
Example
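The test input itself is not reproduced here, but a minimal hypothetical C source pattern that exhibits the issue looks like this (`mul_two_uses` is an illustrative name, not from the patch): the multiply's result is both stored and fed into an add, so fusing the add would not let the multiply be deleted.

```c
#include <assert.h>

/* Hypothetical illustration: `m` has two uses, so contracting the add
 * into an FMA would still leave a standalone multiply behind for `*out`. */
float mul_two_uses(float x, float y, float z, float *out) {
  float m = x * y; /* fmul with multiple uses */
  *out = m;        /* non-contractable use: keeps the fmul alive */
  return m + z;    /* fadd(fmul(x, y), z): the contraction candidate */
}
```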
Before (incorrect contraction with code duplication):

Generated: `v_mul + v_mul + v_fma` (3 instructions, multiply duplicated)

After (contraction skipped, multiply reused):

Generated: `v_mul + v_mul + v_add` (3 instructions, multiply shared)

Performance Impact
I ran this change through SPEChpc on MI300A. Results show an overall improvement (geomean 1.08x): 5/9 benchmarks show speedups ranging from 3.8% up to 40%, while 2/9 show slowdowns (-12% and -3%). Results can be seen here: No_Mul_Duplication_When_Contracting_results.xlsx