Conversation

@adelejjeh commented Nov 26, 2025

This PR is for testing and internal discussion only. An upstream PR will be opened soon.

Summary

This patch fixes an optimization issue where multiply operations were contracted into FMA (Fused Multiply-Add) instructions even when the multiply had other, non-contractable uses, resulting in code duplication without reducing the total operation count.

Problem: The DAGCombiner and GlobalISel would create FMAs from patterns like fadd(fmul(x, y), z) even when fmul(x, y) had other uses that prevented it from being removed. This caused the multiply to be duplicated (once as part of the FMA, once for its other uses) with no net performance benefit.
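
For illustration, here is a minimal IR sketch (hypothetical function name, using the contract fast-math flag) of the single-use case where contraction is unambiguously profitable:

define float @single_use(float %x, float %y, float %z) {
  %mul = fmul contract float %x, %y
  %add = fadd contract float %mul, %z  ; only use of %mul: folds to fma(x, y, z)
  ret float %add
}

With only one use, the fmul disappears into the FMA; the problem above arises only when %mul has additional users that keep it alive.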

Solution: Before contracting into an FMA, verify that ALL uses of the multiply can be contracted. The implementation checks each use of the multiply against known contractable patterns. One assumption is made when a use is an FMA/FMAD: the conditions under which a multiply gets contracted into an FMA/FMAD are very complex, and tracking them all would be complicated. Instead, a more conservative approach is taken, consistent with the existing behavior: an FMA/FMAD use is always assumed to be contractable. This may leave edge cases where we still generate an FMA from a multiply+add even though the other use of the multiply is an FMA that the multiply will not actually be contracted into, but that is no different from the existing behavior. The entire list of patterns being checked:

  1. Direct contraction: FMUL → FADD/FSUB (when fusion is allowed globally or via contract flags)
  2. FNEG folding: FMUL → FNEG → FSUB (FNEG folds into FMA with negated operand)
  3. FPEXT folding: FMUL → FPEXT → {FADD, FSUB, FMA, FMAD} (when the target supports it via isFPExtFoldable)
  4. Nested patterns: FMUL → FNEG → FPEXT → FSUB and FMUL → FPEXT → FNEG → FSUB
  5. FMA chains: FMUL → {FMA, FMAD} (assumed contractable to maintain existing behavior)

Only if all uses match these patterns will contraction proceed, ensuring the multiply will be eliminated rather than duplicated.
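
To make the check concrete, here is a hedged IR sketch (hypothetical names, not taken from the new test file) in which both uses of the multiply match the checked patterns, so contraction proceeds and the multiply is eliminated:

define float @all_uses_contractable(float %x, float %y, float %z, float %w) {
  %mul = fmul contract float %x, %y
  %add = fadd contract float %mul, %z  ; pattern 1: FMUL -> FADD
  %neg = fneg contract float %mul
  %sub = fsub contract float %neg, %w  ; pattern 2: FMUL -> FNEG -> FSUB
  %res = fadd float %add, %sub
  ret float %res
}

Every user of %mul can absorb the multiply into an FMA, so no duplicate v_mul is left behind; if %mul instead also fed, say, a non-contractable fmul, the check would reject contraction, as in the Example section below.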

Changes

Modified Files

  1. llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp - Modified to check multiply uses before generating FMA when lowering using SDAG.

  2. llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp - Modified to check multiply uses before generating FMA when lowering using GlobalISel.

  3. llvm/lib/Target/AMDGPU/SIISelLowering.cpp - Modified to check multiply uses before generating FMA when falling back to AMDGPU target-specific lowering.

Added a new lit test:

  1. llvm/test/CodeGen/AMDGPU/fma-multiple-uses-contraction.ll - New comprehensive test file with multiple test cases validating the contraction behavior.

Updated existing lit tests:

  1. llvm/test/CodeGen/AMDGPU/fma.f16.ll

  2. llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll

  3. llvm/test/CodeGen/AMDGPU/fpext-free.ll

  4. llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll

  5. llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll

Side effect: Several existing tests show improved codegen with reduced instruction counts.

Example

Before (incorrect contraction with code duplication):

%mul = fmul float %x, %y
%use1 = fmul float %mul, %mul  ; Non-contractable use
%add = fadd float %mul, %z     ; Would contract to FMA, duplicating multiply

Generated: v_mul + v_mul + v_fma (3 instructions, multiply duplicated)

After (contraction skipped, multiply reused):

%mul = fmul float %x, %y
%use1 = fmul float %mul, %mul  ; Non-contractable use
%add = fadd float %mul, %z     ; Contraction skipped

Generated: v_mul + v_mul + v_add (3 instructions, multiply shared)
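
For completeness, here is a sketch of the conservative FMA/FMAD-use assumption described in the Solution section (hypothetical names; the llvm.fma.f32 call stands in for an already-formed FMA use):

declare float @llvm.fma.f32(float, float, float)

define float @fma_use_assumed(float %x, float %y, float %z, float %w) {
  %mul = fmul contract float %x, %y
  %fma = call contract float @llvm.fma.f32(float %mul, float %z, float %w) ; pattern 5: FMA use, assumed contractable
  %add = fadd contract float %mul, %w  ; still contracted to an FMA under that assumption
  %res = fadd float %fma, %add
  ret float %res
}

If the FMA use does not in fact absorb %mul, the multiply can still end up duplicated in this edge case, which matches the existing behavior as noted above.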

Performance Impact

I ran this change through SPEChpc on MI300A. Results show an overall improvement (geomean 1.08x), with 2/9 benchmarks showing slowdowns (-12% and -3%) and 5/9 showing speedups ranging from 3.8% up to 40%. Results can be seen here: No_Mul_Duplication_When_Contracting_results.xlsx

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR with a comment saying "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@adelejjeh requested review from arsenm, jayfoad and rampitec and removed request for arsenm, jayfoad and rampitec November 26, 2025 22:15

@jayfoad commented Nov 27, 2025

The idea seems reasonable to me, but note it is somewhat at odds with the original motivation for the "aggressive" option:
llvm@62ac736

[...] The heuristic used by DAGCombine to form FMAs checks that the FMUL has only one use, but this is overly-conservative on some systems. Specifically, if the FMA and the FADD have the same latency (and the FMA does not compete for resources with the FMUL any more than the FADD does), there is no need for the restriction, and furthermore, forming the FMA leaving the FMUL can still allow for higher overall throughput and decreased critical-path length.
