[SROA] Propagate no-signed-zeros(nsz) fast-math flag on the phi node using function attribute #83381
@llvm/pr-subscribers-llvm-transforms

Author: Yashwant Singh (yashssh)

Changes

Based on C/C++ patterns like the following when compiled with '-Ofast':

return X > 0.0 ? X : -X;

InstCombine tries to propagate FMF to 'select' instructions before attempting a fold, but it can't safely propagate 'nsz', and hence wasn't performing the optimization. OTOH, we should be able to do this optimization at '-Ofast', same as gcc (https://godbolt.org/z/c69fe5fa6). This is a bit of a workaround, but this patch allows us to query the "no-signed-zeros-fp-math" function attribute added to the function during '-Ofast' compilation, allowing InstCombine to safely match the idiom.

Full diff: https://github.com/llvm/llvm-project/pull/83381.diff

2 Files Affected:
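For reference, the C idiom the description refers to can be sketched as follows (the function name is illustrative, not from the PR):

```c
#include <assert.h>

/* The branchy absolute-value idiom from the description above. Under
   -Ofast (which implies no-signed-zeros), compilers are expected to
   fold this into a single fabs; without nsz the fold is unsafe because
   my_fabs(0.0) yields -0.0. */
static double my_fabs(double x) {
    return x > 0.0 ? x : -x;
}
```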
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp b/llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
index 71fa9b9ba41ebb..bd56d92d5d3d0f 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
@@ -2742,7 +2742,11 @@ static Instruction *foldSelectWithFCmpToFabs(SelectInst &SI,
// Note: We require "nnan" for this fold because fcmp ignores the signbit
// of NAN, but IEEE-754 specifies the signbit of NAN values with
// fneg/fabs operations.
- if (!SI.hasNoSignedZeros() || !SI.hasNoNaNs())
+ if (!SI.hasNoNaNs())
+ return nullptr;
+
+ bool functionHasNoSignedZeroes = SI.getParent()->getParent()->hasFnAttribute("no-signed-zeros-fp-math");
+ if(!functionHasNoSignedZeroes && !SI.hasNoSignedZeros())
return nullptr;
if (Swap)
diff --git a/llvm/test/Transforms/InstCombine/fabs.ll b/llvm/test/Transforms/InstCombine/fabs.ll
index 7e380c2e4590a0..88b02a852f3d74 100644
--- a/llvm/test/Transforms/InstCombine/fabs.ll
+++ b/llvm/test/Transforms/InstCombine/fabs.ll
@@ -547,6 +547,20 @@ define double @select_fcmp_nnan_nsz_ult_zero_unary_fneg(double %x) {
ret double %fabs
}
+
+define float @absfloat32f_olt_fast_no_signed_zeroes(float %x) "no-signed-zeros-fp-math" {
+; CHECK-LABEL: @absfloat32f_olt_fast_no_signed_zeroes(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[RETVAL_0:%.*]] = call nnan ninf float @llvm.fabs.f32(float [[X:%.*]])
+; CHECK-NEXT: ret float [[RETVAL_0]]
+;
+entry:
+ %cmp = fcmp fast olt float %x, 0.000000e+00
+ %fneg = fneg fast float %x
+ %retval.0 = select i1 %cmp, float %fneg, float %x
+ ret float %retval.0
+}
+
; X < -0.0 ? -X : X --> fabs(X)
define float @select_fcmp_nnan_nsz_olt_negzero(float %x) {
@@ -839,6 +853,19 @@ define <2 x float> @select_fcmp_nnan_nsz_ugt_zero_unary_fneg(<2 x float> %x) {
ret <2 x float> %fabs
}
+define float @absfloat32f_ogt_fast_no_signed_zeroes(float %x) "no-signed-zeros-fp-math" {
+; CHECK-LABEL: @absfloat32f_ogt_fast_no_signed_zeroes(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[RETVAL_0:%.*]] = call nnan ninf float @llvm.fabs.f32(float [[X:%.*]])
+; CHECK-NEXT: ret float [[RETVAL_0]]
+;
+entry:
+ %cmp = fcmp fast ogt float %x, 0.000000e+00
+ %fneg = fneg fast float %x
+ %retval.0 = select i1 %cmp, float %x, float %fneg
+ ret float %retval.0
+}
+
; X > -0.0 ? X : (0.0 - X) --> fabs(X)
define half @select_fcmp_nnan_nsz_ogt_negzero(half %x) {
✅ With the latest revision this PR passed the C/C++ code formatter.
The original inspiration for the missed optimization can be seen here: https://godbolt.org/z/z3Es5oxqv.
I'm wondering whether we should be propagating the function-level attributes to instruction flags somewhere so that this is handled generically, rather than in individual folds?
TBH I think fast-math flags should only be function-level attributes :(
Will that be more efficient? And what about function inlining, where we end up losing these attributes?
That can be done, but it will be a separate piece of work altogether; my understanding is it belongs somewhere early in the pipeline. It was also suggested that it would require allowing
Is there any real-world project whose translation units/functions are compiled with different fast-math flags?
We have pragmas that allow you to turn fast-math on and off locally, at a scope level, so even within a function the fast-math flags can change. I think this is an important capability. One of the top concerns I hear from customers who are using fast-math is that they need the ability to track down optimizations that are throwing their results too far out of range. A first step to tracking this down is to stop using fast-math for individual files or sets of files to bisect the problem down to which file is causing a failure. From there, it gets harder, but pragmas can help. I'd like to introduce a change to the way we check fast-math flags that would let us insert some kind of debug counter to narrow it down, in a way that would be similar to how opt-bisect works but effective at the individual-transformation granularity. There are some obstacles to making that happen, but I think it would be an extremely beneficial capability. Once you find the place where an accuracy-related failure is being introduced, my idea is to have some way (probably based on pragmas) to turn fast-math off for that location only in the code, without needing the debug counter. That's a bit of a pipe dream at this point, but I want to see us working towards it. This is one of the reasons I don't like fast-math function attributes, but the way things currently work is that the attribute isn't set if the relevant fast-math flag is turned off anywhere in the function. The front end handles this with pragmas, but it means that the inliner needs to remove the attribute when inlining calls where the callee and caller don't both have it set.
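The scope-level pragma control mentioned above can be illustrated with clang's `#pragma clang fp`, which toggles individual fast-math behaviors within a block (this is a real clang pragma; other compilers simply ignore unknown pragmas, so the sketch still compiles elsewhere):

```c
#include <assert.h>

/* Scope-level fast-math control: reassociation is permitted only
   inside this function body, regardless of the file-level flags. */
static float sum3(float a, float b, float c) {
#pragma clang fp reassociate(on)
    /* The compiler may evaluate this as a + (b + c) here. */
    return (a + b) + c;
}
```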
The problem is in deciding where to do that. I don't like the function attributes and have made a few attempts to have them deprecated, but there are cases that couldn't be solved without them. This case, as detailed in #51601, is one such example. The front end can't set the fast-math flags on load instructions, so they never make it to the phi instruction that gets created. Propagating the flags from the attribute to instructions wouldn't help unless it was done after SROA and/or SimplifyCFG, but maybe those passes could add the flag based on the function-level attribute. That might be a better way to fix this.
I'm leaving this pull request open for now while I look into whether I can set FMF in SROA; I'll report back with whatever I find.
This is obviously a contrived example, but here is a case that will not be optimized if you rely on the function attributes but could be optimized if the flags were propagated by SROA.
Thanks for the example @andykaylor! I tried a change that sets
Is this the direction you were hinting towards? |
Should add the Fixes: issue number to the description |
ping @arsenm |
I do think we should have the phase ordering test |
Although I didn't have a PhaseOrdering test, I had initially added a test that showed the transformation after the passes concerned, and @jcranmer-intel insisted there that this was not required.
The right way to do a phase-ordering test is not the way you did it. There's a test/Transforms/PhaseOrdering directory where the phase-ordering tests go, and in general, you would probably want to check a test like
It's not great, but it was the consensus on the ticket. The phase-ordering test could use a comment and a rename, though.
…using function attribute

It's expected that the sequence return X > 0.0 ? X : -X, compiled with -Ofast, produces the fabs intrinsic. However, at this point, LLVM is unable to do so. The above sequence goes through the following transformation during the pass pipeline:
1) The SROA pass generates the phi node. Here, it does not infer the fast-math flags on the phi node, unlike the clang frontend typically does.
2) The phi node eventually gets translated into a select instruction. Because of the missing no-signed-zeros (nsz) fast-math flag on the select instruction, the InstCombine pass fails to fold the sequence into the fabs intrinsic.

This patch, as a part of SROA, tries to propagate the nsz fast-math flag on the phi node using the function attribute, enabling this folding.

Co-authored-by: Sushant Gokhale <sgokhale@nvidia.com>
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/72/builds/683

Here is the relevant piece of the build log for reference:
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/141/builds/428

Here is the relevant piece of the build log for reference:
llvm#83381 introduced a call to PHINode::isComplete() in Mem2Reg, which is O(n^2) in the number of predecessors, resulting in pathological compile-time regressions for cases with many predecessors. I believe this was intended as a compile-time optimization in the first place, so just drop the check, and always try to transfer the attribute.
#83381 introduced a call to PHINode::isComplete() in Mem2Reg, which is O(n^2) in the number of predecessors, resulting in pathological compile-time regressions for cases with many predecessors. Remove the isComplete() check and instead cache the attribute lookup, to only perform it once per function. Actually setting the FMF flag is cheap.
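The caching fix described above amounts to hoisting a per-function lookup out of the per-phi loop. A minimal sketch of the pattern (not the actual LLVM code; names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Cache a per-function property so the expensive lookup runs once,
   instead of once per phi node (or an O(n^2) completeness check). */
struct fn_ctx {
    bool cached;
    bool has_nsz;
    int lookups; /* counts how often the expensive path actually ran */
};

static bool expensive_attr_lookup(struct fn_ctx *c) {
    c->lookups++;
    return true; /* stand-in for reading "no-signed-zeros-fp-math" */
}

static bool fn_has_nsz(struct fn_ctx *c) {
    if (!c->cached) {
        c->has_nsz = expensive_attr_lookup(c);
        c->cached = true;
    }
    return c->has_nsz;
}
```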
Alive2 complains about this commit:

define double @fabs_fcmp_olt_nsz_func_attr(double %#0, double %#1) {
entry:
%x = alloca i64 8, align 8
store double %#0, ptr %x, align 8
%cmp = fcmp nnan nsz olt double %#0, 0.000000
br i1 %cmp, label %if.then, label %return
if.then:
%fneg = fneg nnan nsz double %#0
store double %fneg, ptr %x, align 8
br label %return
return:
%ret = load double, ptr %x, align 8
ret double %ret
}
=>
define double @fabs_fcmp_olt_nsz_func_attr(double %#0, double %#1) {
entry:
%cmp = fcmp nnan nsz olt double %#0, 0.000000
br i1 %cmp, label %if.then, label %return
if.then:
%fneg = fneg nnan nsz double %#0
br label %return
return:
%x.0 = phi nsz double [ %fneg, %if.then ], [ %#0, %entry ]
ret double %x.0
}
Transformation doesn't verify! (unsound)
ERROR: Target's return value is more undefined
Example:
double %#0 = #x0000000000000000 (+0.0)
double %#1 = poison
Source:
ptr %x = pointer(local, block_id=0, offset=0)
i1 %cmp = #x0 (0)
>> Jump to %return
double %ret = #x0000000000000000 (+0.0)
Target:
i1 %cmp = #x0 (0)
>> Jump to %return
double %x.0 = #x8000000000000000 (-0.0)
Source value: #x0000000000000000 (+0.0)
Target value: #x8000000000000000 (-0.0)
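The Alive2 counterexample hinges on the one observable difference nsz licenses away: +0.0 and -0.0 compare equal under `==`, yet they are distinct values that `signbit()`/`copysign()` can observe, so returning -0.0 where the source produced +0.0 is "more undefined". A small demonstration:

```c
#include <assert.h>
#include <math.h>

/* Distinguish +0.0 from -0.0, which == cannot do. */
static int sign_of_zero(double z) {
    return signbit(z) ? -1 : 1;
}
```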
@nunoplopes As far as I can tell, Alive2 does not have support for the
Ah, thanks, I missed that.
…he nsz flag

Following the change introduced by PR llvm#83381, this patch extends it with the same treatment of the nnan fast-math flag. This is to address the performance drop caused by PR #83200, which prevented a vital InstCombine transformation due to the lack of the relevant fast-math flags. The PromoteMem2Reg utility is used by the SROA pass, where phi nodes are created. The proposed change allows propagation of the nnan flag down to these phi nodes.
…par with the nsz flag

Following the change introduced by PR llvm#83381, this patch extends it with the same treatment of the nnan and ninf fast-math flags. This is to address the performance drop caused by PR #83200, which prevented a vital InstCombine transformation due to the lack of the relevant fast-math flag. The PromoteMem2Reg utility is used by the SROA pass, where phi nodes are created. The proposed change allows propagation of the nnan and ninf flags down to these phi nodes.
It's expected that the sequence return X > 0.0 ? X : -X, compiled with -Ofast, produces the fabs intrinsic. However, at this point, LLVM is unable to do so. The above sequence goes through the following transformation during the pass pipeline:
1) The SROA pass generates the phi node. Here, it does not infer the fast-math flags on the phi node, unlike the clang frontend typically does.
2) The phi node eventually gets translated into a select instruction. Because of the missing no-signed-zeros (nsz) fast-math flag on the select instruction, the InstCombine pass fails to fold the sequence into the fabs intrinsic.

This patch, as a part of SROA, tries to propagate the nsz fast-math flag on the phi node using the function attribute, enabling this folding.

Closes #51601

Co-authored-by: Sushant Gokhale <sgokhale@nvidia.com>