-
Notifications
You must be signed in to change notification settings - Fork 13.7k
[InstCombine] Avoid folding select(umin(X, Y), X)
with min/max values in false arm
#143020
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[InstCombine] Avoid folding select(umin(X, Y), X)
with min/max values in false arm
#143020
Conversation
select(umin(X, Y), X)
with non-constant mask
@llvm/pr-subscribers-llvm-transforms Author: Konstantin Bogdanov (thevar1able) ChangesFixes #139050. Please suggest a better fix if I'm doing something wrong, I have no understanding of InstCombine optimization. The issue is that the following snippet:
is being optimized to the following code, which can not be vectorized later.
Expected:
https://godbolt.org/z/cYMheKE5r Full diff: https://github.com/llvm/llvm-project/pull/143020.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
index cfb4af391b540..930819d24393a 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -1654,6 +1654,25 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
if (Value *FreedOp = getFreedOperand(&CI, &TLI))
return visitFree(CI, FreedOp);
+ if (Function *F = CI.getCalledFunction()) {
+ if (F->getIntrinsicID() == Intrinsic::umin || F->getIntrinsicID() == Intrinsic::umax) {
+ for (Value *Arg : CI.args()) {
+ auto *SI = dyn_cast<SelectInst>(Arg);
+ if (!SI)
+ continue;
+
+ auto *TrueC = dyn_cast<Constant>(SI->getTrueValue());
+ auto *FalseC = dyn_cast<Constant>(SI->getFalseValue());
+
+ // Block only if the select is masking, e.g. select(cond, val, -1)
+ if ((TrueC && TrueC->isAllOnesValue()) || (FalseC && FalseC->isAllOnesValue())) {
+ LLVM_DEBUG(dbgs() << "InstCombine: skipping umin/umax folding for masked select\n");
+ return nullptr;
+ }
+ }
+ }
+ }
+
// If the caller function (i.e. us, the function that contains this CallInst)
// is nounwind, mark the call as nounwind, even if the callee isn't.
if (CI.getFunction()->doesNotThrow() && !CI.doesNotThrow()) {
diff --git a/llvm/test/Transforms/InstCombine/select.ll b/llvm/test/Transforms/InstCombine/select.ll
index e16f6ad2cfc9b..09cb84cde07ca 100644
--- a/llvm/test/Transforms/InstCombine/select.ll
+++ b/llvm/test/Transforms/InstCombine/select.ll
@@ -5047,3 +5047,18 @@ define <2 x ptr> @select_freeze_constant_expression_vector_gep(i1 %cond, <2 x pt
%sel = select i1 %cond, <2 x ptr> %y, <2 x ptr> %freeze
ret <2 x ptr> %sel
}
+
+declare i8 @llvm.umin.i8(i8, i8)
+
+define i8 @no_fold_masked_min(i8 %acc, i8 %val, i8 %mask) {
+; CHECK-LABEL: @no_fold_masked_min(
+; CHECK-NEXT: [[COND:%.*]] = icmp eq i8 [[MASK:%.*]], 0
+; CHECK-NEXT: [[MASKED_VAL:%.*]] = select i1 [[COND:%.*]], i8 [[VAL:%.*]], i8 -1
+; CHECK-NEXT: [[RES:%.*]] = call i8 @llvm.umin.i8(i8 [[ACC:%.*]], i8 [[MASKED_VAL:%.*]])
+; CHECK-NEXT: ret i8 [[RES]]
+;
+ %cond = icmp eq i8 %mask, 0
+ %masked_val = select i1 %cond, i8 %val, i8 -1
+ %res = call i8 @llvm.umin.i8(i8 %acc, i8 %masked_val)
+ ret i8 %res
+}
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does one vectorize and not the other?
✅ With the latest revision this PR passed the C/C++ code formatter. |
|
I see. The issue has no relation to the specific form of the select here, the problem is that we are breaking up a reduction pattern. |
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fold you are actually interested in is located at
llvm-project/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
Lines 3870 to 3879 in a19889c
// Try to fold intrinsic into select operands. This is legal if: | |
// * The intrinsic is speculatable. | |
// * The select condition is not a vector, or the intrinsic does not | |
// perform cross-lane operations. | |
if (isSafeToSpeculativelyExecuteWithVariableReplaced(&CI) && | |
isNotCrossLaneOperation(II)) | |
for (Value *Op : II->args()) | |
if (auto *Sel = dyn_cast<SelectInst>(Op)) | |
if (Instruction *R = FoldOpIntoSelect(*II, Sel)) | |
return R; |
Though I'm not entirely sure precisely what we should be restricting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it'd be better to revert #116073?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it'd be better to revert #116073?
It may take more time to evaluate performance regression caused by the revert...
I'd like to disable this fold when the binary intrinsic is a reduction. We have a similar helper matchSimpleRecurrence
in ValueTracking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to disable this fold when the binary intrinsic is a reduction. We have a similar helper
matchSimpleRecurrence
in ValueTracking.
I agree that testing specifically for a recurrence would be best.
I think a simple check would look something like this (untested):
if (isa<MinMaxIntrinsic>(&Op))
for (Value *IntrinOp : Op.operands())
if (auto *PN = dyn_cast<PHINode>(IntrinOp))
for (Value *PhiOp = PN->operands())
if (PhiOp == &Op)
return nullptr;
select(umin(X, Y), X)
with non-constant maskselect(umin(X, Y), X)
with min/max values in false arm
I've landed a phase ordering test in cef5a31. Please rebase & regenerate when updating this PR. |
bfeceb2
to
69eaf81
Compare
@nikic The test you added now fails, what's the next step? |
Regenerate it using update_test_checks.py and commit the result. |
Why is this reverted back to the old code again? |
@nikic I wasn't able to make your snippet work, sorry. |
@thevar1able Doesn't exactly the code you had work? I tried it and the test was vectorized: https://gist.github.com/nikic/4cf44aac838daa41be953c8a7d06920c |
@nikic No, it did not. I didn't realize at first that update_test_checks.py was picking up my |
@nikic Sorry, I was wrong. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Is it ready for merge? |
Thank you! I see 20.1.7 was released just a couple hours before merge; is 20.1.8 expected, or should I expect this to land in 21? |
…es in false arm (llvm#143020) Fixes llvm#139050. This patch adds a check to avoid folding min/max reduction into select, which may block loop vectorization. The issue is that the following snippet: ``` declare i8 @llvm.umin.i8(i8, i8) define i8 @masked_min_fold_bug(i8 %acc, i8 %val, i8 %mask) { ; CHECK-LABEL: @masked_min_fold_bug( ; CHECK: %cond = icmp eq i8 %mask, 0 ; CHECK: %masked_val = select i1 %cond, i8 %val, i8 255 ; CHECK: call i8 @llvm.umin.i8(i8 %acc, i8 %masked_val) ; %cond = icmp eq i8 %mask, 0 %masked_val = select i1 %cond, i8 %val, i8 255 %res = call i8 @llvm.umin.i8(i8 %acc, i8 %masked_val) ret i8 %res } ``` is being optimized to the following code, which can not be vectorized later. ``` declare i8 @llvm.umin.i8(i8, i8) #0 define i8 @masked_min_fold_bug(i8 %acc, i8 %val, i8 %mask) { %cond = icmp eq i8 %mask, 0 %1 = call i8 @llvm.umin.i8(i8 %acc, i8 %val) %res = select i1 %cond, i8 %1, i8 %acc ret i8 %res } attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) } ``` Expected: ``` declare i8 @llvm.umin.i8(i8, i8) #0 define i8 @masked_min_fold_bug(i8 %acc, i8 %val, i8 %mask) { %cond = icmp eq i8 %mask, 0 %masked_val = select i1 %cond, i8 %val, i8 -1 %res = call i8 @llvm.umin.i8(i8 %acc, i8 %masked_val) ret i8 %res } attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) } ``` https://godbolt.org/z/cYMheKE5r
Fixes #139050.
Please suggest a better fix if I'm doing something wrong, I have no understanding of InstCombine optimization.
Missing optimization is a blocker for switching ClickHouse build to
clang-20
.The issue is that the following snippet:
is being optimized to the following code, which can not be vectorized later.
Expected:
https://godbolt.org/z/cYMheKE5r