-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-33861][SQL] Simplify conditional in predicate #30865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
Outdated
Show resolved
Hide resolved
retest this please. |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133125 has finished for PR 30865 at commit
|
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Test build #133231 has finished for PR 30865 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133270 has finished for PR 30865 at commit
|
...src/main/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalsInPredicate.scala
Show resolved
Hide resolved
case CaseWhen(Seq((cond, trueValue)), Some(TrueLiteral)) => | ||
Or(Not(cond), trueValue) | ||
case CaseWhen(Seq((cond, FalseLiteral)), elseValue) => | ||
And(Not(cond), elseValue.getOrElse(Literal(null, BooleanType))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elseValue.getOrElse(FalseLiteral)
case CaseWhen(Seq((cond, FalseLiteral)), elseValue) => | ||
And(Not(cond), elseValue.getOrElse(Literal(null, BooleanType))) | ||
case CaseWhen(Seq((cond, TrueLiteral)), elseValue) => | ||
Or(cond, elseValue.getOrElse(Literal(null, BooleanType))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
And(Not(cond), elseValue.getOrElse(Literal(null, BooleanType))) | ||
case CaseWhen(Seq((cond, TrueLiteral)), elseValue) => | ||
Or(cond, elseValue.getOrElse(Literal(null, BooleanType))) | ||
case e if e.dataType == BooleanType => e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add an assert. The analyzer should guarantee it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean something like
case e =>
assert(e.dataType != BooleanType, ...)
e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the added assert can never be triggered.
I think @cloud-fan meant also to remove the case e if e.dataType == BooleanType => e
...src/main/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalsInPredicate.scala
Outdated
Show resolved
Hide resolved
...src/main/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalsInPredicate.scala
Outdated
Show resolved
Hide resolved
Kubernetes integration test starting |
Test build #133327 has finished for PR 30865 at commit
|
Kubernetes integration test status failure |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133336 has finished for PR 30865 at commit
|
import org.apache.spark.sql.types.BooleanType | ||
|
||
/** | ||
* A rule that converting conditional expressions to predicate expressions, if possible, in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that converting
-> that converts
Kubernetes integration test starting |
Kubernetes integration test status success |
thanks, merging to master! |
Test build #133341 has finished for PR 30865 at commit
|
Thank you, @wangyum and @cloud-fan . |
And(cond, trueValue) | ||
case CaseWhen(Seq((cond, trueValue)), Some(TrueLiteral)) => | ||
Or(Not(cond), trueValue) | ||
case CaseWhen(Seq((_, FalseLiteral)), Some(FalseLiteral) | None) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This skips evaluating the condition, and we should make sure the condition is deterministic. @wangyum can you fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will fix it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and also it matters if cond
throws an exception. Should probably think about NoThrow
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to make these condition deterministic?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
Lines 295 to 298 in 26d8df3
case FalseLiteral And _ => FalseLiteral | |
case _ And FalseLiteral => FalseLiteral | |
case TrueLiteral Or _ => TrueLiteral | |
case _ Or TrueLiteral => TrueLiteral |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about this change. In ANSI dialect, conditions can easily throw exceptions, and this change could cause more conditions to be evaluated. E.g. If
In this case, the division filter can be pushed down into the scan of Another case is where we are able to push down both |
Yea, I think we need to take care of the "may fail" property when optimizing expressions, as well as the "deterministic" property. Many optimizer rules need to be revisited. |
… consider deterministic ### What changes were proposed in this pull request? This pr address #30865 (review) to fix simplify conditional in predicate should consider deterministic. ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #31067 from wangyum/SPARK-33861-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@cloud-fan Was there any followup work done to address the issue pointed out by @bart-samwel where some expressions may fail is evaluated without the condition being evaluated first? Seems like converting |
conditional expressions is the only way people can control side effects. e.g. @wangyum shall we revert this optimizer rule? cc @viirya @rednaxelafx @sigmod |
OK. revert pr: #37729 |
What changes were proposed in this pull request?
This pr simplify conditional in predicate, after this change we can push down the filter to datasource:
Why are the changes needed?
Improve query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test.