-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-32721][SQL] Simplify if clauses with null and boolean #29567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #127976 has finished for PR 29567 at commit
|
Are you sure these rules are correct? I tried a quick test and 4 out of 6 of these failed.
It seem that there might be some sort of equivalence here, but I'm not 100% sure how it should look like. |
I don't look into it yet, but the motivation is the same with #27231 ? |
Thanks for checking. I made a few mistakes in the draft PR. Firstly, the second rule should be:
Secondly, the predicate Since three-valued logic has the following properties:
I think the above should be semantically equivalent. |
I think it should be correct now, but it would be nice to have some checks in the tests. Are there equivalent cases for Perhaps instead of a new |
val nullLiteral = Literal(null, BooleanType) | ||
def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions { | ||
case If(p, Literal(null, _), FalseLiteral) => And(p, nullLiteral) | ||
case If(p, Literal(null, _), TrueLiteral) => Or(p, nullLiteral) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it common enough to write an if with a null
with returning a bool? Adding a rule isn't free and it gives some overhead to all the queries basically during the planning. I would avoid doing it if it's pretty unlikely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about merging these to the existing SimplifyConditionals
as mentioned by @tanelk above? it will then just be a few more pattern matchings, right?
Test build #127998 has finished for PR 29567 at commit
|
Test build #127999 has finished for PR 29567 at commit
|
Test build #128002 has finished for PR 29567 at commit
|
This will be a very useful optimization to turn |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
Outdated
Show resolved
Hide resolved
testRelationWithData.where(If('e.isNull, Literal.create(null, BooleanType), false)).analyze) | ||
testRelationWithData.where(And(Literal(null, BooleanType), 'e.isNull)).analyze) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I think it is better to test BooleanSimplification
like original, but BooleanSimplificationSuite
's optimizer mixes SimplifyConditionals
and BooleanSimplification
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. What is the recommended way for adding test suite for optimizers? should each test suite only use the corresponding optimizer rule, or it's OK to also add those relevant ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we only test corresponding rule in each test suite. But some suites use multiple rules like this one. For this one, although we can still only test BooleanSimplification
if we add another optimizer, but seems too much. Maybe currently it's good enough. Maybe add a comment.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
Outdated
Show resolved
Hide resolved
...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala
Outdated
Show resolved
Hide resolved
} | ||
|
||
// check evaluation also | ||
Seq(TrueLiteral, FalseLiteral).foreach { p => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use the same name in the foreach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
LGTM except couple minor comments. +@dongjoon-hyun for another round of review before we merge it. Thanks. |
Test build #128040 has finished for PR 29567 at commit
|
Looks okay. Leave to @dongjoon-hyun for another check. |
Thank you, @sunchao , @dbtsai , @viirya . cc @gatorsmile since he expressed his concerns at that time. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
Show resolved
Hide resolved
I think the concern @gatorsmile had was in #21850, converting |
...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala
Show resolved
Hide resolved
Test build #128086 has finished for PR 29567 at commit
|
...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala
Outdated
Show resolved
Hide resolved
Merged into master. Thanks! |
Test build #128118 has finished for PR 29567 at commit
|
@@ -463,6 +463,8 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper { | |||
case If(Literal(null, _), _, falseValue) => falseValue | |||
case If(cond, trueValue, falseValue) | |||
if cond.deterministic && trueValue.semanticEquals(falseValue) => trueValue | |||
case If(cond, l @ Literal(null, _), FalseLiteral) if !cond.nullable => And(cond, l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about the other way around? e.g. case If(cond, FalseLiteral, l @ Literal(null, _))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah it's covered in the followup #29603
### What changes were proposed in this pull request? This is a follow-up on SPARK-32721 and PR #29567. In the previous PR we missed two more cases that can be optimized: ``` if(p, false, null) ==> and(not(p), null) if(p, true, null) ==> or(p, null) ``` ### Why are the changes needed? By transforming if to boolean conjunctions or disjunctions, we can enable more filter pushdown to datasources. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests. Closes #29603 from sunchao/SPARK-32721-2. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request? The following if clause: ```sql if(p, null, false) ``` can be simplified to: ```sql and(p, null) ``` Similarly, the clause: ```sql if(p, null, true) ``` can be simplified to ```sql or(not(p), null) ``` iff the predicate `p` is non-nullable, i.e., can be evaluated to either true or false, but not null. ### Why are the changes needed? Converting if to or/and clauses can better push filters down. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes apache#29567 from sunchao/SPARK-32721. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request? This is a follow-up on SPARK-32721 and PR apache#29567. In the previous PR we missed two more cases that can be optimized: ``` if(p, false, null) ==> and(not(p), null) if(p, true, null) ==> or(p, null) ``` ### Why are the changes needed? By transforming if to boolean conjunctions or disjunctions, we can enable more filter pushdown to datasources. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests. Closes apache#29603 from sunchao/SPARK-32721-2. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com>
What changes were proposed in this pull request?
The following if clause:
if(p, null, false)
can be simplified to:
Similarly, the clause:
if(p, null, true)
can be simplified to
iff the predicate
p
is non-nullable, i.e., can be evaluated to either true or false, but not null.Why are the changes needed?
Converting if to or/and clauses can better push filters down.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.