Skip to content

[SPARK-32721][SQL] Simplify if clauses with null and boolean #29567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented Aug 28, 2020

What changes were proposed in this pull request?

The following if clause:

if(p, null, false)

can be simplified to:

and(p, null)

Similarly, the clause:

if(p, null, true)

can be simplified to

or(not(p), null)

iff the predicate p is non-nullable, i.e., can be evaluated to either true or false, but not null.

Why are the changes needed?

Converting if to or/and clauses can better push filters down.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

@SparkQA
Copy link

SparkQA commented Aug 28, 2020

Test build #127976 has finished for PR 29567 at commit f9e87f0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tanelk
Copy link
Contributor

tanelk commented Aug 28, 2020

Are you sure these rules are correct?
The first one seems to fail if p is null
And the seconds one seems to fail on all 3 values

I tried a quick test and 4 out of 6 of these failed.

  val nullLiteral = Literal(null, BooleanType)

  Seq(Literal.TrueLiteral, Literal.FalseLiteral, nullLiteral).foreach(p => {
    assertEquivalent(If(p, nullLiteral, Literal.FalseLiteral), And(p, nullLiteral))
    assertEquivalent(If(p, nullLiteral, Literal.TrueLiteral), Or(p, nullLiteral))
  })

  private def assertEquivalent(e1: Expression, e2: Expression): Unit = {
    test("SPARK-32721: " + e1.toString) {
      checkEvaluation (e1, e2.eval (EmptyRow) )
    }
  }

It seem that there might be some sort of equivalence here, but I'm not 100% sure how it should look like.

@maropu
Copy link
Member

maropu commented Aug 28, 2020

I don't look into it yet, but the motivation is the same with #27231 ?

@sunchao
Copy link
Member Author

sunchao commented Aug 28, 2020

Thanks for checking. I made a few mistakes in the draft PR. Firstly, the second rule should be:

if(p, null, true) => or(not(p), null)

Secondly, the predicate p should be deterministic and evaluate to either true or false, but not null. I'll see how we can put restriction that, and add more tests.

Since three-valued logic has the following properties:

a b a and b a or b
true null null true
false null false null

I think the above should be semantically equivalent.

@tanelk
Copy link
Contributor

tanelk commented Aug 28, 2020

I think it should be correct now, but it would be nice to have some checks in the tests.

Are there equivalent cases for if(p, false, null) and if(p, true, null)?

Perhaps instead of a new Rule the cases could be added to the existing SimplifyConditionals?

val nullLiteral = Literal(null, BooleanType)
def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
case If(p, Literal(null, _), FalseLiteral) => And(p, nullLiteral)
case If(p, Literal(null, _), TrueLiteral) => Or(p, nullLiteral)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it common enough to write an if with a null with returning a bool? Adding a rule isn't free and it gives some overhead to all the queries basically during the planning. I would avoid doing it if it's pretty unlikely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about merging these to the existing SimplifyConditionals as mentioned by @tanelk above? it will then just be a few more pattern matchings, right?

@SparkQA
Copy link

SparkQA commented Aug 28, 2020

Test build #127998 has finished for PR 29567 at commit e4d929c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 28, 2020

Test build #127999 has finished for PR 29567 at commit bb7c38c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 29, 2020

Test build #128002 has finished for PR 29567 at commit cc66198.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao changed the title [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean [SPARK-32721][SQL] Simplify if clauses with null and boolean Aug 29, 2020
@sunchao
Copy link
Member Author

sunchao commented Aug 29, 2020

cc @dbtsai @viirya

@dbtsai
Copy link
Member

dbtsai commented Aug 29, 2020

This will be a very useful optimization to turn if causes into expressions that can be pushdown. For CaseWhen if there is only one branch, we can convert it into If. A PR awhile ago #21850 implemented this optimization but didn't end up merging into master because at that time, CaseWhen and If have not much performance difference. Maybe we can revisit that again.

testRelationWithData.where(If('e.isNull, Literal.create(null, BooleanType), false)).analyze)
testRelationWithData.where(And(Literal(null, BooleanType), 'e.isNull)).analyze)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I think it is better to test BooleanSimplification like original, but BooleanSimplificationSuite's optimizer mixes SimplifyConditionals and BooleanSimplification.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. What is the recommended way for adding test suite for optimizers? should each test suite only use the corresponding optimizer rule, or it's OK to also add those relevant ones?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we only test corresponding rule in each test suite. But some suites use multiple rules like this one. For this one, although we can still only test BooleanSimplification if we add another optimizer, but seems too much. Maybe currently it's good enough. Maybe add a comment.

}

// check evaluation also
Seq(TrueLiteral, FalseLiteral).foreach { p =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use the same name in the foreach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@dbtsai
Copy link
Member

dbtsai commented Aug 30, 2020

LGTM except couple minor comments. +@dongjoon-hyun for another round of review before we merge it. Thanks.

@SparkQA
Copy link

SparkQA commented Aug 30, 2020

Test build #128040 has finished for PR 29567 at commit 60e0e92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Aug 30, 2020

Looks okay. Leave to @dongjoon-hyun for another check.

@dongjoon-hyun
Copy link
Member

Thank you, @sunchao , @dbtsai , @viirya .

cc @gatorsmile since he expressed his concerns at that time.

@dbtsai
Copy link
Member

dbtsai commented Aug 31, 2020

cc @gatorsmile since he expressed his concerns at that time.

I think the concern @gatorsmile had was in #21850, converting CaseWhen to If, we will not get too much performance gain. However, this PR is trying to convert If to And or Or when possible, so we are able to pushdown more into data source. With this merged, we could revisit #21850 since it opens up the possibility that when we have one branch in CaseWhen, we can potentially optimize it to And or Or.

@SparkQA
Copy link

SparkQA commented Aug 31, 2020

Test build #128086 has finished for PR 29567 at commit 2b90c19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Copy link
Member

dbtsai commented Aug 31, 2020

Merged into master. Thanks!

@SparkQA
Copy link

SparkQA commented Sep 1, 2020

Test build #128118 has finished for PR 29567 at commit 71bd033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -463,6 +463,8 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
case If(Literal(null, _), _, falseValue) => falseValue
case If(cond, trueValue, falseValue)
if cond.deterministic && trueValue.semanticEquals(falseValue) => trueValue
case If(cond, l @ Literal(null, _), FalseLiteral) if !cond.nullable => And(cond, l)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the other way around? e.g. case If(cond, FalseLiteral, l @ Literal(null, _))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah it's covered in the followup #29603

dbtsai pushed a commit that referenced this pull request Sep 1, 2020
### What changes were proposed in this pull request?

This is a follow-up on SPARK-32721 and PR #29567. In the previous PR we missed two more cases that can be optimized:
```
if(p, false, null) ==> and(not(p), null)
if(p, true, null) ==> or(p, null)
```

### Why are the changes needed?

By transforming if to boolean conjunctions or disjunctions, we can enable more filter pushdown to datasources.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests.

Closes #29603 from sunchao/SPARK-32721-2.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
### What changes were proposed in this pull request?

The following if clause:
```sql
if(p, null, false)
```
can be simplified to:
```sql
and(p, null)
```
Similarly, the clause:
```sql
if(p, null, true)
```
can be simplified to
```sql
or(not(p), null)
```
iff the predicate `p` is non-nullable, i.e., can be evaluated to either true or false, but not null.

### Why are the changes needed?

Converting if to or/and clauses can better push filters down.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes apache#29567 from sunchao/SPARK-32721.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
### What changes were proposed in this pull request?

This is a follow-up on SPARK-32721 and PR apache#29567. In the previous PR we missed two more cases that can be optimized:
```
if(p, false, null) ==> and(not(p), null)
if(p, true, null) ==> or(p, null)
```

### Why are the changes needed?

By transforming if to boolean conjunctions or disjunctions, we can enable more filter pushdown to datasources.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests.

Closes apache#29603 from sunchao/SPARK-32721-2.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants