Skip to content

[SPARK-33861][SQL] Simplify conditional in predicate #30865

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed

[SPARK-33861][SQL] Simplify conditional in predicate #30865

wants to merge 8 commits into from

Conversation

wangyum
Copy link
Member

@wangyum wangyum commented Dec 20, 2020

What changes were proposed in this pull request?

This pr simplify conditional in predicate, after this change we can push down the filter to datasource:

Expression After simplify
IF(cond, trueVal, false) AND(cond, trueVal)
IF(cond, trueVal, true) OR(NOT(cond), trueVal)
IF(cond, false, falseVal) AND(NOT(cond), elseVal)
IF(cond, true, falseVal) OR(cond, elseVal)
CASE WHEN cond THEN trueVal ELSE false END AND(cond, trueVal)
CASE WHEN cond THEN trueVal END AND(cond, trueVal)
CASE WHEN cond THEN trueVal ELSE null END AND(cond, trueVal)
CASE WHEN cond THEN trueVal ELSE true END OR(NOT(cond), trueVal)
CASE WHEN cond THEN false ELSE elseVal END AND(NOT(cond), elseVal)
CASE WHEN cond THEN false END false
CASE WHEN cond THEN true ELSE elseVal END OR(cond, elseVal)
CASE WHEN cond THEN true END cond

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@wangyum wangyum marked this pull request as draft December 20, 2020 15:58
@github-actions github-actions bot added the SQL label Dec 20, 2020
@wangyum
Copy link
Member Author

wangyum commented Dec 21, 2020

retest this please.

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37724/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37724/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Test build #133125 has finished for PR 30865 at commit f786f18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum

This comment has been minimized.

@wangyum

This comment has been minimized.

@wangyum

This comment has been minimized.

@SparkQA
Copy link

SparkQA commented Dec 22, 2020

Test build #133231 has finished for PR 30865 at commit 448faf0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [WIP][SPARK-33861][SQL] Simplify conditional in predicate [SPARK-33861][SQL] Simplify conditional in predicate Dec 23, 2020
@wangyum wangyum marked this pull request as ready for review December 23, 2020 06:57
@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37864/

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37864/

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Test build #133270 has finished for PR 30865 at commit c90227a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case CaseWhen(Seq((cond, trueValue)), Some(TrueLiteral)) =>
Or(Not(cond), trueValue)
case CaseWhen(Seq((cond, FalseLiteral)), elseValue) =>
And(Not(cond), elseValue.getOrElse(Literal(null, BooleanType)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elseValue.getOrElse(FalseLiteral)

case CaseWhen(Seq((cond, FalseLiteral)), elseValue) =>
And(Not(cond), elseValue.getOrElse(Literal(null, BooleanType)))
case CaseWhen(Seq((cond, TrueLiteral)), elseValue) =>
Or(cond, elseValue.getOrElse(Literal(null, BooleanType)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

And(Not(cond), elseValue.getOrElse(Literal(null, BooleanType)))
case CaseWhen(Seq((cond, TrueLiteral)), elseValue) =>
Or(cond, elseValue.getOrElse(Literal(null, BooleanType)))
case e if e.dataType == BooleanType => e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add an assert. The analyzer should guarantee it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean something like

case e =>
  assert(e.dataType != BooleanType, ...)
  e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the added assert can never be triggered.
I think @cloud-fan meant also to remove the case e if e.dataType == BooleanType => e

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37919/

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Test build #133327 has finished for PR 30865 at commit cccbaf1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37919/

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37927/

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37927/

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Test build #133336 has finished for PR 30865 at commit 8bd9ef9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.types.BooleanType

/**
* A rule that converting conditional expressions to predicate expressions, if possible, in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that converting -> that converts

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37932/

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37932/

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 32d4a2b Dec 24, 2020
@wangyum wangyum deleted the SPARK-33861 branch December 24, 2020 08:12
@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Test build #133341 has finished for PR 30865 at commit 878beb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Thank you, @wangyum and @cloud-fan .

And(cond, trueValue)
case CaseWhen(Seq((cond, trueValue)), Some(TrueLiteral)) =>
Or(Not(cond), trueValue)
case CaseWhen(Seq((_, FalseLiteral)), Some(FalseLiteral) | None) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This skips evaluating the condition, and we should make sure the condition is deterministic. @wangyum can you fix it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will fix it later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also it matters if cond throws an exception. Should probably think about NoThrow too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make these condition deterministic?

case FalseLiteral And _ => FalseLiteral
case _ And FalseLiteral => FalseLiteral
case TrueLiteral Or _ => TrueLiteral
case _ Or TrueLiteral => TrueLiteral

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bart-samwel
Copy link

I'm worried about this change. In ANSI dialect, conditions can easily throw exceptions, and this change could cause more conditions to be evaluated. E.g. IF(a, b, FALSE) turns into a AND b. But what if b can throw an error? Multiple things can happen:

If b can be pushed down but a cannot, then the shortcut evaluation behavior of AND is lost, and b is evaluated for values where a is not true. That is a semantic regression, because IF can be used to guard against such errors. An example case would be:

SELECT ...
FROM stores s JOIN store_revenues r ON s.id = r.store_id
WHERE IF(s.has_customers, r.revenue / r.num_customers > 0, FALSE)

In this case, the division filter can be pushed down into the scan of store_revenues, but it will cause division by zero.

Another case is where we are able to push down both a and b, and somehow the conjunctions get reordered in the process. E.g. IF(r.num_customers > 0, r.revenue / r.num_customers > 0, FALSE) could turn into r.revenue / r.num_customers > 0 AND r.num_customers > 0, doing the division by zero first. This reordering is not possible with IF, because that expression semantically enforces the ordering.

@cloud-fan
Copy link
Contributor

Yea, I think we need to take care of the "may fail" property when optimizing expressions, as well as the "deterministic" property. Many optimizer rules need to be revisited.

HyukjinKwon pushed a commit that referenced this pull request Jan 7, 2021
… consider deterministic

### What changes were proposed in this pull request?

This pr address #30865 (review) to fix simplify conditional in predicate should consider deterministic.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31067 from wangyum/SPARK-33861-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@shardulm94
Copy link
Contributor

@cloud-fan Was there any followup work done to address the issue pointed out by @bart-samwel where some expressions may fail is evaluated without the condition being evaluated first? Seems like converting IF to AND is not safe.

@cloud-fan
Copy link
Contributor

conditional expressions is the only way people can control side effects. e.g. IF(cond, trueExpr, falseExpr) guarantees that trueExpr will only be evaluated if condition is true. This optimization breaks it and the benefits do not seem significant.

@wangyum shall we revert this optimizer rule? cc @viirya @rednaxelafx @sigmod

@wangyum
Copy link
Member Author

wangyum commented Aug 30, 2022

OK. revert pr: #37729

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants