[SPARK-33938][SQL] Optimize Like Any/All by LikeSimplification #30975

Closed
beliefer wants to merge 55 commits

beliefer
Contributor

What changes were proposed in this pull request?

This PR extends the LikeSimplification optimizer rule to cover Like Any/All expressions, which improves performance.
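
As a rough illustration of the kind of rewrite this refers to, the sketch below classifies a single LIKE pattern into a cheaper string operation. It is a simplified, standalone model and not the code in this PR: the names `Simplified`, `StartsWith`, `EndsWith`, `Contains`, `EqualTo`, and `LikeSketch` are hypothetical, and it assumes that any pattern containing the escape character or `_` is left alone.

```scala
// Simplified, standalone model of LIKE-pattern simplification.
// All names here are hypothetical; this is not Spark's implementation.
sealed trait Simplified
final case class StartsWith(prefix: String) extends Simplified
final case class EndsWith(suffix: String) extends Simplified
final case class Contains(infix: String) extends Simplified
final case class EqualTo(value: String) extends Simplified

object LikeSketch {
  // Scala regex extractors must match the whole input string.
  private val StartsWithPattern = "([^_%]+)%".r
  private val EndsWithPattern = "%([^_%]+)".r
  private val ContainsPattern = "%([^_%]+)%".r
  private val EqualToPattern = "([^_%]*)".r

  // Returns None when the pattern cannot be rewritten in this model.
  def simplify(pattern: String, escapeChar: Char = '\\'): Option[Simplified] =
    if (pattern.exists(_ == escapeChar)) None // assumption: skip escaped patterns
    else pattern match {
      case StartsWithPattern(prefix) => Some(StartsWith(prefix))
      case EndsWithPattern(suffix)   => Some(EndsWith(suffix))
      case ContainsPattern(infix)    => Some(Contains(infix))
      case EqualToPattern(value)     => Some(EqualTo(value))
      case _                         => None
    }
}
```

Under these assumptions, `LikeSketch.simplify("abc%")` gives `Some(StartsWith("abc"))`, while a pattern such as `"a_c"` gives `None` and would stay a regular LIKE.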

Why are the changes needed?

Optimize Like Any/All

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Jenkins test.

beliefer and others added 30 commits June 19, 2020 10:36

SparkQA commented Jan 5, 2021

Test build #133635 has finished for PR 30975 at commit fec4fe5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class SequenceFileRDDFunctions[K: IsWritable: ClassTag, V: IsWritable: ClassTag](
      • case class ResolvedTable(
      • case class StringTrim(srcStr: Expression, trimStr: Option[Expression] = None)
      • case class StringTrimLeft(srcStr: Expression, trimStr: Option[Expression] = None)
      • case class StringTrimRight(srcStr: Expression, trimStr: Option[Expression] = None)
      • case class DescribeColumnExec(

SparkQA commented Jan 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38208/

SparkQA commented Jan 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38208/

SparkQA commented Jan 5, 2021

Test build #133619 has finished for PR 30975 at commit b6ba902.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38226/

SparkQA commented Jan 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38226/

@@ -504,4 +504,8 @@ object QueryCompilationErrors {
  def columnDoesNotExistError(colName: String): Throwable = {
    new AnalysisException(s"Column $colName does not exist")
  }

  def cannotSimplifyMultiLikeError(multi: MultiLikeBase): Throwable = {
Contributor

We can remove it now.

SparkQA commented Jan 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38243/

SparkQA commented Jan 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38243/

SparkQA commented Jan 5, 2021

Test build #133641 has finished for PR 30975 at commit 52ecedd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 5, 2021

Test build #133669 has finished for PR 30975 at commit e2c7dfe.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 5, 2021

Test build #133654 has finished for PR 30975 at commit 454a68b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • sealed trait LikeAllBase extends MultiLikeBase
      • sealed trait LikeAnyBase extends MultiLikeBase

SparkQA commented Jan 5, 2021

Test build #133665 has finished for PR 30975 at commit 8f26d1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • sealed abstract class LikeAllBase extends MultiLikeBase
      • sealed abstract class LikeAnyBase extends MultiLikeBase

SparkQA commented Jan 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38270/

SparkQA commented Jan 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38270/

SparkQA commented Jan 5, 2021

Test build #133681 has finished for PR 30975 at commit 6acf1c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The optimization was already there before we added the LikeAll expression in #29999, so this PR fixes a perf regression in 3.1. Thanks, merging to master/3.1!

cloud-fan closed this in 26d8df3 Jan 6, 2021
@cloud-fan
Contributor

It has conflicts with 3.1, @beliefer can you create a backport PR?

Contributor Author

beliefer commented Jan 6, 2021

> It has conflicts with 3.1, @beliefer can you create a backport PR?

OK.

Contributor Author

beliefer commented Jan 6, 2021

@cloud-fan Thanks for your work!

  multi
} else {
  multi match {
    case l: LikeAll => And(replacements.reduceLeft(And), l.copy(patterns = remainPatterns))
Member

It may cause StackOverflowError.

scala> spark.sql("drop table SPARK_33938")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table SPARK_33938(id string) using parquet")
res7: org.apache.spark.sql.DataFrame = []

scala> val values = Range(1, 10000)
values: scala.collection.immutable.Range = Range 1 until 10000

scala> spark.sql(s"select * from SPARK_33938 where id like all (${values.map(s => s"'$s'").mkString(", ")})").show
java.lang.StackOverflowError
  at java.lang.ThreadLocal.set(ThreadLocal.java:201)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.set(TreeNode.scala:62)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
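
To make the failure mode concrete, here is a minimal standalone sketch of why `reduceLeft(And)` over many replacements is risky. It uses a toy expression type rather than Catalyst classes: `Expr`, `Leaf`, `And`, and `DepthSketch` are hypothetical names for illustration only. The point it shows is that a left fold over n leaves produces a tree of depth n, and a recursive traversal such as the `transformDown` in the trace above needs stack frames for every level.

```scala
// Toy expression tree; hypothetical names, not Catalyst's classes.
sealed trait Expr
case object Leaf extends Expr
final case class And(left: Expr, right: Expr) extends Expr

object DepthSketch {
  // Build the same shape that reduceLeft(And) produces: And(And(And(a, b), c), d) ...
  def leftNested(n: Int): Expr =
    Seq.fill[Expr](n)(Leaf).reduceLeft[Expr]((l, r) => And(l, r))

  // Depth is computed with an explicit work list so that the measurement
  // itself cannot overflow the stack on very deep trees.
  def depth(root: Expr): Int = {
    var maxDepth = 0
    var pending: List[(Expr, Int)] = List((root, 1))
    while (pending.nonEmpty) {
      val (node, d) = pending.head
      pending = pending.tail
      if (d > maxDepth) maxDepth = d
      node match {
        case And(l, r) => pending = (l, d + 1) :: (r, d + 1) :: pending
        case Leaf      => ()
      }
    }
    maxDepth
  }
}
```

In this model, 9999 leaves (the size of `Range(1, 10000)` in the reproduction above) give a tree of depth 9999, so the depth grows linearly with the number of patterns that were rewritten.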

Contributor Author

@wangyum I will fix this issue.

Contributor Author

beliefer commented Jan 7, 2021

For example, take patterns a, b, c, d, e, and f, and suppose a, b, c, and d can be optimized to startsWith. With the current logic, the result is startsWith(a) & startsWith(b) & startsWith(c) & startsWith(d) & LikeAll(e, f). The nesting of the And nodes is not shown here.
We can use a threshold to limit the number of patterns that are optimized. For example, if only two patterns may be optimized, the result is startsWith(a) & startsWith(b) & LikeAll(c, d, e, f).
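
A minimal sketch of that threshold split, assuming plain strings for patterns and treating only `prefix%` patterns as optimizable. `ThresholdSketch`, `isSimplifiable`, and `split` are hypothetical names used for illustration and are not part of this PR.

```scala
// Hypothetical illustration of the threshold idea; not the code in this PR.
object ThresholdSketch {
  // Assumption: only "prefix%" patterns (no '_' and no other '%') count as optimizable.
  private val PrefixPattern = "[^_%]+%".r

  def isSimplifiable(pattern: String): Boolean =
    PrefixPattern.pattern.matcher(pattern).matches()

  // Returns (patterns to rewrite as startsWith, patterns left inside LikeAll).
  // At most `threshold` patterns are rewritten; the rest stay in the multi-like.
  def split(patterns: Seq[String], threshold: Int): (Seq[String], Seq[String]) = {
    val (simple, complex) = patterns.partition(isSimplifiable)
    val (rewritten, overflow) = simple.splitAt(threshold)
    (rewritten, overflow ++ complex)
  }
}
```

With patterns `a%`, `b%`, `c%`, `d%`, `e_x`, `f_x` and a threshold of 2, this returns `a%` and `b%` for rewriting and keeps `c%`, `d%`, `e_x`, `f_x` in the remaining LikeAll, which matches the shape described above. Capping the rewritten patterns also caps the depth of the And chain built from them.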
