[SPARK-33938][SQL] Optimize Like Any/All by LikeSimplification #30975
@@ -504,4 +504,8 @@ object QueryCompilationErrors {
  def columnDoesNotExistError(colName: String): Throwable = {
    new AnalysisException(s"Column $colName does not exist")
  }

  def cannotSimplifyMultiLikeError(multi: MultiLikeBase): Throwable = {
we can remove it now.
The optimization was already there before we add the
It has conflicts with 3.1, @beliefer can you create a backport PR?
OK.
@cloud-fan Thanks for your work!
  multi
} else {
  multi match {
    case l: LikeAll => And(replacements.reduceLeft(And), l.copy(patterns = remainPatterns))
It may cause a StackOverflowError when the pattern list is large:
scala> spark.sql("drop table SPARK_33938")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.sql("create table SPARK_33938(id string) using parquet")
res7: org.apache.spark.sql.DataFrame = []
scala> val values = Range(1, 10000)
values: scala.collection.immutable.Range = Range 1 until 10000
scala> spark.sql(s"select * from SPARK_33938 where id like all (${values.map(s => s"'$s'").mkString(", ")})").show
java.lang.StackOverflowError
at java.lang.ThreadLocal.set(ThreadLocal.java:201)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.set(TreeNode.scala:62)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
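One plausible reading of the trace above is that `replacements.reduceLeft(And)` builds a left-nested chain of `And` nodes whose depth grows with the number of rewritten patterns, and the recursive `transformDown` then needs one group of stack frames per level. The sketch below illustrates only that shape. `Expr`, `Leaf`, `And`, `chain`, and `leftDepth` are hypothetical stand-ins defined here, not Spark's Catalyst classes.

```scala
object DeepTreeSketch {
  // Hypothetical mini-AST, not Catalyst's Expression hierarchy.
  sealed trait Expr
  final case class Leaf(name: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Combine n leaves the same way reduceLeft(And) would:
  // And(And(And(l1, l2), l3), l4) ...
  def chain(n: Int): Expr =
    (1 to n).map(i => Leaf(i.toString): Expr).reduceLeft[Expr](And(_, _))

  // Count the levels along the left spine with a loop, so that the
  // measurement itself does not recurse.
  def leftDepth(e: Expr): Int = {
    var depth = 1
    var cur = e
    var done = false
    while (!done) {
      cur match {
        case And(l, _) =>
          depth += 1
          cur = l
        case _ =>
          done = true
      }
    }
    depth
  }
}
```

With the 9999 patterns from the reproduction, `DeepTreeSketch.leftDepth(DeepTreeSketch.chain(9999))` reports a spine as deep as the pattern list, so any traversal that recurses once per level needs a correspondingly deep call stack.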
@wangyum I will fix this issue.
For example, take patterns a, b, c, d, e, and f, and suppose a, b, c, and d can be optimized with startsWith. Under the current logic, the result is startsWith(a) & startsWith(b) & startsWith(c) & startsWith(d) & LikeAll(e, f).
The nesting of the And nodes is not shown here.
We can use a threshold to cap the number of patterns that are optimized. For example, if only two patterns may be optimized, the result is startsWith(a) & startsWith(b) & LikeAll(c, d, e, f).
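A minimal sketch of the threshold idea described above, assuming patterns are plain strings and that some predicate decides whether a pattern can be rewritten. The names `split`, `canSimplify`, and `threshold` are hypothetical and are not taken from the PR.

```scala
import scala.collection.mutable.ArrayBuffer

object ThresholdSketch {
  // Returns (patterns to rewrite individually, patterns left in the multi-LIKE).
  // At most `threshold` patterns are rewritten; every other pattern, including
  // simplifiable ones beyond the cap, stays in the remaining list.
  def split(
      patterns: Seq[String],
      canSimplify: String => Boolean,
      threshold: Int): (Seq[String], Seq[String]) = {
    val rewritten = ArrayBuffer.empty[String]
    val remaining = ArrayBuffer.empty[String]
    patterns.foreach { p =>
      if (rewritten.size < threshold && canSimplify(p)) rewritten += p
      else remaining += p
    }
    (rewritten.toSeq, remaining.toSeq)
  }
}
```

For the example in the comment, `ThresholdSketch.split(Seq("a", "b", "c", "d", "e", "f"), Set("a", "b", "c", "d"), 2)` puts a and b in the first group and c, d, e, f in the second, which bounds the depth of the resulting And chain by the threshold.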
What changes were proposed in this pull request?
We should optimize Like Any/All by LikeSimplification to improve performance.
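As a rough illustration of the kind of rewrite involved, the sketch below classifies a single LIKE pattern into a cheaper string predicate when it has a simple shape. It is a simplified stand-in under stated assumptions: the predicate types are hypothetical case classes defined here (not Catalyst expressions), the escape character is assumed to be a backslash, and any pattern containing `_` or the escape character is left as a regular LIKE.

```scala
object LikeSimplificationSketch {
  // Hypothetical result types, not Spark's Catalyst classes.
  sealed trait Pred
  final case class StartsWith(prefix: String) extends Pred
  final case class EndsWith(suffix: String) extends Pred
  final case class Contains(infix: String) extends Pred
  final case class EqualTo(value: String) extends Pred
  final case class Like(pattern: String) extends Pred

  // A plain piece has no wildcard and no (assumed) escape character.
  private def isPlain(s: String): Boolean =
    !s.exists(c => c == '%' || c == '_' || c == '\\')

  def simplify(pattern: String): Pred = {
    val n = pattern.length
    if (isPlain(pattern)) {
      EqualTo(pattern)                              // 'abc'
    } else if (n >= 3 && pattern.startsWith("%") && pattern.endsWith("%") &&
        isPlain(pattern.substring(1, n - 1))) {
      Contains(pattern.substring(1, n - 1))         // '%abc%'
    } else if (n >= 2 && pattern.endsWith("%") &&
        isPlain(pattern.substring(0, n - 1))) {
      StartsWith(pattern.substring(0, n - 1))       // 'abc%'
    } else if (n >= 2 && pattern.startsWith("%") &&
        isPlain(pattern.substring(1))) {
      EndsWith(pattern.substring(1))                // '%abc'
    } else {
      Like(pattern)                                 // anything else is kept as LIKE
    }
  }
}
```

Applied to each pattern of a LIKE ALL or LIKE ANY list, a classification like this separates the patterns that can become cheap string predicates from those that must remain in the multi-pattern LIKE.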
Why are the changes needed?
Optimize Like Any/All
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Jenkins test.