Skip to content

[SPARK-27279][SQL] Reuse subquery should compare child plan of SubqueryExec #24214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

adrian-wang
Copy link
Contributor

What changes were proposed in this pull request?

For now, ReuseSubquery in Spark compares two subqueries at SubqueryExec level, which invalidates the ReuseSubquery rule. This pull request fixes this, and add a configuration key for subquery reuse exclusively.

How was this patch tested?

add a unit test.

return plan
}
// Build a hash map using schema of subqueries to avoid O(N*N) sameResult calls.
val subqueries = mutable.HashMap[StructType, ArrayBuffer[SubqueryExec]]()
plan transformAllExpressions {
case sub: ExecSubqueryExpression =>
val sameSchema = subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[SubqueryExec]())
val sameResult = sameSchema.find(_.sameResult(sub.plan))
val sameResult = sameSchema.find(_.child.sameResult(sub.plan.child))
Copy link
Member

@maropu maropu Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change fixes the bug you described in the PR description? Sorry, but I'm not clearly sure about what's the problem here, so could you show us an example query to reproduce the bug? Also, could you make the title more precise...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, you can see the example in the added test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure either about this fix. I remember this problem which I tried to fix in 36fa664. As you can see, the PR didn't actually do this fix as per @cloud-fan 's comment. So I am pinging him here for reviewing this too. Anyway, when I debugged it, the problem was that being the same object according to equals, the tree was not updated, so I see no reason for the current fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we just overwrite SubqueryExec.doCanonicalize to strip it? It seems the SubqueryExec.name is stopping us from match semantically same subqueries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated code as @cloud-fan suggests, thanks for all your reviews!

@maropu maropu changed the title [SPARK-27279] [SQL] reuse subquery correctly [SPARK-27279][SQL] Reuse subquery correctly Mar 26, 2019
@SparkQA
Copy link

SparkQA commented Mar 26, 2019

Test build #103941 has finished for PR 24214 at commit 5fa78c8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang
Copy link
Contributor Author

retest this please.

@adrian-wang adrian-wang changed the title [SPARK-27279][SQL] Reuse subquery correctly [SPARK-27279][SQL] Reuse subquery should compare child plan of SubqueryExec Mar 26, 2019
@SparkQA
Copy link

SparkQA commented Mar 26, 2019

Test build #103952 has finished for PR 24214 at commit 5fa78c8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Mar 26, 2019

Test build #103961 has finished for PR 24214 at commit 5fa78c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -889,6 +889,12 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val SUBQUERY_REUSE_ENABLED = buildConf("spark.sql.subquery.reuse")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this config in this pr?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed a separated thing, but I'm ok to include it in this PR, as it's obviously wrong to use exchangeReuseEnabled in ReuseSubquery

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a PR separately to fix this problem before, #23998

@@ -674,6 +674,8 @@ case class SubqueryExec(name: String, child: SparkPlan) extends UnaryExecNode {

override def outputOrdering: Seq[SortOrder] = child.outputOrdering

override def doCanonicalize(): SparkPlan = child.canonicalized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#18169 fixed this problem like this before, but I don't know why was it reverted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile may you please share if you remember why it was reverted? Thanks.

@@ -889,6 +889,12 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val SUBQUERY_REUSE_ENABLED = buildConf("spark.sql.subquery.reuse")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a PR separately to fix this problem before, #23998

@SparkQA
Copy link

SparkQA commented Mar 27, 2019

Test build #104002 has finished for PR 24214 at commit 40dc3f9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 27, 2019

Test build #104001 has finished for PR 24214 at commit 7177ed7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Hi @adrian-wang can you do a rebase? I just merged #23998

@SparkQA
Copy link

SparkQA commented Mar 27, 2019

Test build #104006 has finished for PR 24214 at commit 6a69e05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang
Copy link
Contributor Author

@cloud-fan , rebase done and test passed.

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks! Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants