[SPARK-21492][SQL] Fix memory leak in SortMergeJoin #26164

Closed
wants to merge 8 commits

Conversation

xuanyuanking (Member) commented Oct 18, 2019

What changes were proposed in this pull request?

We propose a new mechanism by which downstream operators can notify their parents that the output data stream may be released. In this PR, we implement the mechanism as below (see the sketch after this list):

  • Add a function named cleanupResources to SparkPlan. By default it calls the children's cleanupResources; an operator that needs resource cleanup should override it with its own cleanup logic and also call super.cleanupResources, like SortExec in this PR.
  • Add supporting logic on the trigger side, which in this PR is SortMergeJoinExec: it makes sure cleanupResources is called to perform the cleanup for all of its upstream (children) operators.
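
For illustration, here is a minimal sketch of the two hooks described above. The method name cleanupResources comes from the description; the exact signatures, the rowSorter field, and the null guard are assumptions rather than the merged patch.

// In SparkPlan: the default implementation simply forwards the cleanup to all children.
protected[sql] def cleanupResources(): Unit = {
  children.foreach(_.cleanupResources())
}

// In SortExec: release the external sorter's memory first, then delegate to the default
// implementation so the call keeps propagating to this node's own children.
override protected[sql] def cleanupResources(): Unit = {
  if (rowSorter != null) {  // the sorter may not exist yet, e.g. for an empty input iterator
    rowSorter.cleanupResources()
  }
  super.cleanupResources()
}

// In SortMergeJoinExec (the trigger side): once one side of the join is exhausted,
// call cleanupResources() to eagerly free the children's sort buffers.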

Why are the changes needed?

This fixes the SortMergeJoin memory leak and implements a general framework for SparkPlan resource cleanup.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT: Added a new test suite, JoinWithResourceCleanSuite, to check both the standard and the code generation scenarios.

Integration test: Tested with the default driver/executor memory set to 1g, in local mode with 10 threads. The test below (thanks @taosaildrone for providing this test here) passes with this PR.

from pyspark.sql.functions import rand, col

spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
# spark.conf.set("spark.sql.sortMergeJoinExec.eagerCleanupResources", "true")

r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn('value', rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn('value2', rand())
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
joined = joined.coalesce(1)
joined.explain()
joined.show()

@xuanyuanking (Member Author)

cc @cloud-fan @gatorsmile


SparkQA commented Oct 18, 2019

Test build #112276 has finished for PR 26164 at commit f9567d5.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -161,6 +162,10 @@ case class SortMergeJoinExec(
    sqlContext.conf.sortMergeJoinExecBufferInMemoryThreshold
  }

  private def needEagerCleanup: Boolean = {
    sqlContext.conf.getConf(SORT_MERGE_JOIN_EXEC_EAGER_CLEANUP_RESOURCES)
Member

Seems this only controls cleanup behavior in the codegen path?

Member Author

Ah yes, thanks for the reminder. Done in 631f3cb.
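
For reference, a sketch of how such a flag could be declared in SQLConf; the config key comes from the test script above, while the default value and doc text here are assumptions:

val SORT_MERGE_JOIN_EXEC_EAGER_CLEANUP_RESOURCES =
  buildConf("spark.sql.sortMergeJoinExec.eagerCleanupResources")
    .internal()
    .doc("When true, SortMergeJoinExec eagerly cleans up its children's resources " +
      "(e.g. the sorters' memory) as soon as one side of the join is exhausted.")
    .booleanConf
    .createWithDefault(true)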


SparkQA commented Oct 19, 2019

Test build #112308 has finished for PR 26164 at commit 631f3cb.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 20, 2019

Test build #112328 has finished for PR 26164 at commit ec0f160.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 20, 2019

Test build #112336 has finished for PR 26164 at commit 93815f8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member) commented Oct 21, 2019

retest this please.


SparkQA commented Oct 21, 2019

Test build #112353 has finished for PR 26164 at commit 93815f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 21, 2019

Test build #112362 has finished for PR 26164 at commit defaaf2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

  private def attachCleanupResourceChecker(plan: SparkPlan): Unit = {
    // SPARK-21492: Check cleanupResources are finally triggered in SortExec node for every
    // test case
    val sorts = new ArrayBuffer[SortExec]()
Contributor

super nit: now we don't need this array

plan.foreachUp {
  case s: SortExec =>
    // add spy
  case _ =>
}

Member Author

Yep, simplified it at the same time :) 6d6dd5a


SparkQA commented Oct 21, 2019

Test build #112370 has finished for PR 26164 at commit 7787d45.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 21, 2019

Test build #112372 has finished for PR 26164 at commit 6d6dd5a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please


SparkQA commented Oct 21, 2019

Test build #112389 has finished for PR 26164 at commit 6d6dd5a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking (Member Author)

retest this please


SparkQA commented Oct 22, 2019

Test build #112431 has finished for PR 26164 at commit 6d6dd5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

cloud-fan closed this in bb49c80 Oct 22, 2019

@cloud-fan (Contributor)

@xuanyuanking can you send a PR for 2.4 backport? thanks!

xuanyuanking deleted the SPARK-21492 branch October 22, 2019 11:53

@xuanyuanking (Member Author)

Sure, backport to 2.4 in #26210.

cloud-fan pushed a commit that referenced this pull request Oct 24, 2019
…database style iterator

### What changes were proposed in this pull request?
Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base.

### Why are the changes needed?
During the work in #26164, after introducing a var `isReleased` checked in `hasNext`, it is possible that `isReleased` is false when `hasNext` is called but becomes true before `next` is called. A safer way is a database-style iterator: `advanceNext` and `getRow`.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26229 from xuanyuanking/SPARK-21492-follow-up.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
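
For context, here is a minimal sketch of the database-style iterator contract mentioned in the commit above, roughly the shape of Spark's RowIterator; the consumer loop is illustrative:

import org.apache.spark.sql.catalyst.InternalRow

// A single advanceNext() call both moves the cursor and reports whether a row exists,
// so the answer cannot go stale between separate hasNext and next calls (the race the
// isReleased flag ran into).
abstract class RowIterator {
  def advanceNext(): Boolean   // advance to the next row; returns false when exhausted
  def getRow: InternalRow      // the current row, valid until the next advanceNext()
}

// Illustrative consumer loop:
//   while (iter.advanceNext()) { process(iter.getRow) }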