[SPARK-21492][SQL][2.4] Fix memory leak in SortMergeJoin #26210

Closed

Conversation

xuanyuanking
Member

What changes were proposed in this pull request?

This PR adds a mechanism by which a downstream operator can notify its upstream operators that their output has been fully consumed and the resources backing it can be released. The mechanism is implemented as follows (a sketch of the pattern follows this list):

  • Add a `cleanupResources` method to `SparkPlan` whose default implementation calls `cleanupResources` on all children. An operator that holds releasable resources overrides it with its own cleanup logic and also calls `super.cleanupResources`, as `SortExec` does in this PR.
  • Add trigger logic on the consumer side, here `SortMergeJoinExec`, which calls `cleanupResources` so that all of its upstream (child) operators release their resources as soon as they are no longer needed.
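
To make the shape of the change concrete, here is a minimal, self-contained sketch of the pattern in plain Scala. These are not the actual Spark classes; `PlanNode`, `SortNode`, `ScanNode` and `JoinNode` are illustrative stand-ins for `SparkPlan`, `SortExec`, a leaf scan and `SortMergeJoinExec`.

```scala
// Sketch only: mirrors the propagation pattern described above, not Spark's real API.
abstract class PlanNode {
  def children: Seq[PlanNode]

  // Default behavior: propagate the cleanup request to all children,
  // analogous to the default SparkPlan.cleanupResources in this PR.
  def cleanupResources(): Unit = children.foreach(_.cleanupResources())
}

// An operator that holds a releasable resource (e.g. an external sorter),
// analogous to SortExec: release its own resource, then delegate to super.
class SortNode(child: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(child)
  private var sorterOpen = true

  override def cleanupResources(): Unit = {
    if (sorterOpen) {
      println("SortNode: releasing sorter memory and spill files")
      sorterOpen = false
    }
    super.cleanupResources()
  }
}

// A leaf operator with nothing of its own to release.
class ScanNode extends PlanNode {
  override def children: Seq[PlanNode] = Nil
}

// The trigger side, analogous to SortMergeJoinExec: once its inputs are
// exhausted it asks the whole upstream subtree to release resources
// instead of waiting for the task to finish.
class JoinNode(left: PlanNode, right: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(left, right)

  def consumeAndFinish(): Unit = {
    // ... iterate over the sorted inputs and produce join output ...
    cleanupResources()
  }
}

object CleanupSketch extends App {
  val join = new JoinNode(new SortNode(new ScanNode), new SortNode(new ScanNode))
  join.consumeAndFinish()
}
```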

Why are the changes needed?

It fixes the memory leak in SortMergeJoin and introduces a general framework for resource cleanup in SparkPlan.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests: a new test suite, JoinWithResourceCleanSuite, checks both the standard and code-generation scenarios.

Integration test: run with the default driver/executor memory of 1g, in local mode with 10 threads. The test below (thanks to taosaildrone for providing it in a comment on #23762) passes with this PR.

```
from pyspark.sql.functions import rand, col

spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn('value', rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn('value2', rand())
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
joined = joined.coalesce(1)
joined.explain()
joined.show()
```
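
(Presumably this script reproduces the leak because `coalesce(1)` after the join makes a single task process all of the join's shuffle partitions in sequence; before this fix, each partition's sorters held on to their memory and spill files until the task finished, so the usage accumulated across partitions.)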

@SparkQA

SparkQA commented Oct 22, 2019

Test build #112460 has finished for PR 26210 at commit 814e64f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2019

Test build #112461 has finished for PR 26210 at commit a4422d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Oct 22, 2019

Please add [2.4] to PR title?

@xuanyuanking xuanyuanking changed the title [SPARK-21492][SQL] Fix memory leak in SortMergeJoin [SPARK-21492][SQL][2.4] Fix memory leak in SortMergeJoin Oct 23, 2019
@cloud-fan
Contributor

thanks, merging to 2.4!

cloud-fan pushed a commit that referenced this pull request Oct 23, 2019
Closes #26210 from xuanyuanking/SPARK-21492-backport.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this Oct 23, 2019
@xuanyuanking
Member Author

Thanks!

@xuanyuanking xuanyuanking deleted the SPARK-21492-backport branch October 23, 2019 08:52
wangqia0309 pushed a commit to bigo-sg/spark that referenced this pull request Oct 30, 2019
jstokes pushed a commit to amperity/spark that referenced this pull request Jul 31, 2020