[SPARK-49386][CORE][SQL] Add memory based thresholds for shuffle spill #47856
Conversation
Let's probably file a new JIRA.

I am a bit swamped unfortunately, and I don't think I will be able to ensure this gets merged before next Monday @dongjoon-hyun - sorry about that :-( @cxzl25, will try to get around to reviewing this soon - apologies for the delay.

+CC @Ngone51 as well.

Thank you for letting me know, @mridulm ~ No problem at all.

Kindly ping @mridulm, do you have a chance to take another look? I also found this PR helpful for the stability of jobs that spill huge amounts of data.
mridulm left a comment
Just a few comments, mostly looks good to me.
Thanks for working on this @cxzl25, and apologies for the delay in getting to this!
+CC @HyukjinKwon, @cloud-fan as well for review.
core/src/main/scala/org/apache/spark/internal/config/package.scala (review thread: outdated, resolved)
By moving _elementsRead > numElementsForceSpillThreshold here, we would actually reduce some unnecessary allocations... nice!
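One plausible reading of this comment, as a minimal self-contained sketch (the class, field names, and initial threshold below are illustrative assumptions, not the actual Spillable code): checking the cheap record-count threshold first lets the sorter skip the memory-manager request entirely once that threshold has been crossed.

```scala
// Illustrative sketch only; the real Spillable mixin carries more state.
class SpillCheckSketch(
    forceSpillAfterElements: Long,
    acquireMemory: Long => Long) {

  // Assumed initial memory threshold (the real default is configurable).
  private var memoryThreshold: Long = 5L * 1024 * 1024

  def maybeSpill(currentMemory: Long, elementsRead: Long): Boolean = {
    // Cheap count-based check first: if it fires we never touch the memory
    // manager, avoiding the extra allocation attempt the comment mentions.
    if (elementsRead > forceSpillAfterElements) return true
    if (currentMemory >= memoryThreshold) {
      // Only now try to grow the threshold by requesting more execution memory.
      memoryThreshold += acquireMemory(2 * currentMemory - memoryThreshold)
      currentMemory >= memoryThreshold
    } else {
      false
    }
  }
}
```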
core/src/main/scala/org/apache/spark/util/collection/Spillable.scala (review thread: outdated, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (review thread: outdated, resolved)
The config name is a bit confusing.
spark.sql.windowExec.buffer.spill.threshold vs spark.sql.windowExec.buffer.spill.size.threshold.
Same for the others introduced.
I will let @HyukjinKwon or @cloud-fan comment better though.
I am not super used to this area. I would rather follow the suggestions from you / others.
Thanks @HyukjinKwon !
+CC @dongjoon-hyun as well.
I am planning to merge this next week if there are no concerns @cloud-fan, @dongjoon-hyun. I am not super keen on the naming of some of the sql configs, so I would appreciate your thoughts on that (as well as the rest of the PR). Also, +CC @attilapiros for feedback as well.
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java (review thread: outdated, resolved)
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java (review thread: outdated, resolved)
core/src/main/scala/org/apache/spark/util/collection/Spillable.scala (review thread: outdated, resolved)
@mridulm @cxzl25 @attilapiros @HyukjinKwon @pan3793 Hi all, just curious whether there are any issues with this PR or if it will be merged into OSS Spark sometime soon? Thanks again for making this change!

I did not merge it given @attilapiros was actively reviewing it.

checking
attilapiros left a comment
LGTM after the code duplicate is resolved.
sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowEvaluatorFactory.scala (review thread: outdated, resolved)
When running large shuffles (700TB input data, 200k map tasks, 50k reducers on a 300-node cluster), the job regularly OOMs in the map and reduce phases. IIUC, ShuffleExternalSorter (map side) and ExternalAppendOnlyMap / ExternalSorter (reduce side) try to max out the available execution memory. This in turn doesn't play nicely with the garbage collector, and executors fail with OutOfMemoryError when the memory allocated by these in-memory structures maxes out the available heap size (in our case we run with 9 cores/executor, 32G per executor).

To mitigate this, one can set spark.shuffle.spill.numElementsForceSpillThreshold to force the spill to disk. While this config works, it is not flexible enough, as it is expressed in number of elements; in our case we run multiple shuffles in a single job, and element size differs from one stage to another.

This patch extends the spill threshold behaviour and adds two new parameters to control the spill based on memory usage:
- spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold
- spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold
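For illustration, a sketch of how these thresholds might be combined on a SparkConf. The size values are arbitrary, and the exact config names and accepted value formats in the merged change may differ from the ones proposed here, so treat this as an assumption rather than the final API:

```scala
import org.apache.spark.SparkConf

// Hypothetical tuning for a job whose row sizes vary a lot between stages.
val conf = new SparkConf()
  // Existing knob: force a spill after this many records, regardless of their size.
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")
  // Proposed knobs from this PR: force a spill once the in-memory records
  // exceed a size limit, on the map side and the reduce side respectively.
  .set("spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold", "2g")
  .set("spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold", "2g")
```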
If the current changes look good, can you merge it please @attilapiros?

Thank you! @mridulm @attilapiros @cxzl25, looking forward to this change in the coming Spark release.

Merged to master.

@HyukjinKwon Can I close the old JIRA (https://issues.apache.org/jira/browse/SPARK-27734) as a duplicate, or what was your plan when you asked for a new ticket?
```
initialSize,
pageSizeBytes,
numRowsSpillThreshold,
maxSizeSpillThreshold,
```
Looking at this `add` method, it triggers spilling only if the number of rows exceeds numRowsInMemoryBufferThreshold. IIUC we are not using a memory-based threshold here, as we keep appending data to a memory buffer based on the num-rows threshold.
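To illustrate the point being made, a simplified sketch of the buffering behaviour described above (names and structure are illustrative assumptions, not the actual ExternalAppendOnlyUnsafeRowArray code): rows accumulate in a plain in-memory buffer until a row-count threshold is hit, and only then does the spill-aware sorter take over, so a size-based threshold has no effect during that initial buffering phase.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only.
class RowBufferSketch[T](
    numRowsInMemoryBufferThreshold: Int,
    appendToSpillableSorter: T => Unit) {

  private val inMemoryBuffer = ArrayBuffer.empty[T]
  private var spillableSorterInUse = false

  def add(row: T): Unit = {
    if (!spillableSorterInUse && inMemoryBuffer.length < numRowsInMemoryBufferThreshold) {
      // Only the *count* of buffered rows is checked here; their size in bytes
      // is not, which is the gap the review comment points out.
      inMemoryBuffer += row
    } else {
      if (!spillableSorterInUse) {
        // Switch over: hand everything buffered so far to the spill-aware sorter.
        inMemoryBuffer.foreach(appendToSpillableSorter)
        inMemoryBuffer.clear()
        spillableSorterInUse = true
      }
      appendToSpillableSorter(row)
    }
  }
}
```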
…memory based spill threshold

### What changes were proposed in this pull request?

This is a followup of #47856. It makes the memory tracking more accurate in several places:
1. In `ShuffleExternalSorter`/`UnsafeExternalSorter`, memory is used by both the sorter itself and its underlying in-memory sorter (for sorting shuffle partition ids). We need to add them up to calculate the current memory usage.
2. In `ExternalAppendOnlyUnsafeRowArray`, records are inserted into an in-memory buffer first. If the buffer gets too large (currently based on num records), we switch to `UnsafeExternalSorter`. The in-memory buffer also needs a memory-based threshold.

### Why are the changes needed?

More accurate memory tracking results in better spill decisions.

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #52190 from cloud-fan/spill.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yi Wu <yi.wu@databricks.com>
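A small sketch of point 1 above, purely to illustrate the accounting (it is not the actual ShuffleExternalSorter/UnsafeExternalSorter code; class and method names are made up): both the record pages held by the sorter and the pointer array held by its in-memory sorter count against the size-based spill threshold, so only their sum reflects the true usage.

```scala
// Illustrative sketch only.
class SorterMemoryAccounting(maxSizeForSpillBytes: Long) {
  private var dataPageBytes = 0L        // bytes held in allocated record pages
  private var inMemorySorterBytes = 0L  // bytes held by the in-memory pointer sorter

  def recordInserted(pageGrowth: Long, pointerArrayGrowth: Long): Unit = {
    dataPageBytes += pageGrowth
    inMemorySorterBytes += pointerArrayGrowth
  }

  // Comparing only one component against the threshold would under-count the
  // real footprint and delay the spill; the follow-up adds the two up instead.
  def shouldSpill: Boolean = (dataPageBytes + inMemorySorterBytes) > maxSizeForSpillBytes
}
```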
### What changes were proposed in this pull request?

This PR aims to document newly added `core` module configurations as a part of Apache Spark 4.1.0 preparation.

### Why are the changes needed?

To help the users use new features easily.
- #47856
- #51130
- #51163
- #51604
- #51630
- #51708
- #51885
- #52091
- #52382

### Does this PR introduce _any_ user-facing change?

No behavior change because this is a documentation update.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52626 from dongjoon-hyun/SPARK-53926.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Original author: @amuraru
What changes were proposed in this pull request?
This PR aims to add memory-based thresholds for shuffle spill.
It introduces configurations for the map-side and reduce-side size-based spill thresholds (spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold and spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold, described above).
Why are the changes needed?
#24618
We can only determine when to spill by configuring spark.shuffle.spill.numElementsForceSpillThreshold, which counts elements. In some scenarios, the size of a single row in memory may be very large.

Does this PR introduce any user-facing change?
No
How was this patch tested?
GA
Verified in the production environment: task time is shortened, the number of disk spills is reduced, there is a better chance to compress the shuffle data, and the size of the data spilled to disk is also significantly reduced.
Screenshots comparing spill metrics before (Current) and after (PR) the change:

Was this patch authored or co-authored using generative AI tooling?
No