[SPARK-47910][CORE] close stream when DiskBlockObjectWriter closeResources to avoid memory leak
### What changes were proposed in this pull request?
Close the stream in `DiskBlockObjectWriter.closeResources` to avoid a memory leak.
### Why are the changes needed?
[SPARK-34647](https://issues.apache.org/jira/browse/SPARK-34647) replaced `ZstdInputStream` with `ZstdInputStreamNoFinalizer`. As a result, every caller of `CompressionCodec.compressedOutputStream` must close the stream explicitly, since this is no longer handled by the finalizer mechanism.
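For illustration, a minimal sketch of the close-it-yourself pattern this imposes (the file path and payload are hypothetical placeholders; `ZstdOutputStreamNoFinalizer` comes from the zstd-jni library Spark uses):
```scala
import java.io.FileOutputStream
import com.github.luben.zstd.ZstdOutputStreamNoFinalizer

// Hypothetical illustration: the NoFinalizer stream variants hold native zstd
// memory that only close() releases; nothing reclaims it if close() is skipped.
val fileOut = new FileOutputStream("/tmp/example.zst") // placeholder path
val out = new ZstdOutputStreamNoFinalizer(fileOut)
try {
  out.write(Array[Byte](1, 2, 3)) // placeholder payload
} finally {
  out.close() // frees the native zstd context
}
```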
When zstd is used for shuffle write compression and the write is interrupted for some reason (e.g. `spark.sql.execution.interruptOnCancel` is enabled and the job is cancelled), the native memory held by `ZstdInputStreamNoFinalizer` may never be freed, causing a memory leak.
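Conceptually, the fix makes `closeResources` close the whole stream chain even when an interrupt left it open. Below is a simplified sketch of the idea, with field names loosely modeled on `DiskBlockObjectWriter` (`objOut` for the serialization stream, `bs` for the compressed stream beneath it); it is illustrative, not the exact patch:
```scala
import java.io.OutputStream

// Simplified sketch, not the exact patch: cleanup that always closes the
// compression stream so its native zstd memory is released.
class WriterSketch(objOut: OutputStream, private var bs: OutputStream) {
  private var streamOpen = true

  def closeResources(): Unit = {
    try {
      if (streamOpen) {
        objOut.close() // closes the serialization stream and what it wraps
      }
    } finally {
      streamOpen = false
      if (bs != null) {
        bs.close() // close the compression stream even if objOut.close() threw
        bs = null
      }
    }
  }
}
```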
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
#### Spark Shell Configuration
```
$> export SPARK_SUBMIT_OPTS="-XX:+AlwaysPreTouch -Xms1g"
$> $SPARK_HOME/bin/spark-shell --conf spark.io.compression.codec=zstd
```
#### Test Script
```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random

sc.setJobGroup("jobA", "this is a job to be cancelled", interruptOnCancel = true)

(1 to 50).foreach { batch =>
  // Start a shuffle-heavy join asynchronously, then cancel it mid-flight.
  val jobA = Future {
    val df1 = spark.range(2000000).map { _ =>
      (Random.nextString(20), Random.nextInt(1000), Random.nextInt(1000), Random.nextInt(10))
    }.toDF("a", "b", "c", "d")
    val df2 = spark.range(2000000).map { _ =>
      (Random.nextString(20), Random.nextInt(1000), Random.nextInt(1000), Random.nextInt(10))
    }.toDF("a", "b", "c", "d")
    df1.join(df2, "b").show()
  }
  Thread.sleep(5000)
  sc.cancelJobGroup("jobA")
}
```
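Each iteration launches a shuffle-heavy join in the background and cancels its job group after five seconds, so shuffle writes are repeatedly interrupted mid-stream, which exercises the leaking code path described above.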
#### Memory Monitor
```
$> while true; do echo \"$(date +%Y-%m-%d' '%H:%M:%S)\",$(pmap -x <PID> | grep "total kB" | awk '{print $4}'); sleep 10; done;
```
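Because `-Xms1g -XX:+AlwaysPreTouch` commits the JVM heap up front, subsequent growth in the total resident size reported by `pmap` can be attributed to native (off-heap) allocations such as zstd contexts.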
#### Results
##### Before
```
"2024-05-13 16:54:23",1332384
"2024-05-13 16:54:33",1417112
"2024-05-13 16:54:43",2211684
"2024-05-13 16:54:53",3060820
"2024-05-13 16:55:03",3850444
"2024-05-13 16:55:14",4631744
"2024-05-13 16:55:24",5317200
"2024-05-13 16:55:34",6019464
"2024-05-13 16:55:44",6489180
"2024-05-13 16:55:54",7255548
"2024-05-13 16:56:05",7718728
"2024-05-13 16:56:15",8388392
"2024-05-13 16:56:25",8927636
"2024-05-13 16:56:36",9473412
"2024-05-13 16:56:46",10000380
"2024-05-13 16:56:56",10344024
"2024-05-13 16:57:07",10734204
"2024-05-13 16:57:17",11211900
"2024-05-13 16:57:27",11665524
"2024-05-13 16:57:38",12268976
"2024-05-13 16:57:48",12896264
"2024-05-13 16:57:58",13572244
"2024-05-13 16:58:09",14252416
"2024-05-13 16:58:19",14915560
"2024-05-13 16:58:30",15484196
"2024-05-13 16:58:40",16170324
```
##### After
```
"2024-05-13 16:35:44",1355428
"2024-05-13 16:35:54",1391028
"2024-05-13 16:36:04",1673720
"2024-05-13 16:36:14",2103716
"2024-05-13 16:36:24",2129876
"2024-05-13 16:36:35",2166412
"2024-05-13 16:36:45",2177672
"2024-05-13 16:36:55",2188340
"2024-05-13 16:37:05",2190688
"2024-05-13 16:37:15",2195168
"2024-05-13 16:37:26",2199296
"2024-05-13 16:37:36",2228052
"2024-05-13 16:37:46",2238104
"2024-05-13 16:37:56",2260624
"2024-05-13 16:38:06",2307184
"2024-05-13 16:38:16",2331140
"2024-05-13 16:38:27",2323388
"2024-05-13 16:38:37",2357552
"2024-05-13 16:38:47",2352948
"2024-05-13 16:38:57",2364744
"2024-05-13 16:39:07",2368528
"2024-05-13 16:39:18",2385492
"2024-05-13 16:39:28",2389184
"2024-05-13 16:39:38",2388060
"2024-05-13 16:39:48",2388336
"2024-05-13 16:39:58",2386916
```
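Before the fix, resident memory grows steadily to about 16 GB within roughly four minutes of cancelled jobs; with the fix, it levels off at roughly 2.4 GB.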
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #46131 from JacobZheng0927/zstdMemoryLeak.
Authored-by: JacobZheng0927 <zsh517559523@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>