Memory leaks when running the TPC-H benchmark repeatedly #884
I've investigated the problem and found that the leak is caused by these 2 allocations: https://github.com/apache/datafusion-comet/blob/0.2.0/common/src/main/scala/org/apache/comet/vector/NativeUtil.scala#L65-L66

```scala
val arrowSchema = ArrowSchema.allocateNew(allocator)
val arrowArray = ArrowArray.allocateNew(allocator)
```

These construct the Arrow C Data Interface structures for transferring Arrow batch vectors from Scala (JVM) to the native executor (Rust). The native executor moves the transferred vectors and takes ownership of them, but the `ArrowSchema` and `ArrowArray` structures allocated on the JVM side are never closed, so the native memory allocated by them accumulates on every batch. I applied a fix on my fork and the problem went away: Kontinuation@a90f43a
Running the TPC-H benchmarks with AQE coalesce partitions enabled still has a memory leak; I'm still investigating it.
Finally, I fixed the memory leak caused by AQE coalesce partitions: Kontinuation@8657a82. The ArrowStreamWriter holds Arrow dictionaries internally, and the native memory held by those dictionaries leaks if the writer is not closed. I've also tried reverting #613 on top of my fix; the allocator in StreamReader could then be closed without reporting leaks when running all 22 TPC-H queries.
Hmm, the arrowSchema and arrowArray structures should be automatically released when the native side drops the imported array/schema. The release callback should do that. I think this follows the C Data Interface.
I'm afraid that's not the case. To my understanding, the release callback frees everything except the base structure. Please refer to the reference implementation of the release handler in the specification.
If you are referring to the JVM instances of
Ah, I took another look at the JVM
Describe the bug
I've built DataFusion Comet using commit f7f0bb1 for Spark 3.5.1. I found that the memory usage keeps increasing when repeatedly running the TPC-H benchmark script on a set of Parquet files. The Parquet files were generated using https://github.com/databricks/spark-sql-perf with scale factor = 10. The memory usage could be as high as 20 GB. Given the Spark and Comet configurations I'm using to run the benchmarks (see Additional context), this seems problematic.
I've noticed that the native memory allocated by `Unsafe_AllocateMemory0` keeps increasing, using `jcmd VM.native_memory detail.diff | grep Unsafe -A 2`. I'm not enabling off-heap memory, so the allocation should be initiated by the Arrow `RootAllocator`:
Initially, after setting the baseline:
After 10 minutes:
The leaked memory was allocated by the `CometArrowAllocator`. I've verified this by attaching a debugger to the Spark process and inspecting `CometArrowAllocator.getAllocatedMemory`.

I've also deliberately disabled AQE coalesce partitions since I noticed this issue: #381. Although it is fixed, I still disabled it to be safe. See the Additional context section for more details.
Steps to reproduce
Run the TPC-H benchmark script with `--iterations=100` and observe the RSS of the Spark process increase over time.

Expected behavior
Memory usage should not increase over time.
Additional context
I'm simply running it locally with `master = local[4]`. Here are my test environment and Spark configurations:

Environment:
Spark configurations: