Detected memory leak on Comet columnar shuffle when AQE coalesce partitions enabled #381

Closed
viirya opened this issue May 4, 2024 · 0 comments · Fixed by #380
Labels
bug Something isn't working

viirya commented May 4, 2024

Describe the bug

A few tests fail because the Arrow Java library reports a memory leak. The failures were found in #250 after enabling columnar shuffle by default for the Spark SQL tests. For example,

In AdaptiveQueryExecSuite:

[info] - SPARK-35455: Unify empty relation optimization between normal and AQE optimizer - single join *** FAILED *** (3 seconds, 170 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 729.0 failed 1 times, most recent failure: Lost task 0.0 in stage 729.0 (TID 1631) (e2b4fe719fb3 executor driver): org.apache.comet.CometNativeException: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (32)
[info] Allocator(StreamReader/CometBlockStoreShuffleReader) 0/32/32/9223372036854775807 (res/actual/peak/limit)
[info] 
[info] 	at org.apache.comet.Native.executePlan(Native Method)
[info] 	at org.apache.comet.CometExecIterator.executeNative(CometExecIterator.scala:71)
[info] 	at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:123)
[info] 	at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:138)

After debugging these failures, it seems the leak is triggered when AQE partition coalescing is enabled.

I think this is because, when partition coalescing is enabled, the partitions of multiple reducers are combined into one, so the Arrow StreamReader ends up reading data in a format it does not expect.

For now, we should disable Comet columnar shuffle when AQE partition coalescing is enabled.
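
As a rough illustration of the proposed guard, the sketch below decides from a plain configuration map whether columnar shuffle should stay enabled. It is not the code from #380: `ShuffleGuard` and `columnarShuffleAllowed` are hypothetical names, and the `"true"` defaults are an assumption that matches Spark 3.2 and later. The two `spark.sql.adaptive.*` keys are the standard Spark SQL settings.

```java
import java.util.HashMap;
import java.util.Map;

public class ShuffleGuard {
    // Standard Spark SQL configuration keys.
    static final String AQE_ENABLED = "spark.sql.adaptive.enabled";
    static final String AQE_COALESCE = "spark.sql.adaptive.coalescePartitions.enabled";

    // Hypothetical helper: returns false when AQE may coalesce shuffle
    // partitions, which is the condition the issue links to the leak.
    // Defaults of "true" are an assumption (Spark 3.2+ behaviour).
    static boolean columnarShuffleAllowed(Map<String, String> conf) {
        boolean aqe = Boolean.parseBoolean(conf.getOrDefault(AQE_ENABLED, "true"));
        boolean coalesce = Boolean.parseBoolean(conf.getOrDefault(AQE_COALESCE, "true"));
        // Coalescing only takes effect when AQE itself is enabled.
        return !(aqe && coalesce);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // With the assumed defaults, coalescing is active, so the guard trips.
        System.out.println(columnarShuffleAllowed(conf));

        // Turning coalescing off lets columnar shuffle stay enabled.
        conf.put(AQE_COALESCE, "false");
        System.out.println(columnarShuffleAllowed(conf));
    }
}
```

Setting `spark.sql.adaptive.coalescePartitions.enabled` to `false` in the affected tests would be the other way to avoid the combination, at the cost of losing coalescing for those queries.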

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response
