Closed
Description
Describe the bug
In some configurations/environments, I see queries fail due to memory pool requests being rejected, but I would expect Comet to spill to disk instead.
In one example, I am running TPC-H @ SF=1000 (1TB) in k8s. I am specifying spark.comet.exec.replaceSortMergeJoin=false to force the use of CometSortMergeJoinExec.
--conf spark.executor.instances=4 \
--conf spark.executor.cores=8 \
--conf spark.executor.memory=8G \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=4g \
I allocated 4 GB of off-heap memory, which equates to 512 MB per core.
I saw memory requests fail with the memory pool limit at ~512MB.
I then doubled the off-heap memory to 8 GB and still see the same failure, except that the pool limit is now ~1GB. I would expect spilling to kick in instead.
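The per-core figures above can be checked with a short calculation. This is only a sketch of the arithmetic, assuming the off-heap allocation is divided evenly across executor cores as described in this report; the function name `per_core_pool_mb` is made up for illustration and is not a Spark or Comet API.

```python
def per_core_pool_mb(off_heap_gb: int, executor_cores: int) -> float:
    """Off-heap memory per core in MB, assuming an even split across cores."""
    return off_heap_gb * 1024 / executor_cores


# spark.memory.offHeap.size=4g with spark.executor.cores=8
print(per_core_pool_mb(4, 8))  # 512.0

# After doubling the off-heap size to 8 GB
print(per_core_pool_mb(8, 8))  # 1024.0
```

Both values line up with the pool limits seen in the failures (~512MB, then ~1GB).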
org.apache.comet.CometNativeException: Additional allocation failed with top memory consumers (across reservations) as:
ExternalSorter[107]#2991(can spill: true) consumed 1024.2 MB,
ExternalSorterMerge[107]#2990(can spill: false) consumed 16.7 MB,
GroupedHashAggregateStream[107] ()#2994(can spill: true) consumed 0.0 B,
GroupedHashAggregateStream[107] ()#2995(can spill: true) consumed 0.0 B,
ExternalSorterMerge[107]#2992(can spill: false) consumed 0.0 B,
I also see pods being killed due to OOM:
NAME READY STATUS RESTARTS AGE
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-1 0/1 OOMKilled 0 11m
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-3 0/1 OOMKilled 0 11m
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-4 0/1 OOMKilled 0 11m
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-5 1/1 Running 0 4s
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-6 1/1 Running 0 4s
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-7 1/1 Running 0 3s
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-8 0/1 ContainerCreating 0 1s
I also see errors in the executor logs:
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 917504 bytes but task only has 0 bytes of memory from the off-heap execution pool
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 839664 bytes but task only has 0 bytes of memory from the off-heap execution pool
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 917504 bytes but task only has 0 bytes of memory from the off-heap execution pool
Some related Spark logging:
25/09/24 21:27:20 INFO TaskMemoryManager: Memory used in task 20632
25/09/24 21:27:20 INFO TaskMemoryManager: 1073741824 bytes of memory were used by task 20632 but are not associated with specific consumers
25/09/24 21:27:20 INFO TaskMemoryManager: 5677121232 bytes of memory are used for execution and 962961 bytes of memory are used for storage
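To make the byte counts in these log lines easier to compare with the pool sizes above, they can be converted to MiB. The snippet below is a minimal sketch that only parses the two lines quoted here; the helper name `byte_counts_mib` is hypothetical.

```python
import re

# The two TaskMemoryManager lines quoted above, copied verbatim.
LOG_LINES = [
    "25/09/24 21:27:20 INFO TaskMemoryManager: 1073741824 bytes of memory were used by task 20632 but are not associated with specific consumers",
    "25/09/24 21:27:20 INFO TaskMemoryManager: 5677121232 bytes of memory are used for execution and 962961 bytes of memory are used for storage",
]


def byte_counts_mib(line: str) -> list[float]:
    """Return every '<N> bytes' value found in the line, converted to MiB."""
    return [int(n) / 2**20 for n in re.findall(r"(\d+) bytes", line)]


for line in LOG_LINES:
    print([round(v, 2) for v in byte_counts_mib(line)])
```

The first value is exactly 1024 MiB (1 GiB), which matches the ~1GB pool limit reported after doubling the off-heap memory.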
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response
Metadata
Labels
bug (Something isn't working)