Core: Fix possible deadlock in ParallelIterable #11781
Conversation
force-pushed from ec139c4 to da723f0
force-pushed from b072cc9 to cab3521
force-pushed from b41baf9 to c94513b
rebased
force-pushed from c94513b to 66ed87f
From a discussion I had with @sopel39 today: I think we can go forward with this solution, but I think it will basically re-introduce the memory usage issue that we saw previously (for some use cases). From our discussion I believe we have been working at this from the wrong direction.

Recap

The original implementation basically assumed that we could read as many iterators as parallelism allowed. This led to essentially unbounded memory usage, which became an issue for systems using a lightweight coordinator: the coordinator would be required to have a huge memory footprint (essentially all the metadata entries for a table held in memory per query).

Next, to solve the memory footprint issue we added what is essentially a buffered read-ahead. A queue is filled with elements from the files we are reading in parallel, and we check the queue depth every time we add elements in order to bound its size. Because the max queue size is checked on every add, we can never go more than "parallelism" items over the max queue size. Unfortunately, this leads to the current deadlock issue: we can potentially yield an iterator in the middle of a file and be left only with iterators for files which cannot yet be opened, because all file handles are owned by the yielded iterators.

The current proposed solution is to switch from checking the queue size per element to checking it only before opening a new file for the first time. This means that any file that is opened is read completely into the read-ahead queue. This fixes the deadlock issue, since we will never yield in the middle of a file, but possibly reintroduces the memory issue: in the worst case we would open up to "parallelism" files simultaneously and load them all into the queue before having a chance to check our queue size.

Where do we go from here

The current implementation is basically trying to solve the general problem of bounding a read-ahead queue over arbitrary iterators, but that is over-general for what we are actually trying to do. Our actual problem is reading a known set of files, and the key difference is that we know exactly how long each file is before we open it. Instead of simply opening files up to the parallelism limit, I think what we should do is something like the following (I haven't thought this part through too much; the synchronization primitives are just for representation, and I realize it won't work exactly like this).
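A rough illustration of that byte-budgeted direction, assuming a budget sized by file length (class and method names here are hypothetical, not Iceberg APIs, and the synchronization is only representative):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical sketch only: bound the read-ahead by bytes instead of element count.
 * FileSlice, Batch, and the field names are illustrative, not Iceberg APIs.
 */
class ByteBudgetedReadAhead<T> {

  /** Stand-in for "a file whose length is known before it is opened". */
  interface FileSlice<T> {
    long lengthInBytes();

    Iterator<T> open();
  }

  /** A fully read file together with the byte budget it is holding. */
  private static final class Batch<T> {
    private final List<T> rows;
    private final int bytes;

    Batch(List<T> rows, int bytes) {
      this.rows = rows;
      this.bytes = bytes;
    }
  }

  private final Semaphore byteBudget; // permits == bytes we are allowed to buffer
  private final BlockingQueue<Batch<T>> queue = new ArrayBlockingQueue<>(1024);

  ByteBudgetedReadAhead(int maxBufferedBytes) {
    this.byteBudget = new Semaphore(maxBufferedBytes);
  }

  /** Producer: reserve the whole file before opening it, then drain it completely. */
  void readFile(FileSlice<T> file) throws InterruptedException {
    int bytes = Math.toIntExact(file.lengthInBytes()); // sketch assumes files < 2 GB
    byteBudget.acquire(bytes); // block until the entire file fits in the budget
    List<T> rows = new ArrayList<>();
    Iterator<T> reader = file.open(); // a file handle/connection is held only from here on
    while (reader.hasNext()) {
      rows.add(reader.next()); // drain fully; never yield mid-file
    }
    queue.put(new Batch<>(rows, bytes));
  }

  /** Consumer: return the byte budget once a whole batch has been taken. */
  Iterator<T> nextBatch() throws InterruptedException {
    Batch<T> batch = queue.take();
    byteBudget.release(batch.bytes); // frees budget so the next file can be opened
    return batch.rows.iterator();
  }
}
```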
It was observed that in a high-concurrency/high-workload scenario the cluster deadlocks due to manifest readers waiting for a connection from the S3 pool.

Specifically, ManifestGroup#plan will create a ManifestReader per ParallelIterable.Task. These readers effectively hold onto an S3 connection from the pool. When the ParallelIterable queue is full, the Task is tabled for later use.

Consider the scenario:
S3 connection pool size = 1
approximateMaxQueueSize = 1
workerPoolSize = 1

ParallelIterable1: starts TaskP1
ParallelIterable1: TaskP1 produces a result, the queue becomes full, TaskP1 is put on hold (holding an S3 connection)
ParallelIterable2: starts TaskP2; TaskP2 is scheduled on the worker pool but is blocked on the S3 connection pool
ParallelIterable1: the result gets consumed, TaskP1 is scheduled again
ParallelIterable1: TaskP1 waits for the worker pool to be free, but TaskP2 is waiting for TaskP1 to release its connection

The fix makes sure a Task is finished once it's started. This way limited resources like connection pools are not put on hold. The queue size might exceed the strict limit, but it should still be bounded.

Fixes apache#11768
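A condensed, hypothetical sketch of that behavior change (not the actual ParallelIterable patch; the class name and the spin-wait are only illustrative):

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

// Hypothetical simplification of the fixed behavior, not the actual patch:
// the queue bound is consulted only before a reader is opened; once a reader
// is open (and holding a connection from the pool), it is drained completely.
final class DrainFileFullyTask<T> implements Runnable {
  private final Iterable<Supplier<Iterator<T>>> fileOpeners; // each supplier opens one manifest reader
  private final ConcurrentLinkedQueue<T> queue;
  private final int approximateMaxQueueSize;

  DrainFileFullyTask(
      Iterable<Supplier<Iterator<T>>> fileOpeners,
      ConcurrentLinkedQueue<T> queue,
      int approximateMaxQueueSize) {
    this.fileOpeners = fileOpeners;
    this.queue = queue;
    this.approximateMaxQueueSize = approximateMaxQueueSize;
  }

  @Override
  public void run() {
    for (Supplier<Iterator<T>> opener : fileOpeners) {
      while (queue.size() >= approximateMaxQueueSize) {
        // Back-pressure applies only while no reader (and no pooled connection) is held.
        // The real Task is put on hold and rescheduled here instead of spinning.
        Thread.onSpinWait();
      }
      Iterator<T> reader = opener.get(); // the connection is acquired only now
      while (reader.hasNext()) {
        queue.add(reader.next()); // drain fully once started; never pause mid-file
      }
    }
  }
}
```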
force-pushed from 66ed87f to 3436e7f
@findepi I'm on board with this, but I want to make sure you are also happy with it. I'm unsure whether having the ability to yield before files will really help memory pressure, so I'm slightly inclined to just revert the yielding capabilities altogether and go back to the implementation from last year. I'm willing to go forward with either direction for now, though. In the future I think we really need a fake-filesystem benchmarking test that we can use to simulate how the algorithms we write here will behave, since we are mostly working blind.
As noted in my comment to Piotr, I think this is a fix for the deadlock, but it may be better to just remove the yielding behavior altogether until we have a better replacement. If folks have experience where file-level yielding would appropriately limit memory usage, I think we can go forward with this as an interim solution.
thanks @RussellSpitzer for the great summary.
This change looks fine.
I also feel OK with reverting the yield change completely, as I feel this deadlock-avoidance change might make the yield solution ineffective anyway. The OOM scenario was a large thread pool (like 184 threads) and large manifest files (hundreds of MBs or GBs).
@sopel39 I assume you tried this change and it avoided deadlock problem. Has this been tested with the OOM scenario for large manifest files?
BTW, I like the new direction that @RussellSpitzer outlined. Using byte size (instead of number of elements) is more intuitive and makes it easier to calculate a good default to cap the memory footprint. E.g., the max read bytes could be set to 1/10 of the JVM heap size (1/10 is just an example; we can estimate a proper default based on a typical file size/memory footprint ratio). There could be an edge condition where a single manifest file is larger than the default value; in that case, if the buffer is empty or less than half full, the task should still be allowed.
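A minimal sketch of that admission rule, with hypothetical names (not Iceberg code):

```java
// Hypothetical admission check for a byte-based cap, as suggested above.
final class ByteBudgetAdmission {
  private ByteBudgetAdmission() {}

  /**
   * Admit a file if it fits in the remaining byte budget, or if the buffer is
   * empty or less than half full, so a single oversized manifest can still proceed.
   */
  static boolean admit(long bufferedBytes, long fileBytes, long maxBufferedBytes) {
    boolean fits = bufferedBytes + fileBytes <= maxBufferedBytes;
    boolean bufferMostlyDrained = bufferedBytes == 0 || bufferedBytes < maxBufferedBytes / 2;
    return fits || bufferMostlyDrained;
  }
}
```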
We tested in a staging environment. So far the coordinator hasn't shown any excessive memory usage.
Would this restore the OOM problem?
Potentially, this fix would also restore the OOM problem.
IIRC the OOM problem was a real production problem (cc @raunaqmorarka @dekimir @losipiuk), so I am not convinced it's OK to restore it.
@findepi The deadlock issue is probably the worse of the two evils; it probably needs to be addressed urgently.
It seems we're in agreement. The deadlock must be resolved. This PR supposedly avoids any deadlocks at the cost of increased memory usage, which is fair. I think we should not remove memory pressure protections.
Given the many approvals, it looks ready to go.
Thank you @sopel39 for the PR and @osscm @RussellSpitzer @stevenzwu for the reviews!
* Fix ParallelIterable deadlock
* Do not submit a task when there is no space in queue