-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Memory may leak when EventInvalidException occurs #1571
Comments
We already have a fix, still working on it. |
cc @zuston |
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 13, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 13, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 13, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 15, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 15, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 15, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 15, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 15, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 15, 2024
leslizhang
pushed a commit
to leslizhang/incubator-uniffle
that referenced
this issue
Mar 18, 2024
zuston
pushed a commit
that referenced
this issue
Mar 18, 2024
…urs (#1574) ### What changes were proposed in this pull request? In the implementation of the methods `flushBuffer`, `handleEventAndUpdateMetrics`, and `removeBufferByShuffleId`, read-write locks have been added to manage concurrency. This ensures that a `ShuffleBuffer` successfully converted into a `flushEvent` won't be cleaned up again by `removeBufferByShuffleId`, and a `ShuffleBuffer` already cleaned up by `removeBufferByShuffleId` won't be transformed back into a `flushEvent`. This effectively resolves the concurrency issue. ### Why are the changes needed? Fix #1571 & #1560 & #1542 The key logic of the PR is as follows: Before this PR: 1. A `ShuffleBuffer` is turned into a `FlushEvent`, and **_its blocks and size are cleared_** → 2. The `FlushEvent` is added to the flushing queue → 3. The method `removeBufferByShuffleId` is executed, which causes the following things to happen: 3.1. Running the following code snippet, but please note that in the code below, `buffer.getBlocks()` **_is empty and size is 0_**, because of the step 1 above: ``` for (ShuffleBuffer buffer : buffers) { buffer.getBlocks().forEach(spb -> spb.getData().release()); ShuffleServerMetrics.gaugeTotalPartitionNum.dec(); size += buffer.getSize(); } ``` 3.2. `appId` is removed from the `bufferPool` → 4. The `FlushEvent` is taken out from the queue and encounters an `EventInvalidException` because the `appId` was removed before → 5. When handling the `EventInvalidException`, nothing is done and the `event.doCleanup()` method **_is not called, causing a memory leak_**. Of course, this is just one scenario of concurrency exceptions. In the previous code, without locking, in the `processFlushEvent` method, it is possible that the event may become invalid at any time when continuing executing in `processFlushEvent` method, which is why there is #1542. Also, there is #1560. --- After this PR: We will set a read lock for steps 1 and 2 above, a write lock for step 3, a read lock for step 4, and when encountering an `EventInvalidException` in step 5, we will call the `event.doCleanup()` method to release the memory. In this way, we can ensure the following things when resources are being cleaned up: 1. `ShuffleBuffers` that have not yet been converted to `FlushEvents` will not be converted in the future, but will be directly cleaned up. 2. `FlushEvents` that have been converted from `ShuffleBuffers` will definitely encounter an `EventInvalidException`, and we will eventually handle this exception correctly, releasing memory. 3. If there is already a `FlushEvent` being processed and it is about to be flushed to disk, the resource cleanup task will wait for all `FlushEvents` related to the `appId` to be completed before starting the cleanup task, ensuring that the cleanup and flushing tasks are completely independent and do not interfere with each other. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. --------- Co-authored-by: leslizhang <leslizhang@tencent.com>
rickyma
added a commit
to rickyma/incubator-uniffle
that referenced
this issue
Mar 21, 2024
…leId is cleaned up
zuston
pushed a commit
to zuston/incubator-uniffle
that referenced
this issue
May 27, 2024
…n` occurs (apache#1574) In the implementation of the methods `flushBuffer`, `handleEventAndUpdateMetrics`, and `removeBufferByShuffleId`, read-write locks have been added to manage concurrency. This ensures that a `ShuffleBuffer` successfully converted into a `flushEvent` won't be cleaned up again by `removeBufferByShuffleId`, and a `ShuffleBuffer` already cleaned up by `removeBufferByShuffleId` won't be transformed back into a `flushEvent`. This effectively resolves the concurrency issue. Fix apache#1571 & apache#1560 & apache#1542 The key logic of the PR is as follows: Before this PR: 1. A `ShuffleBuffer` is turned into a `FlushEvent`, and **_its blocks and size are cleared_** → 2. The `FlushEvent` is added to the flushing queue → 3. The method `removeBufferByShuffleId` is executed, which causes the following things to happen: 3.1. Running the following code snippet, but please note that in the code below, `buffer.getBlocks()` **_is empty and size is 0_**, because of the step 1 above: ``` for (ShuffleBuffer buffer : buffers) { buffer.getBlocks().forEach(spb -> spb.getData().release()); ShuffleServerMetrics.gaugeTotalPartitionNum.dec(); size += buffer.getSize(); } ``` 3.2. `appId` is removed from the `bufferPool` → 4. The `FlushEvent` is taken out from the queue and encounters an `EventInvalidException` because the `appId` was removed before → 5. When handling the `EventInvalidException`, nothing is done and the `event.doCleanup()` method **_is not called, causing a memory leak_**. Of course, this is just one scenario of concurrency exceptions. In the previous code, without locking, in the `processFlushEvent` method, it is possible that the event may become invalid at any time when continuing executing in `processFlushEvent` method, which is why there is apache#1542. Also, there is apache#1560. --- After this PR: We will set a read lock for steps 1 and 2 above, a write lock for step 3, a read lock for step 4, and when encountering an `EventInvalidException` in step 5, we will call the `event.doCleanup()` method to release the memory. In this way, we can ensure the following things when resources are being cleaned up: 1. `ShuffleBuffers` that have not yet been converted to `FlushEvents` will not be converted in the future, but will be directly cleaned up. 2. `FlushEvents` that have been converted from `ShuffleBuffers` will definitely encounter an `EventInvalidException`, and we will eventually handle this exception correctly, releasing memory. 3. If there is already a `FlushEvent` being processed and it is about to be flushed to disk, the resource cleanup task will wait for all `FlushEvents` related to the `appId` to be completed before starting the cleanup task, ensuring that the cleanup and flushing tasks are completely independent and do not interfere with each other. No. Existing UTs. --------- Co-authored-by: leslizhang <leslizhang@tencent.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Code of Conduct
Search before asking
Describe the bug
This issue may cause the memory of the shuffle server to never be released.
This is due to the multithreading concurrency issue between the scheduled cleanup task and the flush event.
Affects Version(s)
master
Uniffle Server Log Output
No response
Uniffle Engine Log Output
No response
Uniffle Server Configurations
No response
Uniffle Engine Configurations
No response
Additional context
No response
Are you willing to submit PR?
The text was updated successfully, but these errors were encountered: