[Bug] Memory may leak when EventInvalidException occurs #1571

rickyma · 2024-03-12T12:38:42Z

Code of Conduct

I agree to follow this project's Code of Conduct

Search before asking

I have searched in the issues and found no similar issues.

Describe the bug

This issue may cause the memory of the shuffle server to never be released.

[2024-03-08 17:51:25.737] [clearResourceThread] [INFO] HybridStorageManager.removeResources - Start to remove resource of AppPurgeEvent{appId='application_1703049085550_5925694_1709883294114', user='tdwadmin', shuffleIds=[0, 1, 2]}
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorageManager.cleanupStorageSelectionCache - Cleaning the storage selection cache costs: 4(ms) for event: AppPurgeEvent{appId='application_1703049085550_5925694_1709883294114', user='tdwadmin', shuffleIds=[0, 1, 2]}
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Start to remove resource of application_1703049085550_5925694_1709883294114/0
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Finish remove resource of application_1703049085550_5925694_1709883294114/0, disk size is 736983533790 and 2 shuffle metadata
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Start to remove resource of application_1703049085550_5925694_1709883294114/1
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Finish remove resource of application_1703049085550_5925694_1709883294114/1, disk size is 736983533790 and 2 shuffle metadata
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Start to remove resource of application_1703049085550_5925694_1709883294114/2
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Finish remove resource of application_1703049085550_5925694_1709883294114/2, disk size is 134246092 and 1 shuffle metadata
[2024-03-08 17:51:25.764] [LocalFileFlushEventThreadPool-25] [WARN] ShuffleFlushManager.processFlushEvent - AppId application_1703049085550_5925694_1709883294114 was removed already, event:ShuffleDataFlushEvent: eventId=38029, appId=application_1703049085550_5925694_1709883294114, shuffleId=2, startPartition=13682, endPartition=13682, retryTimes=0, underStorage=LocalStorage, isPended=false, ownedByHugePartition=false should be dropped
[2024-03-08 17:51:25.770] [LocalFileFlushEventThreadPool-25] [WARN] ShuffleFlushManager.processFlushEvent - AppId application_1703049085550_5925694_1709883294114 was removed already, event:ShuffleDataFlushEvent: eventId=38030, appId=application_1703049085550_5925694_1709883294114, shuffleId=2, startPartition=13683, endPartition=13683, retryTimes=0, underStorage=LocalStorage, isPended=false, ownedByHugePartition=false should be dropped

This is due to the multithreading concurrency issue between the scheduled cleanup task and the flush event.

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

The text was updated successfully, but these errors were encountered:

rickyma · 2024-03-12T14:38:32Z

We already have a fix, still working on it.

jerqi · 2024-03-13T03:27:12Z

cc @zuston

… occurs

…urs (#1574) ### What changes were proposed in this pull request? In the implementation of the methods `flushBuffer`, `handleEventAndUpdateMetrics`, and `removeBufferByShuffleId`, read-write locks have been added to manage concurrency. This ensures that a `ShuffleBuffer` successfully converted into a `flushEvent` won't be cleaned up again by `removeBufferByShuffleId`, and a `ShuffleBuffer` already cleaned up by `removeBufferByShuffleId` won't be transformed back into a `flushEvent`. This effectively resolves the concurrency issue. ### Why are the changes needed? Fix #1571 & #1560 & #1542 The key logic of the PR is as follows: Before this PR: 1. A `ShuffleBuffer` is turned into a `FlushEvent`, and **_its blocks and size are cleared_** → 2. The `FlushEvent` is added to the flushing queue → 3. The method `removeBufferByShuffleId` is executed, which causes the following things to happen: 3.1. Running the following code snippet, but please note that in the code below, `buffer.getBlocks()` **_is empty and size is 0_**, because of the step 1 above: ``` for (ShuffleBuffer buffer : buffers) { buffer.getBlocks().forEach(spb -> spb.getData().release()); ShuffleServerMetrics.gaugeTotalPartitionNum.dec(); size += buffer.getSize(); } ``` 3.2. `appId` is removed from the `bufferPool` → 4. The `FlushEvent` is taken out from the queue and encounters an `EventInvalidException` because the `appId` was removed before → 5. When handling the `EventInvalidException`, nothing is done and the `event.doCleanup()` method **_is not called, causing a memory leak_**. Of course, this is just one scenario of concurrency exceptions. In the previous code, without locking, in the `processFlushEvent` method, it is possible that the event may become invalid at any time when continuing executing in `processFlushEvent` method, which is why there is #1542. Also, there is #1560. --- After this PR: We will set a read lock for steps 1 and 2 above, a write lock for step 3, a read lock for step 4, and when encountering an `EventInvalidException` in step 5, we will call the `event.doCleanup()` method to release the memory. In this way, we can ensure the following things when resources are being cleaned up: 1. `ShuffleBuffers` that have not yet been converted to `FlushEvents` will not be converted in the future, but will be directly cleaned up. 2. `FlushEvents` that have been converted from `ShuffleBuffers` will definitely encounter an `EventInvalidException`, and we will eventually handle this exception correctly, releasing memory. 3. If there is already a `FlushEvent` being processed and it is about to be flushed to disk, the resource cleanup task will wait for all `FlushEvents` related to the `appId` to be completed before starting the cleanup task, ensuring that the cleanup and flushing tasks are completely independent and do not interfere with each other. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. --------- Co-authored-by: leslizhang <leslizhang@tencent.com>

…leId is cleaned up

…n` occurs (apache#1574) In the implementation of the methods `flushBuffer`, `handleEventAndUpdateMetrics`, and `removeBufferByShuffleId`, read-write locks have been added to manage concurrency. This ensures that a `ShuffleBuffer` successfully converted into a `flushEvent` won't be cleaned up again by `removeBufferByShuffleId`, and a `ShuffleBuffer` already cleaned up by `removeBufferByShuffleId` won't be transformed back into a `flushEvent`. This effectively resolves the concurrency issue. Fix apache#1571 & apache#1560 & apache#1542 The key logic of the PR is as follows: Before this PR: 1. A `ShuffleBuffer` is turned into a `FlushEvent`, and **_its blocks and size are cleared_** → 2. The `FlushEvent` is added to the flushing queue → 3. The method `removeBufferByShuffleId` is executed, which causes the following things to happen: 3.1. Running the following code snippet, but please note that in the code below, `buffer.getBlocks()` **_is empty and size is 0_**, because of the step 1 above: ``` for (ShuffleBuffer buffer : buffers) { buffer.getBlocks().forEach(spb -> spb.getData().release()); ShuffleServerMetrics.gaugeTotalPartitionNum.dec(); size += buffer.getSize(); } ``` 3.2. `appId` is removed from the `bufferPool` → 4. The `FlushEvent` is taken out from the queue and encounters an `EventInvalidException` because the `appId` was removed before → 5. When handling the `EventInvalidException`, nothing is done and the `event.doCleanup()` method **_is not called, causing a memory leak_**. Of course, this is just one scenario of concurrency exceptions. In the previous code, without locking, in the `processFlushEvent` method, it is possible that the event may become invalid at any time when continuing executing in `processFlushEvent` method, which is why there is apache#1542. Also, there is apache#1560. --- After this PR: We will set a read lock for steps 1 and 2 above, a write lock for step 3, a read lock for step 4, and when encountering an `EventInvalidException` in step 5, we will call the `event.doCleanup()` method to release the memory. In this way, we can ensure the following things when resources are being cleaned up: 1. `ShuffleBuffers` that have not yet been converted to `FlushEvents` will not be converted in the future, but will be directly cleaned up. 2. `FlushEvents` that have been converted from `ShuffleBuffers` will definitely encounter an `EventInvalidException`, and we will eventually handle this exception correctly, releasing memory. 3. If there is already a `FlushEvent` being processed and it is about to be flushed to disk, the resource cleanup task will wait for all `FlushEvents` related to the `appId` to be completed before starting the cleanup task, ensuring that the cleanup and flushing tasks are completely independent and do not interfere with each other. No. Existing UTs. --------- Co-authored-by: leslizhang <leslizhang@tencent.com>

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 13, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

694100a

… occurs

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 13, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

e3f730a

… occurs

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 13, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

cc6d990

… occurs

leslizhang mentioned this issue Mar 14, 2024

[#1571] fix(server): Memory may leak when EventInvalidException occurs #1574

Merged

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

69f2b57

… occurs

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

0904f87

… occurs

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

41e12ae

… occurs

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

dbaeb06

… occurs

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

7d6c01c

… occurs

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

1786106

… occurs

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 18, 2024

[apache#1571] fix(server): Memory may leak when EventInvalidException…

6385268

… occurs

zuston closed this as completed in #1574 Mar 18, 2024

rickyma added a commit to rickyma/incubator-uniffle that referenced this issue Mar 21, 2024

[apache#1571][FOLLOWUP] improvement(server): Fail fast when the shuff…

79c51ab

…leId is cleaned up

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Memory may leak when EventInvalidException occurs #1571

[Bug] Memory may leak when EventInvalidException occurs #1571

rickyma commented Mar 12, 2024 •

edited

Loading

rickyma commented Mar 12, 2024

jerqi commented Mar 13, 2024

[Bug] Memory may leak when EventInvalidException occurs #1571

[Bug] Memory may leak when EventInvalidException occurs #1571

Comments

rickyma commented Mar 12, 2024 • edited Loading

Code of Conduct

Search before asking

Describe the bug

Affects Version(s)

Uniffle Server Log Output

Uniffle Engine Log Output

Uniffle Server Configurations

Uniffle Engine Configurations

Additional context

Are you willing to submit PR?

rickyma commented Mar 12, 2024

jerqi commented Mar 13, 2024

rickyma commented Mar 12, 2024 •

edited

Loading