Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Memory may leak when EventInvalidException occurs #1571

Closed
3 tasks done
rickyma opened this issue Mar 12, 2024 · 2 comments · Fixed by #1574
Closed
3 tasks done

[Bug] Memory may leak when EventInvalidException occurs #1571

rickyma opened this issue Mar 12, 2024 · 2 comments · Fixed by #1574

Comments

@rickyma
Copy link
Contributor

rickyma commented Mar 12, 2024

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

This issue may cause the memory of the shuffle server to never be released.
image

[2024-03-08 17:51:25.737] [clearResourceThread] [INFO] HybridStorageManager.removeResources - Start to remove resource of AppPurgeEvent{appId='application_1703049085550_5925694_1709883294114', user='tdwadmin', shuffleIds=[0, 1, 2]}
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorageManager.cleanupStorageSelectionCache - Cleaning the storage selection cache costs: 4(ms) for event: AppPurgeEvent{appId='application_1703049085550_5925694_1709883294114', user='tdwadmin', shuffleIds=[0, 1, 2]}
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Start to remove resource of application_1703049085550_5925694_1709883294114/0
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Finish remove resource of application_1703049085550_5925694_1709883294114/0, disk size is 736983533790 and 2 shuffle metadata
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Start to remove resource of application_1703049085550_5925694_1709883294114/1
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Finish remove resource of application_1703049085550_5925694_1709883294114/1, disk size is 736983533790 and 2 shuffle metadata
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Start to remove resource of application_1703049085550_5925694_1709883294114/2
[2024-03-08 17:51:25.743] [clearResourceThread] [INFO] LocalStorage.removeResources - Finish remove resource of application_1703049085550_5925694_1709883294114/2, disk size is 134246092 and 1 shuffle metadata
[2024-03-08 17:51:25.764] [LocalFileFlushEventThreadPool-25] [WARN] ShuffleFlushManager.processFlushEvent - AppId application_1703049085550_5925694_1709883294114 was removed already, event:ShuffleDataFlushEvent: eventId=38029, appId=application_1703049085550_5925694_1709883294114, shuffleId=2, startPartition=13682, endPartition=13682, retryTimes=0, underStorage=LocalStorage, isPended=false, ownedByHugePartition=false should be dropped
[2024-03-08 17:51:25.770] [LocalFileFlushEventThreadPool-25] [WARN] ShuffleFlushManager.processFlushEvent - AppId application_1703049085550_5925694_1709883294114 was removed already, event:ShuffleDataFlushEvent: eventId=38030, appId=application_1703049085550_5925694_1709883294114, shuffleId=2, startPartition=13683, endPartition=13683, retryTimes=0, underStorage=LocalStorage, isPended=false, ownedByHugePartition=false should be dropped

This is due to the multithreading concurrency issue between the scheduled cleanup task and the flush event.

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@rickyma
Copy link
Contributor Author

rickyma commented Mar 12, 2024

We already have a fix, still working on it.

@jerqi
Copy link
Contributor

jerqi commented Mar 13, 2024

cc @zuston

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 13, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 13, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 13, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 15, 2024
leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this issue Mar 18, 2024
zuston pushed a commit that referenced this issue Mar 18, 2024
…urs (#1574)

### What changes were proposed in this pull request?

In the implementation of the methods `flushBuffer`, `handleEventAndUpdateMetrics`, and `removeBufferByShuffleId`, read-write locks have been added to manage concurrency. This ensures that a `ShuffleBuffer` successfully converted into a `flushEvent` won't be cleaned up again by `removeBufferByShuffleId`, and a `ShuffleBuffer` already cleaned up by `removeBufferByShuffleId` won't be transformed back into a `flushEvent`. This effectively resolves the concurrency issue.

### Why are the changes needed?

Fix #1571 & #1560 & #1542

The key logic of the PR is as follows:

Before this PR:
1. A `ShuffleBuffer` is turned into a `FlushEvent`, and **_its blocks and size are cleared_**
→
2. The `FlushEvent` is added to the flushing queue
→
3. The method `removeBufferByShuffleId` is executed, which causes the following things to happen:

3.1. Running the following code snippet, but please note that in the code below, `buffer.getBlocks()` **_is empty and size is 0_**, because of the step 1 above:
```
for (ShuffleBuffer buffer : buffers) {
  buffer.getBlocks().forEach(spb -> spb.getData().release());
  ShuffleServerMetrics.gaugeTotalPartitionNum.dec();
  size += buffer.getSize();
}
```

3.2. `appId` is removed from the `bufferPool`
→
4. The `FlushEvent` is taken out from the queue and encounters an `EventInvalidException` because the `appId` was removed before
→
5. When handling the `EventInvalidException`, nothing is done and the `event.doCleanup()` method **_is not called, causing a memory leak_**.
Of course, this is just one scenario of concurrency exceptions. In the previous code, without locking, in the `processFlushEvent` method, it is possible that the event may become invalid at any time when continuing executing in `processFlushEvent` method, which is why there is #1542. Also, there is #1560.

---

After this PR:
We will set a read lock for steps 1 and 2 above, a write lock for step 3, a read lock for step 4, and when encountering an `EventInvalidException` in step 5, we will call the `event.doCleanup()` method to release the memory.

In this way, we can ensure the following things when resources are being cleaned up:
1. `ShuffleBuffers` that have not yet been converted to `FlushEvents` will not be converted in the future, but will be directly cleaned up.
2. `FlushEvents` that have been converted from `ShuffleBuffers` will definitely encounter an `EventInvalidException`, and we will eventually handle this exception correctly, releasing memory.
3. If there is already a `FlushEvent` being processed and it is about to be flushed to disk, the resource cleanup task will wait for all `FlushEvents` related to the `appId` to be completed before starting the cleanup task, ensuring that the cleanup and flushing tasks are completely independent and do not interfere with each other.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.
---------

Co-authored-by: leslizhang <leslizhang@tencent.com>
rickyma added a commit to rickyma/incubator-uniffle that referenced this issue Mar 21, 2024
zuston pushed a commit to zuston/incubator-uniffle that referenced this issue May 27, 2024
…n` occurs (apache#1574)

In the implementation of the methods `flushBuffer`, `handleEventAndUpdateMetrics`, and `removeBufferByShuffleId`, read-write locks have been added to manage concurrency. This ensures that a `ShuffleBuffer` successfully converted into a `flushEvent` won't be cleaned up again by `removeBufferByShuffleId`, and a `ShuffleBuffer` already cleaned up by `removeBufferByShuffleId` won't be transformed back into a `flushEvent`. This effectively resolves the concurrency issue.

Fix apache#1571 & apache#1560 & apache#1542

The key logic of the PR is as follows:

Before this PR:
1. A `ShuffleBuffer` is turned into a `FlushEvent`, and **_its blocks and size are cleared_**
→
2. The `FlushEvent` is added to the flushing queue
→
3. The method `removeBufferByShuffleId` is executed, which causes the following things to happen:

3.1. Running the following code snippet, but please note that in the code below, `buffer.getBlocks()` **_is empty and size is 0_**, because of the step 1 above:
```
for (ShuffleBuffer buffer : buffers) {
  buffer.getBlocks().forEach(spb -> spb.getData().release());
  ShuffleServerMetrics.gaugeTotalPartitionNum.dec();
  size += buffer.getSize();
}
```

3.2. `appId` is removed from the `bufferPool`
→
4. The `FlushEvent` is taken out from the queue and encounters an `EventInvalidException` because the `appId` was removed before
→
5. When handling the `EventInvalidException`, nothing is done and the `event.doCleanup()` method **_is not called, causing a memory leak_**.
Of course, this is just one scenario of concurrency exceptions. In the previous code, without locking, in the `processFlushEvent` method, it is possible that the event may become invalid at any time when continuing executing in `processFlushEvent` method, which is why there is apache#1542. Also, there is apache#1560.

---

After this PR:
We will set a read lock for steps 1 and 2 above, a write lock for step 3, a read lock for step 4, and when encountering an `EventInvalidException` in step 5, we will call the `event.doCleanup()` method to release the memory.

In this way, we can ensure the following things when resources are being cleaned up:
1. `ShuffleBuffers` that have not yet been converted to `FlushEvents` will not be converted in the future, but will be directly cleaned up.
2. `FlushEvents` that have been converted from `ShuffleBuffers` will definitely encounter an `EventInvalidException`, and we will eventually handle this exception correctly, releasing memory.
3. If there is already a `FlushEvent` being processed and it is about to be flushed to disk, the resource cleanup task will wait for all `FlushEvents` related to the `appId` to be completed before starting the cleanup task, ensuring that the cleanup and flushing tasks are completely independent and do not interfere with each other.

No.

Existing UTs.
---------

Co-authored-by: leslizhang <leslizhang@tencent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants