Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1631] fix(server): ShuffleTaskInfo may leak when app is removed #1632

Merged
merged 1 commit into from
Apr 10, 2024

Conversation

zhengchenyu
Copy link
Collaborator

What changes were proposed in this pull request?

Fix: #1631

Does this PR introduce any user-facing change?

No.

How was this patch tested?

no test, obvious mistake

Copy link

github-actions bot commented Apr 9, 2024

Test Results

 2 363 files  ±0   2 363 suites  ±0   4h 31m 3s ⏱️ -20s
   912 tests ±0     910 ✅  - 1   1 💤 ±0  1 ❌ +1 
10 585 runs  ±0  10 570 ✅  - 1  14 💤 ±0  1 ❌ +1 

For more details on these failures, see this check.

Results for commit aecca27. ± Comparison against base commit 3ea3aaa.

@zuston zuston changed the title [#1631] fix: ShuffleTaskInfo may leak when app is removed. [#1631] fix(server): ShuffleTaskInfo may leak when app is removed Apr 10, 2024
Copy link
Member

@zuston zuston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you help review this? @rickyma

Copy link
Contributor

@rickyma rickyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@zuston zuston merged commit 1431c8a into apache:master Apr 10, 2024
39 of 41 checks passed
zuston pushed a commit to zuston/incubator-uniffle that referenced this pull request May 27, 2024
…ed. (apache#1632)

### What changes were proposed in this pull request?

In our cluster, delete pod is denied by web hook, even though all application is deleted for long time.
When I curl http://host:ip/metrics/server, I found app_num_with_node is 1. 
The problem is some application is leaked. I see many duplicated logs `[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[appattempt_xxx_xx_xx] according to rss.server.app.expired.withoutHeartbeat`.
When I jstack the server many times, clearResourceThread will be stuck forever, here is the call stack.
```
"clearResourceThread" apache#40 daemon prio=5 os_prio=0 cpu=3767.63ms elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition  [0x00007f24f7b33000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.22/Native Method)
	- parking to wait for  <0x00007f28d5e29f20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.22/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.22/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.22/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.22/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.22/ReentrantReadWriteLock.java:959)
	at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
	at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
	at org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown Source)
	at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
```

Apparently there's a lock that's not being released. Looking at the code, it's easy to see that the read lock in the flushBuffer is not released correctly.  The log ` ShuffleBufferManager.flushBuffer - Shuffle[3066071] for app[appattempt_xxx] has already been removed, no need to flush the buffer` proved it.

Fix: apache#1631 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

no test, obvious mistake
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] ShuffleTaskInfo may leak when app is removed.
3 participants