-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-29380 Two concurrent remove peer requests may hang #7077
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I've successfully reproduced the problem with a simple UT. Let me think how to fix it properly. |
This comment has been minimized.
This comment has been minimized.
I refactored the MasterProcedureScheduler so the cleanup work for all type of queues share the same cleanup function, so I think it is enough to just test one type in the UT. Let's wait for the pre commit result. |
This comment has been minimized.
This comment has been minimized.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses HBASE-29380 by adding a test that reproduces the hanging scenario for concurrent peer-removal requests and refactors the scheduler’s cleanup logic to unify and extend support to global queues and table deletion.
- Introduce
TestProcedureWaitAndWake
to simulate two concurrent remove-peer procedures and ensure they complete. - Consolidate per-entity cleanup (peer, server, global) into a generic
tryCleanupQueue
helper. - Add a
markTableAsDeleted
method for safely removing table queues under test restrictions.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestProcedureWaitAndWake.java | Adds a new test to simulate and verify that two concurrent remove-peer procedures do not hang. |
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java | Refactors cleanup methods into a generic tryCleanupQueue , adds global‐queue cleanup, and exposes markTableAsDeleted for tests. |
Comments suppressed due to low confidence (3)
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestProcedureWaitAndWake.java:166
- The test calls waitFor but does not assert its result. If the condition times out, the test will still pass. Consider wrapping with an assertTrue on the waitFor return value to ensure failures are detected.
UTIL.waitFor(10000, () -> procExec.isFinished(id1));
hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestProcedureWaitAndWake.java:167
- As above, add an assertion for the waitFor return value to ensure the second procedure also finishes within the timeout.
UTIL.waitFor(10000, () -> procExec.isFinished(id2));
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java:608
- [nitpick] The parameter
getMap
is aSupplier<TNode>
but its name is ambiguous. Consider renaming toqueueSupplier
ormapSupplier
for clarity.
private <T extends Comparable<T>, TNode extends Queue<T>> boolean tryCleanupQueue(T id,
...-server/src/main/java/org/apache/hadoop/hbase/master/procedure/MasterProcedureScheduler.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find. Nice test.
Signed-off-by: Nihal Jain <nihaljain@apache.org> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org> (cherry picked from commit 7cc2f54)
Signed-off-by: Nihal Jain <nihaljain@apache.org> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org> (cherry picked from commit 7cc2f54)
(cherry picked from commit 7cc2f54) Signed-off-by: Nihal Jain <nihaljain@apache.org> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
No description provided.