HBASE-27781 Fix case of action counter assertion error in handling of batch operation timeout exceeded #7079
base: branch-2
… batch operation timeout exceeded
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
The hbase-server test failures do not look related.
for (Action action : currentActions) {
  if (isOperationTimeoutExceeded()) {
    String message = numAttempt == 1
The logic here has been preserved exactly inside failIncompleteActionsWithOpTimeout. I opted to move it into a new method because groupAndSendMultiAction is already quite long and complex, and we need to add more logic to it to handle this bug. I believe it's better to do this timeout handling inside a separate, clearly named method with a docstring.
boolean actionAlreadyFailed =
  locateRegionFailedActions != null && locateRegionFailedActions.stream().anyMatch(
    failedAction -> failedAction.getOriginalIndex() == actionToFail.getOriginalIndex()
      && failedAction.getReplicaId() == actionToFail.getReplicaId());
if (!actionAlreadyFailed) {
The logic for avoiding the assertion error is here; the rest of the method is existing logic from groupAndSendMultiAction.
Pull Request Overview
This PR fixes a bug where actions that already failed due to location resolution are being double-failed during a batch operation timeout, causing the actions in progress counter to go negative.
- Added a new unit test to validate that actions with location failures aren’t double-failed.
- Refactored the timeout handling in AsyncRequestFutureImpl.java by introducing a helper method that excludes already failed actions from being failed again.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientOperationTimeout.java | Adds a new test case to validate correct handling of operation timeout with mixed action failures. |
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java | Introduces the failIncompleteActionsWithOpTimeout method and updates logic to avoid double failing actions. |
 * decremented properly for all actions, see last catch block
 */
@Test
public void testMultiOperationTimoutWithLocationError() throws IOException, InterruptedException {
There is a typo in the test method name 'testMultiOperationTimoutWithLocationError'. Consider renaming it to 'testMultiOperationTimeoutWithLocationError' for clarity.
- public void testMultiOperationTimoutWithLocationError() throws IOException, InterruptedException {
+ public void testMultiOperationTimeoutWithLocationError() throws IOException, InterruptedException {
if (locateRegionFailedActions == null) {
  locateRegionFailedActions = new ArrayList<>(1);
}
[nitpick] The null check and initialization for 'locateRegionFailedActions' is repeated in multiple places. Consider extracting this logic into a helper method to reduce duplication and improve maintainability.
- if (locateRegionFailedActions == null) {
-   locateRegionFailedActions = new ArrayList<>(1);
- }
+ locateRegionFailedActions = initializeIfNull(locateRegionFailedActions);
https://issues.apache.org/jira/browse/HBASE-27781
+Background+
In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded during location resolution. In that handling, we loop over all actions still being processed in groupAndSendMultiAction at the time the operation timeout is exceeded and set them as failed. The problem is that some of these actions may already have been failed to completion by the time we reach this point: if we fail to resolve the region location for an action, we fail it to completion in findAllLocationsOrFail. (Failing to completion means setting the error for the action, decrementing the actions in progress counter, and not retrying the action.) We should not "double fail" any action that was already failed due to failed location resolution, because doing so decrements the actions in progress counter twice for the same action and throws off the (atomic) counter accounting that the sync client relies on to tell when the batch operation is complete.
+Problem+
In the for loop that handles the exceeded operation timeout, we fail all actions in groupAndSendMultiAction (and decrement the actions in progress counter for each of them), including the aforementioned actions that were already failed through findAllLocationsOrFail. If there was a location failure, we therefore decrement the actions in progress counter more times than there are actions. The counter goes negative, which should never happen, and the assertion on it fails. The HBase client then throws an unchecked {{AssertionError}} that is not handled within the client and bubbles up to the user application layer, where it may kill the caller thread or application. The operation should instead have timed out with a RetriesExhaustedWithDetails exception, since the user application layer should not be catching {{Error}} and its subclasses such as {{AssertionError}}.
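To illustrate why an escaping AssertionError is worse for callers than a timeout exception, here is a minimal sketch in plain Java. The class and method names are hypothetical and nothing here is HBase code; it only shows that a handler written for `Exception` does not intercept an `Error`.

```java
final class UncheckedErrorDemo {
  /** Hypothetical stand-in for a client call that trips an internal assertion. */
  static void batchThatTripsAssertion() {
    throw new AssertionError("actions in progress counter went negative");
  }

  /** Returns a description of which handler, if any, saw the failure. */
  static String callAndClassify() {
    try {
      try {
        batchThatTripsAssertion();
        return "completed";
      } catch (Exception e) {
        // Typical application-level handling: AssertionError is not an Exception,
        // so this branch is never taken for it.
        return "caught as Exception";
      }
    } catch (Error e) {
      // Outer handler exists only for this demonstration; applications
      // normally do not catch Error.
      return "escaped as Error: " + e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(callAndClassify());
  }
}
```

Running `main` prints `escaped as Error: AssertionError`: the inner handler is skipped, which is what a caller that only expects exceptions would experience.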
+Triggering scenario/reproduction+
The most common scenario where one could hit this bug is meta slowness while running batch operations. Suppose we have a batch with 3 actions and, while trying to resolve the location for the first action, requests to the meta table time out repeatedly, so the meta scan attempts for the first action consume the entire operation timeout. In this case, we fail the first action through findAllLocationsOrFail, which brings the actionsInProgress counter to 2. We then loop over all three actions and fail each of them: the first two failures bring the counter to zero, and the third attempts to decrement it to -1 and hits the assertion error. This is what the test case in the PR reproduces.
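The counter arithmetic in this scenario can be sketched with a plain AtomicLong. This is a simplified model, not the client code: the class name, the `simulate` method, and its parameters are assumptions made for illustration only.

```java
import java.util.concurrent.atomic.AtomicLong;

final class DoubleFailSketch {
  /**
   * Models the buggy accounting: some actions are failed during location
   * resolution, then the timeout handling fails every action in the batch.
   * Returns the counter value after the last decrement.
   */
  static long simulate(int batchSize, int failedDuringLocate) {
    AtomicLong actionsInProgress = new AtomicLong(batchSize);

    // Location resolution fails some actions to completion (one decrement each).
    for (int i = 0; i < failedDuringLocate; i++) {
      actionsInProgress.decrementAndGet();
    }

    // The timeout handling then fails every action, including the ones above.
    long last = actionsInProgress.get();
    for (int i = 0; i < batchSize; i++) {
      last = actionsInProgress.decrementAndGet();
    }
    return last;
  }

  public static void main(String[] args) {
    // 3 actions, 1 already failed: 3 -> 2, then 2 -> 1 -> 0 -> -1
    System.out.println(simulate(3, 1));
  }
}
```

With a batch of 3 and one location failure, the final value is -1, matching the sequence described above. With no location failure, the same loop ends at 0, which is why the bug only shows up when location resolution fails.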
+Solution+
We still want to fail all remaining/incomplete actions being processed in groupAndSendMultiAction at the time the operation timeout is exceeded, because there is no time remaining to execute them, but we need special handling to avoid failing actions that were already failed due to failed location resolution.
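A minimal, self-contained sketch of that special handling follows. It mirrors the original-index and replica-id comparison shown in the review diff, but `SketchAction`, `failIncompleteActions`, and `run` are hypothetical stand-ins rather than the actual HBase types or the method added by the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class SkipAlreadyFailedSketch {
  /** Hypothetical stand-in for an action, identified by original index and replica id. */
  static final class SketchAction {
    final int originalIndex;
    final int replicaId;

    SketchAction(int originalIndex, int replicaId) {
      this.originalIndex = originalIndex;
      this.replicaId = replicaId;
    }
  }

  /** Fails only actions that were not already failed during location resolution. */
  static int failIncompleteActions(List<SketchAction> currentActions,
      List<SketchAction> locateRegionFailedActions, AtomicLong actionsInProgress) {
    int failedNow = 0;
    for (SketchAction actionToFail : currentActions) {
      boolean actionAlreadyFailed = locateRegionFailedActions != null
        && locateRegionFailedActions.stream().anyMatch(
          failed -> failed.originalIndex == actionToFail.originalIndex
            && failed.replicaId == actionToFail.replicaId);
      if (!actionAlreadyFailed) {
        actionsInProgress.decrementAndGet(); // one decrement per action, never two
        failedNow++;
      }
    }
    return failedNow;
  }

  /** Replays the 3-action scenario and returns the final counter value. */
  static long run() {
    List<SketchAction> batch = Arrays.asList(
      new SketchAction(0, 0), new SketchAction(1, 0), new SketchAction(2, 0));
    AtomicLong actionsInProgress = new AtomicLong(batch.size());

    // The first action fails during location resolution.
    List<SketchAction> locateFailed = new ArrayList<>(1);
    locateFailed.add(batch.get(0));
    actionsInProgress.decrementAndGet();

    // The timeout handling fails only the two remaining actions.
    failIncompleteActions(batch, locateFailed, actionsInProgress);
    return actionsInProgress.get();
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

In this model the counter ends at 0 instead of -1, because the action that already failed during location resolution is skipped. The linear scan over the failed list is acceptable here on the assumption that location failures within a single batch are few.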