HBASE-27781 Fix case of action counter assertion error in handling of batch operation timeout exceeded #7079

Open
wants to merge 1 commit into
base: branch-2
from

Conversation

droudnitsky
Contributor

@droudnitsky droudnitsky commented Jun 8, 2025

https://issues.apache.org/jira/browse/HBASE-27781

+Background+

In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded during location resolution. In that handling, we loop over all actions still being processed in groupAndSendMultiAction at the time the operation timeout is exceeded and set them as failed. The problem is that some of these actions may already have been failed to completion by the time we get to this spot: if we fail to resolve the region location for an action, we fail it to completion in findAllLocationsOrFail (fail to completion == set the error for the action, decrement the actions in progress counter, and do not retry the action again). We should not "double fail" any action that was already failed due to failed location resolution, because we would decrement the actions in progress counter twice for the same action and throw off the (atomic) action counter accounting that the sync client relies on to tell when the batch operation is complete.

+Problem+

In the for loop we fail all actions in groupAndSendMultiAction (and decrement the actions in progress counter for each of them), including the aforementioned actions that were already failed through findAllLocationsOrFail. If there was a location failure, this decrements the actions in progress counter more times than there are actions. The counter goes negative, which should never happen, and this triggers an assertion error. The HBase client therefore throws an unchecked error that is not handled within the client and bubbles up to the user application layer invoking the client, where it may kill the caller thread/application. The operation should have timed out with a RetriesExhaustedWithDetails exception rather than an unchecked AssertionError, since the user application layer should not be catching Error and its subclasses such as AssertionError.

+Triggering scenario/reproduction+

The most common scenario where one could hit this bug is meta slowness while running batch operations. Suppose we have a batch with 3 actions and, on trying to resolve the location for the first action, we repeatedly time out against the meta table due to meta slowness and consume the entire operation timeout on the meta scan attempts to resolve the location of the first action. In this case, we fail the first action through findAllLocationsOrFail, which brings the actionsInProgress counter to 2. We then loop over all three actions and fail each of them: on the third failure attempt the actions in progress counter is already zero, we attempt to decrement it to -1, and we hit the assertion error. This is what the test case in the PR reproduces.
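
The counter arithmetic in this scenario can be sketched in isolation. The snippet below is only an illustration under stated assumptions: the class DoubleFailDemo and its method are hypothetical and do not exist in HBase, and a bare AtomicLong stands in for the client's actions in progress counter.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-alone model of the accounting, not HBase client code.
class DoubleFailDemo {

  /**
   * Simulates the sequence described above and returns the final counter value.
   * batchSize actions start in progress, locationFailures of them are failed to
   * completion first, then the timeout loop fails every action in the batch.
   */
  static long simulateTimeoutHandlingWithoutSkip(int batchSize, int locationFailures) {
    AtomicLong actionsInProgress = new AtomicLong(batchSize);

    // Location resolution failure: each such action is failed to completion once.
    for (int i = 0; i < locationFailures; i++) {
      actionsInProgress.decrementAndGet();
    }

    // Timeout handling: every action is failed, including the ones failed above.
    for (int i = 0; i < batchSize; i++) {
      actionsInProgress.decrementAndGet();
    }
    return actionsInProgress.get();
  }

  public static void main(String[] args) {
    // 3 actions, 1 location failure: 3 -> 2, then 2 -> 1 -> 0 -> -1.
    System.out.println(simulateTimeoutHandlingWithoutSkip(3, 1));
  }
}
```

With no location failures the same loop ends at zero, which is why the problem only shows up when location resolution has already failed at least one action.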

+Solution+
We still want to fail all remaining/incomplete actions being processed in groupAndSendMultiAction at the time the operation timeout is exceeded, because there is no time remaining to execute them, but we need special handling to avoid failing actions that were already failed due to failed location resolution.
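
As a rough sketch of that idea (not the patch itself), the timeout loop can skip any action that is recorded as already failed. The class SkipAlreadyFailedDemo, its method, and the use of plain integer indices to identify actions are assumptions made for illustration only.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-alone model of the accounting, not HBase client code.
class SkipAlreadyFailedDemo {

  /**
   * Fails every action in the batch except those whose index is in alreadyFailed,
   * and returns the final value of the in-progress counter.
   */
  static long failRemainingActions(int batchSize, Set<Integer> alreadyFailed) {
    // Actions failed during location resolution were already decremented once.
    AtomicLong actionsInProgress = new AtomicLong(batchSize - alreadyFailed.size());

    for (int index = 0; index < batchSize; index++) {
      if (alreadyFailed.contains(index)) {
        continue; // already failed to completion, do not decrement again
      }
      actionsInProgress.decrementAndGet();
    }
    return actionsInProgress.get();
  }

  public static void main(String[] args) {
    // 3 actions, action 0 already failed on location resolution: ends at 0.
    System.out.println(failRemainingActions(3, Collections.singleton(0)));
  }
}
```

Each action is decremented exactly once regardless of how many location failures occurred, so the counter reaches zero and never goes negative.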

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 3m 34s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ branch-2 Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for branch
+1 💚 mvninstall 3m 19s branch-2 passed
+1 💚 compile 3m 57s branch-2 passed
+1 💚 checkstyle 0m 56s branch-2 passed
+1 💚 spotbugs 2m 22s branch-2 passed
+1 💚 spotless 0m 48s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 17s Maven dependency ordering for patch
+1 💚 mvninstall 3m 5s the patch passed
+1 💚 compile 3m 48s the patch passed
+1 💚 javac 3m 48s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 16s hbase-client: The patch generated 0 new + 11 unchanged - 1 fixed = 11 total (was 12)
+1 💚 checkstyle 0m 38s The patch passed checkstyle in hbase-server
+1 💚 spotbugs 2m 37s the patch passed
+1 💚 hadoopcheck 17m 9s Patch does not cause any errors with Hadoop 2.10.2 or 3.3.6 3.4.0.
+1 💚 spotless 0m 44s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 18s The patch does not generate ASF License warnings.
46m 3s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #7079
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux af10e10fbd40 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 6caae7a
Default Java Eclipse Adoptium-11.0.23+9
Max. process+thread count 79 (vs. ulimit of 30000)
modules C: hbase-client hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 45s Docker mode activated.
-0 ⚠️ yetus 0m 4s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+0 🆗 mvndep 0m 11s Maven dependency ordering for branch
+1 💚 mvninstall 3m 14s branch-2 passed
+1 💚 compile 1m 20s branch-2 passed
+1 💚 javadoc 0m 46s branch-2 passed
+1 💚 shadedjars 6m 18s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 16s Maven dependency ordering for patch
+1 💚 mvninstall 3m 7s the patch passed
+1 💚 compile 1m 17s the patch passed
+1 💚 javac 1m 17s the patch passed
+1 💚 javadoc 0m 45s the patch passed
+1 💚 shadedjars 6m 17s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 8m 17s hbase-client in the patch passed.
-1 ❌ unit 204m 52s /patch-unit-hbase-server.txt hbase-server in the patch failed.
242m 18s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #7079
Optional Tests javac javadoc unit compile shadedjars
uname Linux 6ed088a2355a 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 6caae7a
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/testReport/
Max. process+thread count 4391 (vs. ulimit of 30000)
modules C: hbase-client hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 3m 11s Docker mode activated.
-0 ⚠️ yetus 0m 4s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+0 🆗 mvndep 0m 10s Maven dependency ordering for branch
+1 💚 mvninstall 3m 6s branch-2 passed
+1 💚 compile 1m 13s branch-2 passed
+1 💚 javadoc 0m 49s branch-2 passed
+1 💚 shadedjars 5m 53s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for patch
+1 💚 mvninstall 2m 59s the patch passed
+1 💚 compile 1m 13s the patch passed
+1 💚 javac 1m 13s the patch passed
+1 💚 javadoc 0m 48s the patch passed
+1 💚 shadedjars 5m 47s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 8m 20s hbase-client in the patch passed.
-1 ❌ unit 227m 50s /patch-unit-hbase-server.txt hbase-server in the patch failed.
266m 24s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #7079
Optional Tests javac javadoc unit compile shadedjars
uname Linux 04bc06e2c740 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 6caae7a
Default Java Temurin-1.8.0_412-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/testReport/
Max. process+thread count 4349 (vs. ulimit of 30000)
modules C: hbase-client hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 46s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+0 🆗 mvndep 0m 14s Maven dependency ordering for branch
+1 💚 mvninstall 3m 22s branch-2 passed
+1 💚 compile 1m 12s branch-2 passed
+1 💚 javadoc 0m 44s branch-2 passed
+1 💚 shadedjars 6m 39s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 17s Maven dependency ordering for patch
+1 💚 mvninstall 3m 11s the patch passed
+1 💚 compile 1m 12s the patch passed
+1 💚 javac 1m 12s the patch passed
+1 💚 javadoc 0m 44s the patch passed
+1 💚 shadedjars 6m 39s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 8m 22s hbase-client in the patch passed.
-1 ❌ unit 238m 4s /patch-unit-hbase-server.txt hbase-server in the patch failed.
276m 41s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #7079
Optional Tests javac javadoc unit compile shadedjars
uname Linux 31fa1399831c 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 6caae7a
Default Java Eclipse Adoptium-11.0.23+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/testReport/
Max. process+thread count 4429 (vs. ulimit of 30000)
modules C: hbase-client hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@droudnitsky
Contributor Author

hbase-server test failures do not look related

for (Action action : currentActions) {
  if (isOperationTimeoutExceeded()) {
    String message = numAttempt == 1
Contributor Author

The logic here has been preserved exactly inside failIncompleteActionsWithOpTimeout. I opted to move it into a new method because groupAndSendMultiAction is already quite long and complex, and we need to add more logic to it to handle this bug. I believe it's better to do this timeout handling inside a separate method that is clearly named and has a docstring.

Comment on lines +583 to +587
boolean actionAlreadyFailed =
  locateRegionFailedActions != null && locateRegionFailedActions.stream().anyMatch(
    failedAction -> failedAction.getOriginalIndex() == actionToFail.getOriginalIndex()
      && failedAction.getReplicaId() == actionToFail.getReplicaId());
if (!actionAlreadyFailed) {
Contributor Author

The logic for avoiding the assertion error is here; the rest of the method is existing logic from groupAndSendMultiAction.
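
For readers without the HBase source at hand, the check in the commented lines can be exercised on its own. This is a minimal sketch: SimpleAction is a hypothetical stand-in for the client's Action type that carries only the two fields used for matching, and AlreadyFailedCheck and isAlreadyFailed are illustrative names rather than part of the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone version of the matching logic, not HBase client code.
class AlreadyFailedCheck {

  /** Stand-in for Action, reduced to the fields the check compares. */
  static final class SimpleAction {
    private final int originalIndex;
    private final int replicaId;

    SimpleAction(int originalIndex, int replicaId) {
      this.originalIndex = originalIndex;
      this.replicaId = replicaId;
    }

    int getOriginalIndex() {
      return originalIndex;
    }

    int getReplicaId() {
      return replicaId;
    }
  }

  /** True if actionToFail matches an entry in the (possibly null) failed list. */
  static boolean isAlreadyFailed(List<SimpleAction> locateRegionFailedActions,
    SimpleAction actionToFail) {
    return locateRegionFailedActions != null
      && locateRegionFailedActions.stream().anyMatch(
        failedAction -> failedAction.getOriginalIndex() == actionToFail.getOriginalIndex()
          && failedAction.getReplicaId() == actionToFail.getReplicaId());
  }

  public static void main(String[] args) {
    List<SimpleAction> failed = new ArrayList<>();
    failed.add(new SimpleAction(0, 0));
    System.out.println(isAlreadyFailed(failed, new SimpleAction(0, 0))); // same index and replica
    System.out.println(isAlreadyFailed(failed, new SimpleAction(0, 1))); // different replica
    System.out.println(isAlreadyFailed(null, new SimpleAction(0, 0)));   // no location failures
  }
}
```

Matching on both the original index and the replica id means a failed replica does not mask a different replica of the same action, and a null list is treated as "nothing has failed yet".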

@Apache9 Apache9 requested a review from Copilot June 21, 2025 14:03

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes a bug where actions that already failed due to location resolution are being double-failed during a batch operation timeout, causing the actions in progress counter to go negative.

  • Added a new unit test to validate that actions with location failures aren’t double-failed.
  • Refactored the timeout handling in AsyncRequestFutureImpl.java by introducing a helper method that excludes already failed actions from being failed again.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientOperationTimeout.java Adds a new test case to validate correct handling of operation timeout with mixed action failures.
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java Introduces the failIncompleteActionsWithOpTimeout method and updates logic to avoid double failing actions.

* decremented properly for all actions, see last catch block
*/
@Test
public void testMultiOperationTimoutWithLocationError() throws IOException, InterruptedException {

Copilot AI Jun 21, 2025

There is a typo in the test method name 'testMultiOperationTimoutWithLocationError'. Consider renaming it to 'testMultiOperationTimeoutWithLocationError' for clarity.

Suggested change
public void testMultiOperationTimoutWithLocationError() throws IOException, InterruptedException {
public void testMultiOperationTimeoutWithLocationError() throws IOException, InterruptedException {

Comment on lines +461 to +463
if (locateRegionFailedActions == null) {
  locateRegionFailedActions = new ArrayList<>(1);
}

Copilot AI Jun 21, 2025

[nitpick] The null check and initialization for 'locateRegionFailedActions' is repeated in multiple places. Consider extracting this logic into a helper method to reduce duplication and improve maintainability.

Suggested change
if (locateRegionFailedActions == null) {
  locateRegionFailedActions = new ArrayList<>(1);
}
locateRegionFailedActions = initializeIfNull(locateRegionFailedActions);
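
The suggestion does not show what initializeIfNull would look like. One possible shape is sketched below, purely as an assumption: the helper does not exist in the HBase client, and the enclosing class name LazyListInit is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the reviewer's suggestion, not HBase client code.
class LazyListInit {

  /** Returns the given list unchanged, or a new small list if it is null. */
  static <T> List<T> initializeIfNull(List<T> list) {
    return list != null ? list : new ArrayList<T>(1);
  }

  public static void main(String[] args) {
    List<String> failed = null;
    failed = initializeIfNull(failed); // allocates on first use
    failed.add("action-0");
    failed = initializeIfNull(failed); // keeps the existing list
    System.out.println(failed.size());
  }
}
```

The initial capacity of 1 mirrors the original snippet and keeps the allocation small for the common case where few actions fail location resolution.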

