HBASE-27781 Fix case of action counter assertion error in handling of batch operation timeout exceeded #7079

droudnitsky · 2025-06-08T17:30:01Z

https://issues.apache.org/jira/browse/HBASE-27781

+Background+

In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded during location resolution. In that handling, we loop over all actions still being processed in groupAndSendMultiAction at the time the operation timeout is exceeded and set them as failed. The problem is that some of these actions may already have been failed to completion by the time we reach this point. If we fail to resolve the region location for an action, we fail it to completion in findAllLocationsOrFail (fail to completion == set the error for the action, decrement the actions in progress counter, and do not retry the action again). We should not "double fail" any action that was already failed due to failed location resolution, because that decrements the actions in progress counter twice for the same action and throws off the (atomic) action counter accounting the sync client relies on to tell when the batch operation is complete.

+Problem+

In the for loop we fail all actions (and decrement the actions in progress counter for each of them) in groupAndSendMultiAction. This includes the aforementioned actions that were already failed through findAllLocationsOrFail, so if there was a location failure we decrement the actions in progress counter more times than there are actions. The counter goes negative, which should never happen, and this trips an assertion error on the counter. The HBase client therefore throws an unchecked AssertionError that is not handled within the client and bubbles up to the user application layer invoking the client, where it may kill the caller thread/application. The operation should instead have timed out with a RetriesExhaustedWithDetails exception, and the user application layer should not be catching {{Error}} and its subclasses like {{AssertionError}}.
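
To illustrate why the AssertionError reaches the application, here is a minimal sketch. It is a simplified model and not HBase code: the class and method names are hypothetical, and the outer `catch (Error e)` exists only so the demo can report what happened. It shows that a handler written for `Exception` does not intercept an `AssertionError`.

```java
// Simplified model, not HBase code. All names here are hypothetical.
class ErrorEscapesDemo {

  /** Stand-in for a client call that trips an internal assertion. */
  static void batchThatTripsAssertion() {
    throw new AssertionError("actions in progress went negative: -1");
  }

  /** Returns a description of which handler, if any, observed the failure. */
  static String callLikeAnApplication() {
    try {
      try {
        batchThatTripsAssertion();
        return "completed";
      } catch (Exception e) {
        // Typical application handler: an ordinary exception, such as the
        // expected timeout failure, would land here.
        return "handled as Exception";
      }
    } catch (Error e) {
      // Present only so the demo can report the outcome. Applications
      // normally have no such handler, so the calling thread would die.
      return "escaped as " + e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(callLikeAnApplication()); // prints "escaped as AssertionError"
  }
}
```

`AssertionError` extends `Error`, not `Exception`, which is why the inner handler is skipped.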

+Triggering scenario/reproduction+

The most common scenario where one could hit this bug is meta slowness while running batch operations. Suppose we have a batch with 3 actions, and while trying to resolve the location for the first action we repeatedly time out against the meta table due to meta slowness, consuming the entire operation timeout on the meta scan attempts for that first action. In this case, we fail the first action through findAllLocationsOrFail, which brings the actionsInProgress counter to 2, and then we loop over all three actions and fail each of them. On the third failure attempt the actions in progress counter is zero, we attempt to decrement it to -1, and we hit the assertion error. This is what the test case in the PR reproduces.
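
The arithmetic in this scenario can be checked with a small sketch. It is a simplified model under stated assumptions, not the client code: `DoubleFailDemo` and `simulate` are hypothetical names, and the model only counts decrements instead of asserting on each one as the client does.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of the counter accounting. Names are hypothetical.
class DoubleFailDemo {

  /**
   * Replays the two decrement phases from the scenario and returns the
   * final counter value.
   */
  static long simulate(int totalActions, int failedDuringLocate) {
    AtomicLong actionsInProgress = new AtomicLong(totalActions);

    // Phase 1: location resolution fails some actions to completion.
    for (int i = 0; i < failedDuringLocate; i++) {
      actionsInProgress.decrementAndGet();
    }

    // Phase 2: timeout handling fails every action in the batch,
    // including the ones phase 1 already accounted for.
    for (int i = 0; i < totalActions; i++) {
      actionsInProgress.decrementAndGet();
    }
    return actionsInProgress.get();
  }

  public static void main(String[] args) {
    // 3 actions, 1 failed during location resolution: 3 - 1 - 3 = -1
    System.out.println(simulate(3, 1)); // prints -1
  }
}
```

With no location failures the same model ends at zero, which matches the description that the bug only appears when a location failure precedes the timeout handling.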

+Solution+
We still want to fail all remaining/incomplete actions being processed in groupAndSendMultiAction when the operation timeout is exceeded, because there is no time remaining to execute them, but we need special handling to avoid failing actions which were already failed due to failed location resolution.
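
A minimal sketch of this idea, as a simplified model and not the patch itself: `SimpleAction`, `failRemaining`, and `run` are hypothetical names, and the comparison on original index plus replica id mirrors the check visible in the diff for AsyncRequestFutureImpl.java. Each remaining action is failed once, and actions already failed during location resolution are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of "fail only what has not already failed".
// All names are hypothetical stand-ins.
class SkipAlreadyFailedDemo {

  /** Stand-in for an action, identified by original index and replica id. */
  static final class SimpleAction {
    final int originalIndex;
    final int replicaId;

    SimpleAction(int originalIndex, int replicaId) {
      this.originalIndex = originalIndex;
      this.replicaId = replicaId;
    }
  }

  /** Decrements the counter once per action that was not failed earlier. */
  static long failRemaining(List<SimpleAction> current,
      List<SimpleAction> alreadyFailed, AtomicLong inProgress) {
    for (SimpleAction action : current) {
      boolean failedBefore = alreadyFailed != null && alreadyFailed.stream()
        .anyMatch(f -> f.originalIndex == action.originalIndex
          && f.replicaId == action.replicaId);
      if (!failedBefore) {
        inProgress.decrementAndGet();
      }
    }
    return inProgress.get();
  }

  /** Replays the 3-action scenario and returns the final counter value. */
  static long run() {
    List<SimpleAction> batch = Arrays.asList(
      new SimpleAction(0, 0), new SimpleAction(1, 0), new SimpleAction(2, 0));
    AtomicLong inProgress = new AtomicLong(batch.size());

    // The first action fails during location resolution: counter goes 3 -> 2.
    List<SimpleAction> locateFailed = new ArrayList<>();
    locateFailed.add(batch.get(0));
    inProgress.decrementAndGet();

    // Timeout handling fails only the two remaining actions: 2 -> 0.
    return failRemaining(batch, locateFailed, inProgress);
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints 0
  }
}
```

The linear scan per action is acceptable in this sketch because the list of location failures is expected to be small; a set keyed on index and replica id would be an alternative for large batches.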


Apache-HBase · 2025-06-08T18:21:59Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	3m 34s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	detsecrets	0m 0s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	hbaseanti	0m 0s		Patch does not have any anti-patterns.
			_ branch-2 Compile Tests _
+0 🆗	mvndep	0m 12s		Maven dependency ordering for branch
+1 💚	mvninstall	3m 19s		branch-2 passed
+1 💚	compile	3m 57s		branch-2 passed
+1 💚	checkstyle	0m 56s		branch-2 passed
+1 💚	spotbugs	2m 22s		branch-2 passed
+1 💚	spotless	0m 48s		branch has no errors when running spotless:check.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 17s		Maven dependency ordering for patch
+1 💚	mvninstall	3m 5s		the patch passed
+1 💚	compile	3m 48s		the patch passed
+1 💚	javac	3m 48s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 16s		hbase-client: The patch generated 0 new + 11 unchanged - 1 fixed = 11 total (was 12)
+1 💚	checkstyle	0m 38s		The patch passed checkstyle in hbase-server
+1 💚	spotbugs	2m 37s		the patch passed
+1 💚	hadoopcheck	17m 9s		Patch does not cause any errors with Hadoop 2.10.2 or 3.3.6 3.4.0.
+1 💚	spotless	0m 44s		patch has no errors when running spotless:check.
			_ Other Tests _
+1 💚	asflicense	0m 18s		The patch does not generate ASF License warnings.
		46m 3s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#7079
Optional Tests	dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname	Linux af10e10fbd40 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	branch-2 / `6caae7a`
Default Java	Eclipse Adoptium-11.0.23+9
Max. process+thread count	79 (vs. ulimit of 30000)
modules	C: hbase-client hbase-server U: .
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/console
versions	git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2025-06-08T21:38:10Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 45s		Docker mode activated.
-0 ⚠️	yetus	0m 4s		Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
			_ Prechecks _
			_ branch-2 Compile Tests _
+0 🆗	mvndep	0m 11s		Maven dependency ordering for branch
+1 💚	mvninstall	3m 14s		branch-2 passed
+1 💚	compile	1m 20s		branch-2 passed
+1 💚	javadoc	0m 46s		branch-2 passed
+1 💚	shadedjars	6m 18s		branch has no errors when building our shaded downstream artifacts.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 16s		Maven dependency ordering for patch
+1 💚	mvninstall	3m 7s		the patch passed
+1 💚	compile	1m 17s		the patch passed
+1 💚	javac	1m 17s		the patch passed
+1 💚	javadoc	0m 45s		the patch passed
+1 💚	shadedjars	6m 17s		patch has no errors when building our shaded downstream artifacts.
			_ Other Tests _
+1 💚	unit	8m 17s		hbase-client in the patch passed.
-1 ❌	unit	204m 52s	/patch-unit-hbase-server.txt	hbase-server in the patch failed.
		242m 18s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR	#7079
Optional Tests	javac javadoc unit compile shadedjars
uname	Linux 6ed088a2355a 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	branch-2 / `6caae7a`
Default Java	Eclipse Adoptium-17.0.11+9
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/testReport/
Max. process+thread count	4391 (vs. ulimit of 30000)
modules	C: hbase-client hbase-server U: .
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/console
versions	git=2.34.1 maven=3.9.8
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2025-06-08T22:03:28Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	3m 11s		Docker mode activated.
-0 ⚠️	yetus	0m 4s		Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
			_ Prechecks _
			_ branch-2 Compile Tests _
+0 🆗	mvndep	0m 10s		Maven dependency ordering for branch
+1 💚	mvninstall	3m 6s		branch-2 passed
+1 💚	compile	1m 13s		branch-2 passed
+1 💚	javadoc	0m 49s		branch-2 passed
+1 💚	shadedjars	5m 53s		branch has no errors when building our shaded downstream artifacts.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 13s		Maven dependency ordering for patch
+1 💚	mvninstall	2m 59s		the patch passed
+1 💚	compile	1m 13s		the patch passed
+1 💚	javac	1m 13s		the patch passed
+1 💚	javadoc	0m 48s		the patch passed
+1 💚	shadedjars	5m 47s		patch has no errors when building our shaded downstream artifacts.
			_ Other Tests _
+1 💚	unit	8m 20s		hbase-client in the patch passed.
-1 ❌	unit	227m 50s	/patch-unit-hbase-server.txt	hbase-server in the patch failed.
		266m 24s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR	#7079
Optional Tests	javac javadoc unit compile shadedjars
uname	Linux 04bc06e2c740 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	branch-2 / `6caae7a`
Default Java	Temurin-1.8.0_412-b08
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/testReport/
Max. process+thread count	4349 (vs. ulimit of 30000)
modules	C: hbase-client hbase-server U: .
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/console
versions	git=2.34.1 maven=3.9.8
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2025-06-08T22:12:31Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 46s		Docker mode activated.
-0 ⚠️	yetus	0m 5s		Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
			_ Prechecks _
			_ branch-2 Compile Tests _
+0 🆗	mvndep	0m 14s		Maven dependency ordering for branch
+1 💚	mvninstall	3m 22s		branch-2 passed
+1 💚	compile	1m 12s		branch-2 passed
+1 💚	javadoc	0m 44s		branch-2 passed
+1 💚	shadedjars	6m 39s		branch has no errors when building our shaded downstream artifacts.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 17s		Maven dependency ordering for patch
+1 💚	mvninstall	3m 11s		the patch passed
+1 💚	compile	1m 12s		the patch passed
+1 💚	javac	1m 12s		the patch passed
+1 💚	javadoc	0m 44s		the patch passed
+1 💚	shadedjars	6m 39s		patch has no errors when building our shaded downstream artifacts.
			_ Other Tests _
+1 💚	unit	8m 22s		hbase-client in the patch passed.
-1 ❌	unit	238m 4s	/patch-unit-hbase-server.txt	hbase-server in the patch failed.
		276m 41s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR	#7079
Optional Tests	javac javadoc unit compile shadedjars
uname	Linux 31fa1399831c 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	branch-2 / `6caae7a`
Default Java	Eclipse Adoptium-11.0.23+9
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/testReport/
Max. process+thread count	4429 (vs. ulimit of 30000)
modules	C: hbase-client hbase-server U: .
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7079/1/console
versions	git=2.34.1 maven=3.9.8
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

droudnitsky · 2025-06-14T15:56:48Z

hbase-server test failures do not look related

droudnitsky · 2025-06-14T16:36:39Z

hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java

    for (Action action : currentActions) {
      if (isOperationTimeoutExceeded()) {
-        String message = numAttempt == 1


The logic here has been preserved exactly inside failIncompleteActionsWithOpTimeout. I opted to move it into a new method because groupAndSendMultiAction is already quite long and complex, and we need to add more logic to it to handle this bug. I believe it's better to do this timeout handling inside a separate, clearly named method with a docstring.

droudnitsky · 2025-06-14T16:40:11Z

hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java

+      boolean actionAlreadyFailed =
+        locateRegionFailedActions != null && locateRegionFailedActions.stream().anyMatch(
+          failedAction -> failedAction.getOriginalIndex() == actionToFail.getOriginalIndex()
+            && failedAction.getReplicaId() == actionToFail.getReplicaId());
+      if (!actionAlreadyFailed) {


The logic for avoiding the assertion error is here; the rest of the method is existing logic from groupAndSendMultiAction.

Copilot

Pull Request Overview

This PR fixes a bug where actions that already failed due to location resolution are being double-failed during a batch operation timeout, causing the actions in progress counter to go negative.

Added a new unit test to validate that actions with location failures aren’t double-failed.
Refactored the timeout handling in AsyncRequestFutureImpl.java by introducing a helper method that excludes already failed actions from being failed again.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientOperationTimeout.java	Adds a new test case to validate correct handling of operation timeout with mixed action failures.
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java	Introduces the failIncompleteActionsWithOpTimeout method and updates logic to avoid double failing actions.

Copilot · 2025-06-21T14:03:57Z

hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestClientOperationTimeout.java

+   * decremented properly for all actions, see last catch block
+   */
+  @Test
+  public void testMultiOperationTimoutWithLocationError() throws IOException, InterruptedException {


There is a typo in the test method name 'testMultiOperationTimoutWithLocationError'. Consider renaming it to 'testMultiOperationTimeoutWithLocationError' for clarity.

Suggested change

-  public void testMultiOperationTimoutWithLocationError() throws IOException, InterruptedException {
+  public void testMultiOperationTimeoutWithLocationError() throws IOException, InterruptedException {

Copilot · 2025-06-21T14:03:57Z

hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java

+        if (locateRegionFailedActions == null) {
+          locateRegionFailedActions = new ArrayList<>(1);
+        }


[nitpick] The null check and initialization for 'locateRegionFailedActions' is repeated in multiple places. Consider extracting this logic into a helper method to reduce duplication and improve maintainability.

Suggested change

-        if (locateRegionFailedActions == null) {
-          locateRegionFailedActions = new ArrayList<>(1);
-        }
+        locateRegionFailedActions = initializeIfNull(locateRegionFailedActions);
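
The suggestion names initializeIfNull without defining it. One possible shape for such a helper is sketched below; the enclosing class name, the generic signature, and the initial capacity of 1 (taken from the snippet above) are assumptions, and nothing here is part of the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the name initializeIfNull comes from the review
// suggestion, and everything else is an assumption for illustration.
class LazyListInit {

  /** Returns the given list, or a new small list if it is null. */
  static <T> List<T> initializeIfNull(List<T> list) {
    return list != null ? list : new ArrayList<>(1);
  }

  public static void main(String[] args) {
    List<String> failed = null;
    failed = initializeIfNull(failed);
    failed.add("action-0");
    // A second call returns the same instance, so existing entries are kept.
    System.out.println(initializeIfNull(failed).size()); // prints 1
  }
}
```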

Apache9 requested a review from Copilot June 21, 2025 14:03
