-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-27781 Fix case of action counter assertion error in handling of batch operation timeout exceeded #6144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Two hbase-server tests that are not relevant to my changes are timing out : pushed empty commit to retrigger jenkins tests |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
da48212
to
a483bfe
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
In the case of AsyncRequestFutureImpl with null results/no result tracking , I cannot see a way of checking if a given action was completed through an external method like isActionComplete, I think my current approach would have some issues in certain scenarios when results are null, need to understand AsyncRequestFutureImpl, maybe we can keep track of the actions failed locally in |
I think keeping track of the actions failed locally in groupAndSendMultiAction as we loop over currentActions and avoiding failing an already locally failed action a second time if operation timeout is exceeded there should work for both null results and with replica actions. Working on those changes. |
a483bfe
to
b61e51e
Compare
@bbeaudreault may I kindly ask you to review this fix if you have the time |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
b61e51e
to
2838c81
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
2838c81
to
e379b25
Compare
This comment has been minimized.
This comment has been minimized.
e379b25
to
225e68c
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
… batch operation timeout exceeded
225e68c
to
6caae7a
Compare
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Going to open a new PR with clean history in attempt to be more review friendly |
https://issues.apache.org/jira/browse/HBASE-27781
My understanding of AsynRequestFutureImpl is that the only actions passed to
groupAndSendMultiAction
which may have been completed already and we already decremented action counter for by the time we get to the operation timeout exceeded handling in that method are actions which were failed locally ingroupAndSendMultiAction
due to location error - thefindAllLocationsOrFail
method sets the action error and decrements the action counter on location resolution failure - we just need to avoid setting the error again on actions where location resolution failed when we get to this loop to avoid decrementing twice for those actions.For replica calls,
manageLocationError
/manageError
is responsible for result/error setting and preventing double action counter decrementing.Test case reproduces the assertion error when one action in the batch fails on location error and then operation timeout is exceeded.