-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-25815 RSGroupBasedLoadBalancer online status never updates after being set to true for the first time #3606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…r being set to true for the first time
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One question.
@@ -825,6 +827,20 @@ private void createRSGroupTable() throws IOException { | |||
} | |||
|
|||
public boolean isOnline() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often is this check run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@saintstack It's called by RSGroupBasedLoadBalancer#balanceCluster
(here), RSGroupInfoManagerImpl#refresh
(here), and RSGroupInfoManagerImpl#flushConfig
(here) so I would think semi-frequently? Not sure how often balanceCluster
gets called, but refresh
gets called upon RSGroupInfoManagerImpl
startup and flushConfig
gets called every time we add/remove servers or tables or change the rsgroups in any way.
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@@ -661,12 +663,14 @@ private synchronized void flushConfig(Map<String, RSGroupInfo> newGroupMap) thro | |||
return; | |||
} | |||
|
|||
// Make changes visible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because generally it is the case to update in-memory before persistent storage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@caroliney14 even if we don't change this order, are we still good with the main problem? I was wondering if this particular order of in-memory vs persistent update could be taken up in follow-up PR as well if we are good with the latest change in isOnline()
method.
@@ -825,6 +829,20 @@ private void createRSGroupTable() throws IOException { | |||
} | |||
|
|||
public boolean isOnline() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the intention here is we will not come back to offline mode after online?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that was the original intention, but it seems misleading/strange. We have encountered errors in prod resulting from trying to flush to hbase:rsgroup
when the table became unavailable after the initial check and set to "online" -- in that case, we should go the offline path, right? (Albeit this was in HBase 1, so I'm not 100% sure if HBase 2+ hasn't fixed this -- but from my understanding of the code, it hasn't?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this part of code is written by me but I can not recall if it was already like this or I changed the implementation to make it only online once. Give some time to check code for different branches...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it was not me. It was like this when we first introduced this class in HBASE-6721. Even on branch-1 backport, the implementation is just return the isOnline flag. So if you want to change the implementation, please explain a bit about the reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that if we do not periodically update the online
status to reflect the availability of the hbase:rsgroup
table, we could become blocked waiting on a flush to the hbase:rsgroup
table when it can't be accessed (e.g. it's stuck in transition, offline, the rs hosting it has queueing, etc.). Each rsgroup functionality (add, move servers, move tables, remove, etc.) is synchronized
, and furthermore the multiMutate
function which does the persisting to hbase:rsgroup
uses Future.get
without timeout. So if hbase:rsgroup
is unavailable we will keep getting blocked until the client times out, and we will be unable to serve another rsgroup request in the meantime, when we could have exited early by checking for the availability of hbase:rsgroup
.
Instead of being blocked waiting like this, we can go through an "offline" code path. There already is an offline code path in flushConfig
which only updates the in-memory state of the default group (here), but we could also change it so that it updates in-memory state while asynchronously trying to persist it to hbase:rsgroup
in the background.
Please correct me if I misunderstood anything. What do you think about this rationale?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Apache9 any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic seems good to me. @Apache9 Would you like to verify?
💔 -1 overall
This message was automatically generated. |
2 similar comments
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@caroliney14 can you please rebase once? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one comment, let's see how QA goes after latest rebase
@@ -661,12 +663,14 @@ private synchronized void flushConfig(Map<String, RSGroupInfo> newGroupMap) thro | |||
return; | |||
} | |||
|
|||
// Make changes visible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@caroliney14 even if we don't change this order, are we still good with the main problem? I was wondering if this particular order of in-memory vs persistent update could be taken up in follow-up PR as well if we are good with the latest change in isOnline()
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see one problem here. We should remove online = true
from waitForGroupTableOnline():
RSGroupInfoManagerImpl.this.refresh(true);
online = true;
Here, refresh(true) is already taking care of updating online
flag. If it updates it to false
, we should avoid resetting it to true
.
online = true; | ||
} | ||
} catch (Exception e) { | ||
LOG.warn("Failed to read from " + RSGROUP_TABLE_NAME+ "; setting online = false"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use log placeholder {}
for table name?
@@ -825,6 +829,20 @@ private void createRSGroupTable() throws IOException { | |||
} | |||
|
|||
public boolean isOnline() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic seems good to me. @Apache9 Would you like to verify?
No description provided.