-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-28342 Decommissioned hosts should be rejected by the HMaster #5667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Starting here -- I haven't read through your tests yet.
Summary:
- nothing in zookeeper is permanent
- an operator should be able to remove a host from the draining list, however it's specified -- we need APIs for this.
- the "match host name only" boolean is strange. I think what you actually want here is a unix-style field mask, specifying which of the ServerName triple are omitted when considering a match.
- don't swallow exceptions
* Region unloading is asynchronous. | ||
* @param servers The list of servers to decommission. | ||
* @param offload True to offload the regions from the decommissioned servers | ||
* @param matchHostNameOnly True to prevent the hostname from ever joining again, regardless of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This language is a bit exaggerated, right? The host can join again, just remove this entry. You can tone it back and say something like "True to reject this host regardless of its startcode or port."
@@ -2509,10 +2509,21 @@ public CompletableFuture<String> getLocks() { | |||
@Override | |||
public CompletableFuture<Void> decommissionRegionServers(List<ServerName> servers, | |||
boolean offload) { | |||
// By default, when we decommission a RegionServer we don't mark the hostname as permanently | |||
// decommissioned and instead mark the server location (host + port + startCode) as such | |||
boolean matchHostNameOnly = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By calling down to the higher arity method you're duplicating the default decision that's already described in protobuf. Doing so is not a terrible thing, but it is redundant. I'd say, leave the default in one place if at all possible.
@@ -1558,8 +1558,14 @@ public static GetQuotaStatesRequest buildGetQuotaStatesRequest() { | |||
|
|||
public static DecommissionRegionServersRequest | |||
buildDecommissionRegionServersRequest(List<ServerName> servers, boolean offload) { | |||
return RequestConverter.buildDecommissionRegionServersRequest(servers, offload, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about duplicating the default state.
@@ -696,6 +696,7 @@ message ListDecommissionedRegionServersResponse { | |||
message DecommissionRegionServersRequest { | |||
repeated ServerName server_name = 1; | |||
required bool offload = 2; | |||
optional bool match_host_name_only = 3 [default = false]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments for maintainers in both proto files that this default value should match the other.
List<String> servers = | ||
ZKUtil.listChildrenAndWatchThem(watcher, watcher.getZNodePaths().drainingZNode); | ||
add(servers); | ||
|
||
if (servers != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit -- it's safer to put the null-check in the method than to burden all callers with the responsibility.
byte[] data = DrainedZNodeServerData.newBuilder().setMatchHostNameOnly(matchHostNameOnly) | ||
.build().toByteArray(); | ||
// Create a node with binary data | ||
ZKUtil.createAndFailSilent(getZooKeeper(), node, data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not fail silently. If it cannot create the znode that marks the decommissioning , it should fail the RPC. Let the exception throw -- the existing catch clause appears to handle it appropriately.
try { | ||
byte[] rawData = ZKUtil.getData(getZooKeeper(), node); | ||
|
||
// Check if the data is present for backwards compatibility, some nodes may not have it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most nodes won't have this data as the common use-case is to drain by hostname,port,startcode.
try { | ||
ZKUtil.deleteNodeFailSilent(getZooKeeper(), node); | ||
if (shouldBePermanentlyDecommissioned) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a variant of this method that allows the operator to remove a server by hostname only to be removed from the decommissioning list. Unless I'm missing something, once a server is added to the list with this flag enabled, it cannot be removed again.
@@ -696,6 +696,7 @@ message ListDecommissionedRegionServersResponse { | |||
message DecommissionRegionServersRequest { | |||
repeated ServerName server_name = 1; | |||
required bool offload = 2; | |||
optional bool match_host_name_only = 3 [default = false]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we want to prescribe a default value in the protobuf. Instead, the service should interpret the absence of a value appropriately.
if (!this.isServerOnline(sn)) { | ||
LOG.warn("Server " + sn + " is not currently online. " | ||
+ "Removing from draining list anyway, as requested."); | ||
LOG.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't clean up all these logger usages in files that you don't otherwise touch. Lets fix them up in a dedicated PR.
Please do reviewers a favor and name the PR following the project conventions -- "jira_id jira_title" |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Closed this PR in favor of a smaller and simpler solution. I will open a different PR with a fresh branch in a little bit. |
Fixes: HBASE-28342
This PR adds support to specifying whether the host should be rejected by the HMaster when it gets decommissioned via the Admin API. If a host is specified as such, the HMaster will ignore the default path of recommissioning it even if it hears back from a RegionServer location (host + port + startcode) in the future as long as the hostname is marked as decommissioned.
TODO: This PR also adds an API for removing a regionserver from the drained server pool which overrides the flag hostname in the draining znode.