HDDS-1982 Extend SCMNodeManager to support decommission and maintenance states #1344
Conversation
/label ozone
💔 -1 overall
This message was automatically generated.
This got a few test failures. TestSCMNodeMetrics was a legitimate failure; I have fixed it. TestSecureContainerServer.testClientServerRatisGrpc() was failing on trunk, but has now been fixed. TestBlockOutputStreamWithFailures.testWatchForCommitDatanodeFailure() seems flaky; it has passed and failed a few times locally.
It looks amazing; much cleaner and far better than I thought it would be. I can emphatically +1 this. You are going in the right direction. Thank you for doing this so thoughtfully.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeStateManager.java
nodeOpStateSM.addTransition(
    NodeOperationalState.ENTERING_MAINTENANCE,
    NodeOperationalState.IN_MAINTENANCE,
    NodeOperationStateEvent.ENTER_MAINTENANCE);
From an English point of view, this is slightly confusing. But I see why :)
NodeOperationStateEvent.ENTER_MAINTENANCE);
nodeOpStateSM.addTransition(
    NodeOperationalState.IN_MAINTENANCE, NodeOperationalState.IN_SERVICE,
    NodeOperationStateEvent.RETURN_TO_SERVICE);
How do we handle the timeout edge? Maintenance might time out -- that is, I put the node into maintenance for one day and forget about it. Or is that handled outside the state machine?
I hadn't considered where to store that as yet. Probably it will be outside of the state machine, but need to consider where it fits in. Perhaps in NodeStatus, but that would change that object from being immutable, to carrying a time.
We will need some sort of decommission / maintenance mode monitor, probably separate from the heartbeat monitor. The decommission monitor will need to check when all blocks are replicated etc, so it could also keep track of the node maintenance timeout and hence switch the node to 'IN_SERVICE + DEAD' if it is dead and the timeout expires.
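As a rough sketch of what such a monitor check might look like -- all names here (`MaintenanceMonitor`, `maintenanceEndTimes`) are hypothetical illustrations, not code from this patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track per-node maintenance deadlines outside the
// state machine; a periodic check returns the nodes whose window expired,
// and the caller would fire RETURN_TO_SERVICE on the state machine for each.
public class MaintenanceMonitor {
  // datanode UUID -> epoch millis when maintenance should end
  private final Map<UUID, Long> maintenanceEndTimes = new ConcurrentHashMap<>();

  public void startMaintenance(UUID node, long durationMillis) {
    maintenanceEndTimes.put(node, System.currentTimeMillis() + durationMillis);
  }

  // Called periodically, e.g. from the same thread that checks replication.
  public List<UUID> expiredNodes(long now) {
    List<UUID> expired = new ArrayList<>();
    for (Map.Entry<UUID, Long> e : maintenanceEndTimes.entrySet()) {
      if (e.getValue() <= now) {
        expired.add(e.getKey());
      }
    }
    // Remove expired entries so the event is only fired once per node.
    expired.forEach(maintenanceEndTimes::remove);
    return expired;
  }
}
```

This keeps the timeout bookkeeping out of both NodeStatus and the state machine, at the cost of a separate data structure to maintain.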
Along with your consideration, do we need an edge called Timeout that leads from IN_MAINTENANCE to IN_SERVICE? Or do you plan to send a RETURN_TO_SERVICE event when there is a timeout? Either works; I was wondering if we should capture the timeout edge in the state machine at all.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeStateManager.java
updateNodeState(node, healthyNodeCondition, state,
    NodeLifeCycleEvent.RESURRECT);
break;
default:
}
Not sure why we need this loop change, but it does make code reading simpler.
This loop didn't strictly need to change here, but it was effectively a double loop when it didn't need to be, and was doing extra lookups from the NodeStateMap, so this makes it cleaner to read and slightly more efficient too.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeStatus.java
return new NodeStatus(HddsProtos.NodeOperationalState.IN_SERVICE,
    HddsProtos.NodeState.DEAD);
}
I am presuming that you have to define the whole cross product at some point, but right now this is all we need?
Yes. I got tired of typing the whole new NodeStatus(...) and decided to try adding the static methods. It definitely makes the code cleaner, but the cross product worries me. At the moment it's only 5 * 3 = 15 states, but what if we add a third status or a couple more states? The number of helper methods will get out of control. We can see how it develops, I guess.
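For illustration, the static-helper pattern being discussed looks roughly like this -- a simplified stand-in for the real NodeStatus class, with only two of the helpers shown:

```java
// Simplified sketch of a NodeStatus value class with static factory
// helpers for common (operational state, health) pairs. The enum values
// mirror the ones discussed in this PR; the class itself is a stand-in,
// not the actual patch code.
public final class NodeStatus {
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
                 ENTERING_MAINTENANCE, IN_MAINTENANCE }
  enum Health { HEALTHY, STALE, DEAD }

  private final OpState opState;
  private final Health health;

  private NodeStatus(OpState opState, Health health) {
    this.opState = opState;
    this.health = health;
  }

  // Static helpers avoid typing "new NodeStatus(...)" everywhere, at the
  // cost of potentially one method per state pair (the cross product).
  public static NodeStatus inServiceHealthy() {
    return new NodeStatus(OpState.IN_SERVICE, Health.HEALTHY);
  }

  public static NodeStatus inServiceDead() {
    return new NodeStatus(OpState.IN_SERVICE, Health.DEAD);
  }

  public OpState getOperationalState() { return opState; }
  public Health getHealth() { return health; }
}
```

Defining helpers only for the frequently used pairs, rather than all 15, keeps the method count manageable.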
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java
@@ -185,7 +190,7 @@ public int getNodeCount(NodeState nodestate) {
@Override
public NodeState getNodeState(DatanodeDetails datanodeDetails) {
We might want to write an alternate version which takes the operational status too, since these calls are internal. Again, not something that needs to be done in this patch. I am just writing down things as I see them. Please don't treat any of my suggestions as code review comments; "something that might be useful in the long run" is more appropriate.
Yes, the 'external interface' of SCMNodeManager will need to change but I want to get these changes to be good internally before we push them up the stack.
Thanks for taking the time to review this WIP. Glad to hear this is going in the correct direction so I will look to tidy things up and then we can consider the next step.
Overall the change looks pretty good, +1 on the approach.
@@ -43,7 +45,7 @@
/**
 * Represents the current state of node.
 */
private final ConcurrentHashMap<NodeState, Set<UUID>> stateMap;
private final ConcurrentHashMap<UUID, NodeStatus> stateMap;
stateMap is no longer required; we can move NodeStatus inside DatanodeInfo.
Do you think it makes sense to have a field inside DatanodeInfo of type NodeStatus, so we can always pass the states around as a pair, or should we add two individual fields to DatanodeInfo - nodeHealth and nodeOperationalState?
Also, one other thing to consider is that nodeStateMap originally kept lists of healthy, stale and dead nodes, so it was possible to quickly return all nodes in a given state. Now, however, we need to iterate over the whole list to find those nodes. One reason for this is that we have 15 different states now instead of 3. If we move nodeStatus into DatanodeInfo, it would be more difficult to optimise this later if needed, but it would simplify things if we simply remove this stateMap.
It is better to have NodeStatus inside DatanodeInfo rather than having two separate fields.
Yes, stateMap helped us to easily get the list/count of nodes in a specific state, but with the current changes it is not straightforward to maintain a state -> list-of-nodes mapping. In any case we will be iterating over all the available nodes to get the list of nodes in a given state.
The number of nodes in a cluster should not go beyond 3-4 orders of magnitude. We can re-visit and optimize this if we run into any performance issue.
Even if you have 15x states, the number of nodes is small: if you have 100 nodes, there are only 1500 states, and if you have 1000 nodes, it is 15000 states. It is still trivial to keep these in memory. Here is the real kicker: just like we decided not to write all cross products for the NodeStatus static functions, we will end up needing lists of only the frequently accessed patterns (in my mind that would be (IN_SERVICE, HEALTHY)). All other node queries can be retrieved by iterating the lists as needed.
Just a note: originally DatanodeInfo was based on the HDFS code. Then I think we copied and created our own structure. At this point, diverging should not be a big deal, is what I think.
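To make the trade-off concrete, here is a rough, hypothetical sketch of answering a state query by iterating all nodes instead of maintaining per-state lists (the class and field names are illustrative, not from the patch):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: with the status stored per node, a state query is a
// single pass over all nodes rather than a lookup in a state -> nodes map.
// At a few thousand nodes this scan is cheap.
public class NodeStateQuery {
  enum OpState { IN_SERVICE, IN_MAINTENANCE }
  enum Health { HEALTHY, STALE, DEAD }

  private final Map<UUID, OpState> opStates = new HashMap<>();
  private final Map<UUID, Health> healths = new HashMap<>();

  public void addNode(UUID id, OpState op, Health health) {
    opStates.put(id, op);
    healths.put(id, health);
  }

  // Count nodes matching the given (opState, health) pair by iteration.
  public int count(OpState op, Health health) {
    int n = 0;
    for (UUID id : opStates.keySet()) {
      if (opStates.get(id) == op && healths.get(id) == health) {
        n++;
      }
    }
    return n;
  }
}
```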
@@ -417,9 +451,12 @@ private SCMNodeStat getNodeStatInternal(DatanodeDetails datanodeDetails) {

@Override
public Map<String, Integer> getNodeCount() {
  // TODO - This does not consider decom, maint etc.
  Map<String, Integer> nodeCountMap = new HashMap<String, Integer>();
Why not <NodeState, Integer>? It makes it easier to consume for the caller in my opinion.
The existing code had Map<String, Integer>, but I agree it would be better with <NodeState, Integer> or <NodeStatus, Integer>. I plan to leave this as is for now, as this method is used only for JMX right now, and I plan to split that out into a separate change via HDDS-2113 as there are some open questions there.
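For comparison, an enum-keyed version of the counting idea could be sketched like this; this is a hypothetical illustration (the actual method keeps Map<String, Integer> for JMX, as noted above):

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of getNodeCount() keyed by the enum instead of by
// String, using EnumMap for a compact, type-safe result.
public class NodeCounter {
  enum NodeState { HEALTHY, STALE, DEAD }

  public static Map<NodeState, Integer> getNodeCount(List<NodeState> nodes) {
    Map<NodeState, Integer> counts = new EnumMap<>(NodeState.class);
    // Initialise every state so callers always see all keys, even at zero.
    for (NodeState s : NodeState.values()) {
      counts.put(s, 0);
    }
    for (NodeState s : nodes) {
      counts.merge(s, 1, Integer::sum);
    }
    return counts;
  }
}
```

Callers then consume typed keys directly instead of parsing strings, which is the consumption benefit raised in the comment above.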
*/
private List<DatanodeInfo> filterNodes(
    NodeOperationalState opState, NodeState health) {
  if (opState != null && health != null) {
Can we write lines 395-440 with one simple stream().filter? Nothing wrong with the code itself; just a thought.
I had not really looked into the Streams API before, but I changed the code to use streams and it does make it easier to follow, so I have made this change. I still kept the if statements at the start of the method: if both params are null we can just return the entire list with no searching, and if both are non-null we can search using the NodeStatus, which should be slightly more efficient.
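The stream-based shape being described could be sketched roughly like this -- a standalone approximation where DatanodeInfo is a stub, not the real class, and null for either parameter means "match any value of that dimension":

```java
import java.util.List;
import java.util.stream.Collectors;

// Standalone approximation of a stream-based filterNodes().
public class NodeFilter {
  enum OpState { IN_SERVICE, IN_MAINTENANCE }
  enum Health { HEALTHY, STALE, DEAD }

  // Stub standing in for the real DatanodeInfo class.
  record DatanodeInfo(OpState opState, Health health) { }

  public static List<DatanodeInfo> filterNodes(
      List<DatanodeInfo> nodes, OpState opState, Health health) {
    if (opState == null && health == null) {
      return nodes;  // nothing to filter; return everything unscanned
    }
    return nodes.stream()
        .filter(n -> opState == null || n.opState() == opState)
        .filter(n -> health == null || n.health() == health)
        .collect(Collectors.toList());
  }
}
```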
Please be aware that stream.filter kinds of patterns have a significant overhead over a normal for loop. If this code is going to be in any sort of critical path, it is better for the code to stay a normal for loop.
Please see some fixes made by Todd Lipcon in HDFS because of this issue.
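For reference, the plain-loop equivalent of the same null-means-any filter (again a hypothetical sketch with a stub DatanodeInfo) avoids the per-element lambda and stream machinery entirely:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the same filter written with a plain for loop,
// which avoids stream/lambda overhead on hot paths.
public class PlainNodeFilter {
  enum OpState { IN_SERVICE, IN_MAINTENANCE }
  enum Health { HEALTHY, STALE, DEAD }

  // Stub standing in for the real DatanodeInfo class.
  record DatanodeInfo(OpState opState, Health health) { }

  public static List<DatanodeInfo> filterNodes(
      List<DatanodeInfo> nodes, OpState opState, Health health) {
    List<DatanodeInfo> result = new ArrayList<>();
    for (DatanodeInfo n : nodes) {
      if ((opState == null || n.opState() == opState)
          && (health == null || n.health() == health)) {
        result.add(n);
      }
    }
    return result;
  }
}
```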
The failing unit test passes locally, and the integration tests which failed are flaky, I think. I will push the change to fix the style issue and see how the re-test goes.
LGTM. If I understood well, everybody agreed with this approach and AFAIK all of the comments are addressed. @anuengineer @nandakumar131 please let us know if you have any further comments. I am planning to commit it tomorrow if there are no more objections. I think we can commit it to trunk; I am not sure if we need a separate branch (let me know if you prefer a feature branch).
@sodonnel Can you please rebase and push? (Some of the integration tests are fixed on trunk; we can double-check the test results with a new, updated push.)
…rotobuf definition
…tatus) rather than just NodeState. Still WIP with several TODOs
…code to use the new NodeStatus class up to the external boundary where SCMNodeManager interfaces with other components
…state2eventmap so events are only fired for IN_SERVICE transitions. Prevent updating the nodeStatus directly in NodeStateMap to avoid race conditions. As NodeStatus contains two states, it could lead to lost updates. Instead allow each state to be updated separately under the write lock.
…d the stateMap from NodeStateMap which is no longer needed
…e retrieved from the NodeStateMap
…e NodeStatus and fixed all the compile errors resulting from that
… it was causing compile errors
Let us commit this into a branch, not into trunk. Thanks.
@nandakumar131 @elek @swagle Thank you for all the comments and discussion. @sodonnel Thank you for the contribution. I have committed this patch to the HDDS-1880-Decom branch.
Remove the existing decommission states from the protobuf definition.
At this stage, this PR is really a test to see if the build passes with these states removed.