[YARN-6483] Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat #289

juanrh · 2017-11-07T18:59:48Z

This is an alternative approach to https://issues.apache.org/jira/browse/YARN-3224 for notifying all affected application masters when a node transitions into the DECOMMISSIONING state. This change modifies the AllocateResponse that the YARN Resource Manager uses to respond to heartbeat requests from application masters: any node that has transitioned to the DECOMMISSIONING state since the last heartbeat is added to the list of NodeReport objects that is part of the AllocateResponse. We also add a new field to each NodeReport that carries the decommission timeout for DECOMMISSIONING nodes, thus covering the same functionality as the original proposal in YARN-3224.
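To illustrate what this means on the application master side, here is a minimal sketch of filtering the updated nodes in a heartbeat response for those in DECOMMISSIONING state. The `NodeState` and `NodeReport` types below are simplified stand-ins written for this example, not the real YARN classes, and `decommissioningNodeIds` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DecommissioningNodesSketch {

  /** Simplified stand-in for YARN's NodeState (assumption: only the states used here). */
  enum NodeState { RUNNING, DECOMMISSIONING }

  /** Simplified stand-in for YARN's NodeReport, reduced to three fields. */
  static final class NodeReport {
    final String nodeId;
    final NodeState state;
    final Integer decommissioningTimeout; // null when the node is not DECOMMISSIONING

    NodeReport(String nodeId, NodeState state, Integer decommissioningTimeout) {
      this.nodeId = nodeId;
      this.state = state;
      this.decommissioningTimeout = decommissioningTimeout;
    }
  }

  /** Hypothetical helper: ids of the updated nodes that are in DECOMMISSIONING state. */
  static List<String> decommissioningNodeIds(List<NodeReport> updatedNodes) {
    List<String> ids = new ArrayList<>();
    for (NodeReport report : updatedNodes) {
      if (report.state == NodeState.DECOMMISSIONING) {
        ids.add(report.nodeId);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    // Pretend these reports arrived in the response to one heartbeat.
    List<NodeReport> updatedNodes = Arrays.asList(
        new NodeReport("host1:8041", NodeState.RUNNING, null),
        new NodeReport("host2:8041", NodeState.DECOMMISSIONING, 3600));
    System.out.println(decommissioningNodeIds(updatedNodes));
  }
}
```

An application master could use such a list to stop requesting containers on those nodes before the timeout expires; what it actually does with the information is up to the application.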

xslogic · 2017-11-15T20:24:20Z

...hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeReport.java

  }

  @Private
  @Unstable
  public static NodeReport newInstance(NodeId nodeId, NodeState nodeState,
      String httpAddress, String rackName, Resource used, Resource capability,
      int numContainers, String healthReport, long lastHealthReportTime,
-      Set<String> nodeLabels) {
+      Set<String> nodeLabels, Integer decommissioningTimeout) {


Can we use int? Am not comfortable using Integer - given the sometimes arbitrary boxing/unboxing rules.

juanrh · 2017-11-15

This Integer comes from RMNode.getDecommissioningTimeout(), which was already returning an Integer before this patch: only nodes in DECOMMISSIONING state have an associated decommission timeout, so null is used to express an absent timeout. In this patch RMNode.getDecommissioningTimeout() is used in DefaultAMSProcessor.handleNodeUpdates to get the decommissioningTimeout argument for BuilderUtils.newNodeReport. If we use an int for decommissioningTimeout in NodeReport.newInstance, I think we should also use an int for the same argument in BuilderUtils.newNodeReport for uniformity, which implies a conversion from null to -1 in DefaultAMSProcessor.handleNodeUpdates.

So I think we should either keep using Integer decommissioningTimeout everywhere, encoding an absent timeout with null, or use int decommissioningTimeout everywhere, encoding an absent timeout with a negative timeout, which is coherent with message NodeReportProto using an unsigned int for decommissioning_timeout. What do you think about these two alternatives?

xslogic · 2017-11-16

Yeah - sorry, just noticed that. Maybe stick with Integer (although I don't like it much). Maybe we can raise another JIRA to fix it properly via your second suggestion.
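The two encodings discussed in this thread can be sketched as follows. This is only an illustration: the `NO_TIMEOUT` constant and the `toPrimitive` / `toBoxed` helpers are hypothetical names that do not appear in the patch. The last part of `main` shows the kind of unboxing surprise the reviewer is worried about.

```java
class DecommissionTimeoutEncoding {

  /** Hypothetical sentinel for "no timeout" under the int-everywhere alternative. */
  static final int NO_TIMEOUT = -1;

  /** Explicit null-to-sentinel conversion, so null is never unboxed. */
  static int toPrimitive(Integer timeout) {
    return timeout == null ? NO_TIMEOUT : timeout;
  }

  /** Inverse conversion: any negative value is read as "no timeout". */
  static Integer toBoxed(int timeout) {
    return timeout < 0 ? null : Integer.valueOf(timeout);
  }

  public static void main(String[] args) {
    Integer absent = null;
    System.out.println(toPrimitive(absent));
    System.out.println(toPrimitive(120));
    System.out.println(toBoxed(NO_TIMEOUT));

    // Implicit unboxing of a null Integer compiles but fails at runtime.
    try {
      int t = absent;
      System.out.println(t);
    } catch (NullPointerException e) {
      System.out.println("unboxing a null Integer throws NullPointerException");
    }
  }
}
```

Keeping the conversion in one place, as `toPrimitive` does here, is one way to avoid scattering null checks if the int alternative is ever adopted.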

xslogic · 2017-11-15T20:35:07Z

...cemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java

+      // could take any required actions.
+      rmNode.context.getDispatcher().getEventHandler().handle(
+          new NodesListManagerEvent(
+              NodesListManagerEventType.NODE_USABLE, rmNode));


Hmmm.. don't think we should be sending a NODE_USABLE event here, since technically the node is not usable.
Maybe consider creating a new NODE_DECOMMISSIONING event?

juanrh · 2017-11-15

I wasn't very sure about using NODE_USABLE, but while making the changes to follow your suggestion I found that, in the current code, TestRMNodeTransitions.testResourceUpdateOnDecommissioningNode asserts that NodesListManagerEventType.NODE_USABLE is expected for a node that transitions to DECOMMISSIONING. Also, the NodesListManagerEventType is transformed into the corresponding RMAppNodeUpdateType in NodesListManager.handle to build an RMAppNodeUpdateEvent, which is processed in RMAppImpl.processNodeUpdate, and that method uses the RMAppNodeUpdateType only for logging.

So it looks like it is OK to use NodesListManagerEventType.NODE_USABLE for nodes in DECOMMISSIONING state. Do you still think it's worth adding an additional value to NodesListManagerEventType and RMAppNodeUpdateType?

xslogic · 2017-11-16

I feel we should make intentions explicit: a separate event type might make the code cleaner and easier to follow than overloading an existing one. It could be that the assumption in the test case is wrong (will have to double check though), in which case it is perfectly alright to modify the test case to use the new event.

juanrh · 2017-11-17

Makes sense. I have added a new commit that adds a new value, NODE_DECOMMISSIONING, to NodesListManagerEventType.

xslogic · 2017-11-17

Thanks - looking at the patch. Can you also attach a consolidated patch on the JIRA, so as to kick off Jenkins?

juanrh · 2017-11-17

I have just attached the patch. Thanks a lot for taking a look!
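The outcome of this thread, a dedicated event value instead of an overloaded NODE_USABLE, can be sketched as a one-to-one mapping between the two enums mentioned above. The enums here are simplified stand-ins that reuse the names from the discussion; NODE_UNUSABLE is included as an assumption, and `toUpdateType` is a hypothetical helper, not the code in NodesListManager.handle.

```java
class NodeEventMappingSketch {

  /** Stand-in for NodesListManagerEventType (NODE_UNUSABLE is an assumption). */
  enum NodesListManagerEventType { NODE_USABLE, NODE_UNUSABLE, NODE_DECOMMISSIONING }

  /** Stand-in for RMAppNodeUpdateType, mirroring the event types one to one. */
  enum RMAppNodeUpdateType { NODE_USABLE, NODE_UNUSABLE, NODE_DECOMMISSIONING }

  /** Hypothetical helper: each event type maps to exactly one update type. */
  static RMAppNodeUpdateType toUpdateType(NodesListManagerEventType eventType) {
    switch (eventType) {
      case NODE_USABLE:
        return RMAppNodeUpdateType.NODE_USABLE;
      case NODE_UNUSABLE:
        return RMAppNodeUpdateType.NODE_UNUSABLE;
      case NODE_DECOMMISSIONING:
        return RMAppNodeUpdateType.NODE_DECOMMISSIONING;
      default:
        // Fail loudly if a new event type is added without a mapping.
        throw new IllegalArgumentException("Unhandled event type: " + eventType);
    }
  }

  public static void main(String[] args) {
    for (NodesListManagerEventType type : NodesListManagerEventType.values()) {
      System.out.println(type + " -> " + toUpdateType(type));
    }
  }
}
```

With a separate value, a consumer that only cares about decommissioning nodes can match on it directly instead of inspecting the node state behind a NODE_USABLE event.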

juanrh · 2017-11-27T19:29:21Z

Pushed in b46ca7e

Juan Rodriguez Hortala added 4 commits October 27, 2017 16:59

Notify affected Application Masters when a node enters DECOMMISSIONING state

c85b701

Add decommission timeout field to NodeReport

ed2561e

fix TestClientRMService

506f7de

adapt test to decommission timeout checks being independent of received heartbeats

use xml format for excludes files with timeouts

9a82f31

in this version that is the only way to specify a timeout in the excludes file

juanrh mentioned this pull request Nov 7, 2017

[WIP][SPARK-20628][CORE] Blacklist nodes when they transition to DECOMMISSIONING state in YARN apache/spark#19267

Closed

load dynamic timeout like in hadoop trunk

0ad0508

replace dynamic conf by using the configuration passed by AdminService

xslogic reviewed Nov 15, 2017

Juan Rodriguez Hortala added 8 commits November 17, 2017 10:49

add new NodesListManagerEventType.NODE_DECOMMISSIONING

258e29a

use it for notifying nodes transitions to DECOMMISSIONG state

add default implementation for new methods in NodeReport

d881d06

so it works ok for older versions

revert unnecessary changes in RMNodeDecommissioningEvent

3d4d0cd

add optional NodeUpdateType field to NodeReport

8705807

add assertions for NodeReport.getNodeUpdateType

cb503b0

fix boxing-unboxing bug

af2bf85

fix checkstyle errors

32c441d

fix whitespaces

880a4e6

juanrh closed this Nov 27, 2017
