Skip to content

[YARN-6483] Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat #289

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

juanrh
Copy link

@juanrh juanrh commented Nov 7, 2017

This is an alternative approach to https://issues.apache.org/jira/browse/YARN-3224 for notifying all affected application masters when a node transitions into the DECOMMISSIONING state. This change modifies the AllocateResponse that the YARN Resource Manager uses to respond to heartbeat request from application masters, to add any node that has transitioned to DECOMMISSIONING state since the last heartbeat to the list of NodeReport objects that is part of the AllocateResponse object. We also add a new field to each NodeReport to add the decommission timeout for DECOMMISSIONING nodes, thus covering the same functionality of the original proposal in YARN-3224.

Juan Rodriguez Hortala added 4 commits October 27, 2017 16:59
adapt test to decommission timeout checks being independent
of received heartbeats
in this version that is the only way to specify a timeout in the
excludes file
replace dynamic conf by using the configuration
passed by AdminService
}

@Private
@Unstable
public static NodeReport newInstance(NodeId nodeId, NodeState nodeState,
String httpAddress, String rackName, Resource used, Resource capability,
int numContainers, String healthReport, long lastHealthReportTime,
Set<String> nodeLabels) {
Set<String> nodeLabels, Integer decommissioningTimeout) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use int ? Am not confortable using Integer - given the sometimes arbitrary boxing/unboxing rules.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Integer comes from RMNode.getDecommissioningTimeout() that was already returning an Integer before this patch, because only nodes in DECOMMISSIONING state have an associated decommission timeout, so null is used to express absent timeout. In this patch RMNode.getDecommissioningTimeout() is used in DefaultAMSProcessor.handleNodeUpdates to get the argument decommissioningTimeout for BuilderUtils.newNodeReport. If we use a int for decommissioningTimeout in NodeReport.newInstance I think we should also use an int for the same argument in BuilderUtils.newNodeReport for uniformity, which implies a conversion from null to -1 in DefaultAMSProcessor.handleNodeUpdates.

So I think we should either keep using Integer decommissioningTimeout everywhere, enconding absent timeout with null, or use int decommissioningTimeout everywhere, enconding absent timeout with a negative timeout, which is coherent with message NodeReportProto using an unsigned int for decommissioning_timeout. What do you think about these 2 alternatives?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - sorry, just noticed that. maybe stick with Integer (although I dont like it much). Maybe we can raise another JIRA to fix it properly via your second suggestion.

// could take any required actions.
rmNode.context.getDispatcher().getEventHandler().handle(
new NodesListManagerEvent(
NodesListManagerEventType.NODE_USABLE, rmNode));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm.. don't think we should be sending NODE_USABLE event here.. since technically, it is not usable.
Maybe consider creating a new NODE_DECOMMISSIONING event ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't very sure about using NODE_USABLE, but while I was making the changes to follow your suggestion, I have found that in the current code TestRMNodeTransitions.testResourceUpdateOnDecommissioningNode is asserting that NodesListManagerEventType.NODE_USABLE is expected for a node that transitions to DECOMMISSIONING. Also, NodesListManagerEventType is transformed into the corresponding RMAppNodeUpdateType in NodesListManager.handle to build a RMAppNodeUpdateEvent that is processed in RMAppImpl.processNodeUpdate which just uses the RMAppNodeUpdateType for logging.

So it looks like it is ok to use NodesListManagerEventType.NODE_USABLE for nodes in decommissioning state. Do you still think it's worth adding some additional value for NodesListManagerEventType and RMAppNodeUpdateType?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should make intentions explicit - having a separate event type might make the code cleaner and easier to follow - rather than overloading. It could be that the assumption in the testcase is wrong (will have to double check though), in which case - it is perfectly alright to modify the testcasse with the new event.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I have added a new commit for adding a new value NODE_DECOMMISSIONING for NodesListManagerEventType

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - looking at the patch.. can you also attach a consolidated patch on the JIRA ? So as to kick Jenkins.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just attached the patch. Thanks a lot for taking a look!

@juanrh
Copy link
Author

juanrh commented Nov 27, 2017

Pushed in b46ca7e

@juanrh juanrh closed this Nov 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants