YARN-8470. Fix a NPE in identifyContainersToPreemptOnNode() #416
base: trunk
I encountered this issue while running 3.1.0:

```
2018-09-10 13:42:39,437 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Container container_1536156801471_0071_01_000055 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2018-09-10 13:42:39,881 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)
2018-09-10 13:42:39,886 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down the resource manager.
2018-09-10 13:42:39,891 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: a critical thread, FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)
```

I'm guessing a better fix would be to synchronise the removal of applications, but this simple patch should be an improvement IMO.

Signed-off-by: George G <git@gg7.io>
💔 -1 overall
This message was automatically generated.
    if (app == null) {
      // e.g. "INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Container container_1536156801471_0071_01_000096 completed with event FINISHED, but corresponding RMContainer doesn't exist."
      LOG.warn("app == null, giving up in identifyContainersToPreemptOnNode()");
      return null;
Should we just `continue` instead of returning `null`, since we might still be able to find preemptable containers on this node?
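To make the difference concrete, here is a minimal, self-contained sketch of the `continue` variant. It is not the actual `FSPreemptionThread` code: the class `PreemptionSketch`, the method `selectPreemptable`, and the map-based lookup are hypothetical stand-ins, and containers and applications are plain strings. It only illustrates that skipping a container whose application has already been removed still lets the loop consider the remaining containers on the node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and types are not taken from Hadoop.
class PreemptionSketch {

    // Returns the containers on a node whose owning application is still known.
    // appByContainer is an assumed stand-in for the scheduler's lookup, which
    // may return null if the application was removed concurrently.
    static List<String> selectPreemptable(List<String> containersOnNode,
                                          Map<String, String> appByContainer) {
        List<String> preemptable = new ArrayList<>();
        for (String container : containersOnNode) {
            String app = appByContainer.get(container);
            if (app == null) {
                // The application is gone: skip this container only,
                // instead of abandoning the whole node with "return null".
                continue;
            }
            preemptable.add(container);
        }
        return preemptable;
    }

    public static void main(String[] args) {
        Map<String, String> apps = new HashMap<>();
        apps.put("c1", "appA");
        apps.put("c3", "appB");
        // "c2" has no application, so it is skipped and "c3" is still considered.
        System.out.println(selectPreemptable(Arrays.asList("c1", "c2", "c3"), apps));
    }
}
```

With `return null` in place of `continue`, the same input would give up at `c2` and never reach `c3`.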
When ZooKeeper session failures occur in a stream processor, it leaves the group (zkClient is closed) and joins the group again. The last step in that shutdown sequence is zkClient.close(). In some scenarios, it throws the following exception:

org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
    at org.I0Itec.zkclient.ZkClient.close(ZkClient.java:1278)
    at org.apache.samza.zk.ZkControllerImpl.stop(ZkControllerImpl.java:92)
    at org.apache.samza.zk.ZkJobCoordinator.stop(ZkJobCoordinator.java:141)

The existing implementation does not handle this exception, which kills the stream processor. The following code path triggers it: `StreamProcessor.stop -> ZkJobCoordinator.stop() -> zkController.stop() -> zkUtils.close`

This exception causes the integration test to fail occasionally and can cause the LocalApplicationRunner.waitForFinish method call to block indefinitely, since the success of this callback is what updates the latch state that waitForFinish needs in order to return.

Author: Shanthoosh Venkataraman <svenkataraman@linkedin.com>
Reviewers: Jagadish <jagadish@apache.org>
Closes apache#416 from shanthoosh/zk_utils_close
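The failure mode described in that commit message can be sketched as follows. This is not the Samza change itself: `ShutdownSketch`, the `Client` interface, and `stop` are hypothetical names, and a plain `RuntimeException` stands in for `ZkInterruptedException`. The sketch only shows the general pattern of catching a failure from `close()` and releasing the shutdown latch in a `finally` block so that a waiter is not left blocked.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; names are not taken from Samza or zkclient.
class ShutdownSketch {

    // Assumed stand-in for a client whose close() may throw an unchecked
    // exception when the calling thread is interrupted.
    interface Client {
        void close();
    }

    // Closes the client, logs a close failure instead of propagating it,
    // and always counts down the latch so waiters can proceed.
    // Returns true if close() completed without throwing.
    static boolean stop(Client client, CountDownLatch shutdownLatch) {
        boolean closedCleanly = true;
        try {
            client.close();
        } catch (RuntimeException e) {
            closedCleanly = false;
            System.err.println("Ignoring failure while closing client: " + e);
        } finally {
            shutdownLatch.countDown();
        }
        return closedCleanly;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        // Simulate a close() that fails the way the stack trace above does.
        boolean clean = stop(() -> {
            throw new RuntimeException(new InterruptedException());
        }, latch);
        latch.await(); // returns immediately because the latch was released
        System.out.println("closed cleanly: " + clean);
    }
}
```

Whether swallowing the exception is appropriate depends on the caller; code that needs to preserve the interrupt would also call `Thread.currentThread().interrupt()` in the catch block.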