[SPARK-21456][MESOS] Make the driver failover_timeout configurable #18674
Conversation
…ailoverTimeout. Added a unit test.
Test build #79718 has finished for PR 18674 at commit
LGTM
.doc("Amount of time in seconds that the master will wait to hear from the driver, " + | ||
"during a temporary disconnection, before tearing down all the executors.") | ||
.doubleConf | ||
.createWithDefault(0.0) |
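For context, the lines above are the tail of a config entry declaration. A minimal sketch of how the full entry is plausibly declared with Spark's internal ConfigBuilder API follows; the value name matches the DRIVER_FAILOVER_TIMEOUT.key reference later in this review, but the exact file and enclosing object are assumptions:

    // Sketch (assumed location: the Mesos config object, e.g. deploy/mesos/config.scala)
    import org.apache.spark.internal.config.ConfigBuilder

    private[spark] val DRIVER_FAILOVER_TIMEOUT =
      ConfigBuilder("spark.mesos.driver.failoverTimeout")
        .doc("Amount of time in seconds that the master will wait to hear from the driver, " +
          "during a temporary disconnection, before tearing down all the executors.")
        .doubleConf
        .createWithDefault(0.0)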
Zero means infinite time?
No, zero means 0ns. If the driver is disconnected, the Mesos master will wait 0ns for the driver to reconnect (it will tear down the framework immediately). This is the current default value.
Can you explain that in the .md file? 0 is generally used for "infinity" so it may be confusing.
@susanxhuynh What are the implications for the dispatcher? The driver runs as a task, so the dispatcher will get a task status update earlier if the task has failed, correct?
@skonto There is not much change in the dispatcher. The main difference is in the executors. First of all, if the user does not set this new config, it defaults to the old behavior (failover_timeout=0) and everything remains the same: if the driver is temporarily disconnected from the Mesos master, the master tears down the framework immediately, killing all the executors. If the user sets failover_timeout > 0 and the driver becomes disconnected from the master, the master will wait failover_timeout seconds and will not kill any executors during that time. If the driver reconnects within the timeout period, the job continues running uninterrupted.
In both cases, the dispatcher will get a task status update on the driver when it fails or finishes (no change here).
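To make the mechanism concrete: the configured value ends up on the FrameworkInfo that the driver registers with the Mesos master, which is what the master consults when deciding how long to wait after a disconnect. A minimal sketch of that wiring, assuming the Mesos Java protobuf API; the helper and its parameter names are illustrative, not the actual backend code:

    import org.apache.mesos.Protos.FrameworkInfo

    // Illustrative helper: build a FrameworkInfo carrying the configured failover timeout.
    def buildFrameworkInfo(user: String, appName: String, failoverTimeoutSecs: Double): FrameworkInfo =
      FrameworkInfo.newBuilder()
        .setUser(user)
        .setName(appName)
        .setFailoverTimeout(failoverTimeoutSecs) // 0.0 reproduces the old immediate-teardown behavior
        .build()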
LGTM. One note: while checking the docs, since this is obviously useful for the --supervise case and it is referenced in the JIRA ticket, I came across a quote on mesos.apache.org about frameworks surviving scheduler failover.
@skonto So, my PR is addressing just the issue of the driver temporarily losing connectivity with the master. My PR does not change the behavior if the driver fails: all the executors will fail also, and if the --supervise flag is set, a completely new framework will be launched, with a new framework ID. I think the quote you found on mesos.apache.org is referring to a different type of framework behavior: if the scheduler fails and restarts, the same framework can continue running (same executors). I do not think that is the way --supervise is intended to work, but in any case, I was not trying to address this in my PR.
I saw this issue you created, https://issues.apache.org/jira/browse/SPARK-21458, referring to supervision; that's why I made the comment. Currently the --supervise flag does not work, for that reason of re-using the same ID: probably because the failover timeout was always set to 0, the Mesos master removes the framework immediately, so no time is given for supervision, and hence the error. I am checking whether it needs to re-use the same executors or not.
@skonto Do you have any other questions? Are there any changes you want me to make in this PR?
@susanxhuynh No, it's fine. @vanzin @srowen could we have a merge please, or would you like to go through it first?
Not familiar with Mesos, so I'll trust you guys. Just minor things to address.
.doc("Amount of time in seconds that the master will wait to hear from the driver, " + | ||
"during a temporary disconnection, before tearing down all the executors.") | ||
.doubleConf | ||
.createWithDefault(0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain that in the .md
file? 0 is generally used for "infinity" so it may be confusing.
@@ -369,6 +369,41 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite
    backend.start()
  }

  test("failover timeout is set in created scheduler driver") {
    val failoverTimeoutIn = 3600.0
    initializeSparkConf(Map("spark.mesos.driver.failoverTimeout" -> failoverTimeoutIn.toString))
Use DRIVER_FAILOVER_TIMEOUT.key.
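A minimal sketch of the adjusted test snippet after that suggestion, assuming the entry is exposed as DRIVER_FAILOVER_TIMEOUT in the Mesos config object; the rest of the test body, which asserts on the created scheduler driver, is unchanged and omitted here:

    // Sketch: reference the config constant rather than the literal key string.
    import org.apache.spark.deploy.mesos.config._

    test("failover timeout is set in created scheduler driver") {
      val failoverTimeoutIn = 3600.0
      initializeSparkConf(Map(DRIVER_FAILOVER_TIMEOUT.key -> failoverTimeoutIn.toString))
      // ... backend setup and assertion on the driver's FrameworkInfo omitted ...
    }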
Test build #79778 has finished for PR 18674 at commit
@vanzin Thanks for the review. I have made the changes you recommended (documenting the zero default value and using the config key).
Merging to master.
@susanxhuynh Thank you for your response. I'll keep you updated on what we learn.
Current behavior: in Mesos cluster mode, the driver failover_timeout is set to zero. If the driver temporarily loses connectivity with the Mesos master, the framework will be torn down and all executors killed.

Proposed change: make the failover_timeout configurable via a new option, spark.mesos.driver.failoverTimeout. The default value is still zero.

Note: with a non-zero failover_timeout, an explicit teardown is needed in some cases. This is captured in https://issues.apache.org/jira/browse/SPARK-21458.

Added a unit test to make sure the config option is set while creating the scheduler driver. Ran an integration test with mesosphere/spark showing that with a non-zero failover_timeout the Spark job finishes after a driver is disconnected from the master.

Author: Susan X. Huynh <xhuynh@mesosphere.com>

Closes apache#18674 from susanxhuynh/sh-mesos-failover-timeout.
What changes were proposed in this pull request?
Current behavior: in Mesos cluster mode, the driver failover_timeout is set to zero. If the driver temporarily loses connectivity with the Mesos master, the framework will be torn down and all executors killed.
Proposed change: make the failover_timeout configurable via a new option, spark.mesos.driver.failoverTimeout. The default value is still zero.
Note: with a non-zero failover_timeout, an explicit teardown is needed in some cases. This is captured in https://issues.apache.org/jira/browse/SPARK-21458.
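For anyone wanting to opt in, a hedged usage sketch on the application side; the app name is hypothetical, and the timeout value of 3600 seconds simply mirrors the one used in the unit test:

    import org.apache.spark.SparkConf

    // Opt in to a one-hour failover window; omitting the setting keeps the old
    // behavior (failover_timeout = 0, i.e. immediate teardown on disconnect).
    val conf = new SparkConf()
      .setAppName("my-mesos-app") // hypothetical
      .set("spark.mesos.driver.failoverTimeout", "3600")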
How was this patch tested?
Added a unit test to make sure the config option is set while creating the scheduler driver.
Ran an integration test with mesosphere/spark showing that with a non-zero failover_timeout the Spark job finishes after a driver is disconnected from the master.