
[SPARK-21456][MESOS] Make the driver failover_timeout configurable #18674


Closed

Conversation

susanxhuynh
Contributor

What changes were proposed in this pull request?

Current behavior: in Mesos cluster mode, the driver failover_timeout is set to zero. If the driver temporarily loses connectivity with the Mesos master, the framework will be torn down and all executors killed.

Proposed change: make the failover_timeout configurable via a new option, spark.mesos.driver.failoverTimeout. The default value is still zero.

Note: with non-zero failover_timeout, an explicit teardown is needed in some cases. This is captured in https://issues.apache.org/jira/browse/SPARK-21458
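For example, a user could opt in to a one-minute failover window at submit time (illustrative invocation; the master URL, job class, and timeout value here are made up):

    spark-submit --deploy-mode cluster \
      --master mesos://dispatcher.example.com:7077 \
      --conf spark.mesos.driver.failoverTimeout=60.0 \
      --class com.example.MyJob my-job.jar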

How was this patch tested?

Added a unit test to make sure the config option is set while creating the scheduler driver.

Ran an integration test with mesosphere/spark showing that, with a non-zero failover_timeout, the Spark job finishes even after the driver is disconnected from the master.

@SparkQA

SparkQA commented Jul 18, 2017

Test build #79718 has finished for PR 18674 at commit ae3e5bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ArtRand

ArtRand commented Jul 18, 2017

LGTM

.doc("Amount of time in seconds that the master will wait to hear from the driver, " +
"during a temporary disconnection, before tearing down all the executors.")
.doubleConf
.createWithDefault(0.0)
Contributor

Zero means infinite time?

Contributor Author

No, zero means 0ns. If the driver is disconnected, the Mesos master will wait 0ns for the driver to reconnect (it will tear down the framework immediately). This is the current default value.

Contributor

Can you explain that in the .md file? 0 is generally used for "infinity" so it may be confusing.
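As a sketch, such an entry in running-on-mesos.md might read as follows (wording illustrative, not the text that was actually committed):

    Property Name: spark.mesos.driver.failoverTimeout
    Default: 0.0
    Meaning: Amount of time (in seconds) that the master will wait for the
    driver to reconnect, during a temporary disconnection, before tearing
    down all the executors. A value of 0.0 means the framework is torn down
    immediately; it is not treated as an infinite timeout.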

@skonto
Contributor

skonto commented Jul 19, 2017

@susanxhuynh what are the implications for the dispatcher? The driver runs as a task, so the dispatcher will get a task status update earlier if the task has failed, correct?

@susanxhuynh
Contributor Author

@skonto There is not much change in the dispatcher. The main difference is in the executors.

First of all, if the user does not set this new config, it will default to the old behavior (failover_timeout=0), and everything remains the same. If the driver is temporarily disconnected from the Mesos master, the master tears down the framework immediately, killing all the executors.

If the user sets the new failover_timeout > 0, then if the driver becomes disconnected from the master, the master will wait failover_timeout seconds. It will not kill any executors during this time. If the driver reconnects within the timeout period, then the job continues running uninterrupted.
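A minimal sketch of where the value ends up, assuming the standard Mesos protobuf builder API (the surrounding variable values are illustrative):

    import org.apache.mesos.Protos.FrameworkInfo

    // The value of spark.mesos.driver.failoverTimeout (in seconds) is written
    // into FrameworkInfo.failover_timeout; the Mesos master uses it to decide
    // how long to keep a disconnected framework alive before tearing it down.
    val failoverTimeout = 60.0                // illustrative; 0.0 is the default
    val frameworkInfo = FrameworkInfo.newBuilder()
      .setUser("spark")                       // illustrative user
      .setName("MyApp")                       // illustrative framework name
      .setFailoverTimeout(failoverTimeout)
      .build()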

@susanxhuynh
Contributor Author

In both cases, the dispatcher will get a task status update on the driver when it fails or finishes (no change here).

@skonto
Contributor

skonto commented Jul 19, 2017

LGTM.
So the task updates will not have any effect, cool; the framework teardown will come later on, if the timer expires.

While checking the docs (since this is obviously useful for the --supervise case, and it is referenced in the JIRA ticket):
http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
"After electing a new leading scheduler, the new leader should reconnect to the Mesos master. When registering with the master, the framework should set the id field in its FrameworkInfo to the ID that was assigned to the failed scheduler instance. This ensures that the master will recognize that the connection does not start a new session, but rather continues (and replaces) the session used by the failed scheduler instance."
So now this is also implicitly fixed:
https://jira.mesosphere.com/browse/DCOS_SPARK-8
https://groups.google.com/a/dcos.io/forum/?utm_medium=email&utm_source=footer#!msg/users/CNRlVXOuVjk/_UcEkx_GAgAJ
So with this patch it is now valid to re-launch stuff with the same framework ID (fid) within that period of time, correct?
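The re-registration the guide describes would look roughly like this with the same builder API (an illustration of the general Mesos mechanism, not of what this PR implements, as the next reply clarifies):

    import org.apache.mesos.Protos.{FrameworkID, FrameworkInfo}

    // Re-registering within the failover window: supplying the previous
    // framework ID tells the master this continues the old session, so the
    // existing executors are kept rather than a new framework being started.
    val previousId = FrameworkID.newBuilder()
      .setValue("20170719-000000-0-0001")     // illustrative ID value
      .build()
    val frameworkInfo = FrameworkInfo.newBuilder()
      .setUser("spark")                       // illustrative
      .setName("MyApp")                       // illustrative
      .setId(previousId)
      .setFailoverTimeout(60.0)
      .build()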

@susanxhuynh
Contributor Author

@skonto So, my PR addresses just the issue of the driver temporarily losing connectivity with the master. My PR does not change the behavior if the driver fails: if the driver fails, all the executors fail too, and if the --supervise flag is set, a completely new framework is launched, with a new framework ID. I think the quote you found on mesos.apache.org refers to a different type of framework behavior: if the scheduler fails and restarts, the same framework can continue running (same executors). I do not think that is how --supervise is intended to work, but in any case, I was not trying to address that in my PR.

@skonto
Contributor

skonto commented Jul 19, 2017

I saw this issue you created, https://issues.apache.org/jira/browse/SPARK-21458, referring to supervision; that's why I made the comment. Currently the supervise flag does not work when re-using the same ID, probably because the failover timeout was always set to 0. So the Mesos master removes the framework immediately, no time is given for supervision, and thus the error. I am checking whether it needs to re-use the same executors or not.

@susanxhuynh
Contributor Author

@skonto Do you have any other questions? Are there any changes you want me to make in this PR?

@skonto
Contributor

skonto commented Jul 19, 2017

@susanxhuynh no, it's fine. @vanzin @srowen could we have a merge, please? Or do you want to go through it first?

@vanzin vanzin left a comment (Contributor)

Not familiar with mesos, so I'll trust you guys. Just minor things to address.

.doc("Amount of time in seconds that the master will wait to hear from the driver, " +
"during a temporary disconnection, before tearing down all the executors.")
.doubleConf
.createWithDefault(0.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain that in the .md file? 0 is generally used for "infinity" so it may be confusing.

@@ -369,6 +369,41 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite
    backend.start()
  }

  test("failover timeout is set in created scheduler driver") {
    val failoverTimeoutIn = 3600.0
    initializeSparkConf(Map("spark.mesos.driver.failoverTimeout" -> failoverTimeoutIn.toString))
Contributor
Use DRIVER_FAILOVER_TIMEOUT.key.
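With that suggestion applied, the line would presumably become:

    initializeSparkConf(Map(DRIVER_FAILOVER_TIMEOUT.key -> failoverTimeoutIn.toString))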

@SparkQA

SparkQA commented Jul 19, 2017

Test build #79778 has finished for PR 18674 at commit f4a001f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@susanxhuynh
Contributor Author

@vanzin Thanks for the review. I have made the changes you recommended (documenting the zero default value and using the config key).

@vanzin
Contributor

vanzin commented Jul 19, 2017

Merging to master.

@asfgit asfgit closed this in c42ef95 Jul 19, 2017
@lukasbradley

@susanxhuynh Thank you for your response. I'll keep you updated on what we learn.

susanxhuynh added a commit to d2iq-archive/spark that referenced this pull request Jul 31, 2017
susanxhuynh added a commit to d2iq-archive/spark that referenced this pull request Aug 1, 2017
@susanxhuynh susanxhuynh deleted the sh-mesos-failover-timeout branch September 19, 2017 19:18
susanxhuynh added a commit to d2iq-archive/spark that referenced this pull request Jan 8, 2018