
[SPARK-21456][MESOS] Make the driver failover_timeout configurable #18674


Closed

Conversation

susanxhuynh
Contributor

What changes were proposed in this pull request?

Current behavior: in Mesos cluster mode, the driver failover_timeout is set to zero. If the driver temporarily loses connectivity with the Mesos master, the framework will be torn down and all executors killed.

Proposed change: make the failover_timeout configurable via a new option, spark.mesos.driver.failoverTimeout. The default value is still zero.

Note: with non-zero failover_timeout, an explicit teardown is needed in some cases. This is captured in https://issues.apache.org/jira/browse/SPARK-21458
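For example, a user could opt in to a one-minute failover window at submit time (illustrative invocation; the master URL, job class, and timeout value here are made up):

    spark-submit --deploy-mode cluster \
      --master mesos://dispatcher.example.com:7077 \
      --conf spark.mesos.driver.failoverTimeout=60.0 \
      --class com.example.MyJob my-job.jar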

How was this patch tested?

Added a unit test to make sure the config option is set while creating the scheduler driver.

Ran an integration test with mesosphere/spark showing that, with a non-zero failover_timeout, the Spark job finishes even after the driver is disconnected from the master.

@SparkQA

SparkQA commented Jul 18, 2017

Test build #79718 has finished for PR 18674 at commit ae3e5bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ArtRand

ArtRand commented Jul 18, 2017

LGTM

.doc("Amount of time in seconds that the master will wait to hear from the driver, " +
"during a temporary disconnection, before tearing down all the executors.")
.doubleConf
.createWithDefault(0.0)
Contributor

Zero means infinite time?

Contributor Author

No, zero means 0ns. If the driver is disconnected, the Mesos master will wait 0ns for the driver to reconnect (it will tear down the framework immediately). This is the current default value.

Contributor

Can you explain that in the .md file? 0 is generally used for "infinity" so it may be confusing.
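As a sketch, such an entry in running-on-mesos.md might read as follows (wording illustrative, not the text that was actually committed):

    Property Name: spark.mesos.driver.failoverTimeout
    Default: 0.0
    Meaning: Amount of time (in seconds) that the master will wait for the
    driver to reconnect, during a temporary disconnection, before tearing
    down all the executors. A value of 0.0 means the framework is torn down
    immediately; it is not treated as an infinite timeout.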

@skonto
Contributor

skonto commented Jul 19, 2017

@susanxhuynh what are the implications for the dispatcher? The driver runs as a task, so the dispatcher will get a task status update earlier if the task has failed, correct?

@susanxhuynh
Contributor Author

@skonto There is not much change in the dispatcher. The main difference is in the executors.

First of all, if the user does not set this new config, it will default to the old behavior (failover_timeout=0), and everything remains the same. If the driver is temporarily disconnected from the Mesos master, the master tears down the framework immediately, killing all the executors.

If the user sets the new failover_timeout > 0, then if the driver becomes disconnected from the master, the master will wait failover_timeout seconds. It will not kill any executors during this time. If the driver reconnects within the timeout period, then the job continues running uninterrupted.
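A minimal sketch of where the value ends up, assuming the standard Mesos protobuf builder API (the surrounding variable values are illustrative):

    import org.apache.mesos.Protos.FrameworkInfo

    // The value of spark.mesos.driver.failoverTimeout (in seconds) is written
    // into FrameworkInfo.failover_timeout; the Mesos master uses it to decide
    // how long to keep a disconnected framework alive before tearing it down.
    val failoverTimeout = 60.0                // illustrative; 0.0 is the default
    val frameworkInfo = FrameworkInfo.newBuilder()
      .setUser("spark")                       // illustrative user
      .setName("MyApp")                       // illustrative framework name
      .setFailoverTimeout(failoverTimeout)
      .build()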

@susanxhuynh
Contributor Author

In both cases, the dispatcher will get a task status update on the driver when it fails or finishes (no change here).

@skonto
Contributor

skonto commented Jul 19, 2017

LGTM.
So the task updates will not have any effect, cool; the framework teardown will come later on, if the timer expires.

While checking the docs (since this is obviously useful for the --supervise case, and it is referenced in the JIRA ticket):
http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
"After electing a new leading scheduler, the new leader should reconnect to the Mesos master. When registering with the master, the framework should set the id field in its FrameworkInfo to the ID that was assigned to the failed scheduler instance. This ensures that the master will recognize that the connection does not start a new session, but rather continues (and replaces) the session used by the failed scheduler instance."
So now this is also implicitly fixed:
https://jira.mesosphere.com/browse/DCOS_SPARK-8
https://groups.google.com/a/dcos.io/forum/?utm_medium=email&utm_source=footer#!msg/users/CNRlVXOuVjk/_UcEkx_GAgAJ
So with this patch it is now valid to re-launch stuff with the same framework ID (fid) within that period of time, correct?
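The re-registration the guide describes would look roughly like this with the same builder API (an illustration of the general Mesos mechanism, not of what this PR implements, as the next reply clarifies):

    import org.apache.mesos.Protos.{FrameworkID, FrameworkInfo}

    // Re-registering within the failover window: supplying the previous
    // framework ID tells the master this continues the old session, so the
    // existing executors are kept rather than a new framework being started.
    val previousId = FrameworkID.newBuilder()
      .setValue("20170719-000000-0-0001")     // illustrative ID value
      .build()
    val frameworkInfo = FrameworkInfo.newBuilder()
      .setUser("spark")                       // illustrative
      .setName("MyApp")                       // illustrative
      .setId(previousId)
      .setFailoverTimeout(60.0)
      .build()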

@susanxhuynh
Contributor Author

@skonto So, my PR addresses just the issue of the driver temporarily losing connectivity with the master. My PR does not change the behavior if the driver fails: if the driver fails, all the executors fail too, and if the --supervise flag is set, a completely new framework is launched, with a new framework ID. I think the quote you found on mesos.apache.org refers to a different type of framework behavior: if the scheduler fails and restarts, the same framework can continue running (same executors). I do not think that is how --supervise is intended to work, but in any case, I was not trying to address that in my PR.

@skonto
Contributor

skonto commented Jul 19, 2017

I saw this issue you created, https://issues.apache.org/jira/browse/SPARK-21458, referring to supervision; that's why I made the comment. Currently the supervise flag does not work when re-using the same ID, probably because the failover timeout was always set to 0. So the Mesos master removes the framework immediately, no time is given for supervision, and thus the error. I am checking whether it needs to re-use the same executors or not.

@susanxhuynh
Contributor Author

@skonto Do you have any other questions? Are there any changes you want me to make in this PR?

@skonto
Contributor

skonto commented Jul 19, 2017

@susanxhuynh no, it's fine. @vanzin @srowen could we have a merge, please? Or do you want to go through it first?

@vanzin vanzin left a comment (Contributor)

Not familiar with mesos, so I'll trust you guys. Just minor things to address.

.doc("Amount of time in seconds that the master will wait to hear from the driver, " +
"during a temporary disconnection, before tearing down all the executors.")
.doubleConf
.createWithDefault(0.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain that in the .md file? 0 is generally used for "infinity" so it may be confusing.

@@ -369,6 +369,41 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite
    backend.start()
  }

  test("failover timeout is set in created scheduler driver") {
    val failoverTimeoutIn = 3600.0
    initializeSparkConf(Map("spark.mesos.driver.failoverTimeout" -> failoverTimeoutIn.toString))
Contributor
Use DRIVER_FAILOVER_TIMEOUT.key.
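With that suggestion applied, the line would presumably become:

    initializeSparkConf(Map(DRIVER_FAILOVER_TIMEOUT.key -> failoverTimeoutIn.toString))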

@SparkQA

SparkQA commented Jul 19, 2017

Test build #79778 has finished for PR 18674 at commit f4a001f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@susanxhuynh
Contributor Author

@vanzin Thanks for the review. I have made the changes you recommended (documenting the zero default value and using the config key).

@vanzin
Contributor

vanzin commented Jul 19, 2017

Merging to master.

@asfgit asfgit closed this in c42ef95 Jul 19, 2017
@lukasbradley

@susanxhuynh Thank you for your response. I'll keep you updated on what we learn.

susanxhuynh added a commit to d2iq-archive/spark that referenced this pull request Jul 31, 2017
susanxhuynh added a commit to d2iq-archive/spark that referenced this pull request Aug 1, 2017
@susanxhuynh susanxhuynh deleted the sh-mesos-failover-timeout branch September 19, 2017 19:18
susanxhuynh added a commit to d2iq-archive/spark that referenced this pull request Jan 8, 2018