[SPARK-16925] Master should call schedule() after all executor exit events, not only failures #14510

Closed
JoshRosen wants to merge 2 commits into apache:master from JoshRosen:SPARK-16925

Conversation

@JoshRosen (Contributor) commented Aug 5, 2016

What changes were proposed in this pull request?

This patch fixes a bug in Spark's standalone Master which could cause applications to hang if tasks cause executors to exit with zero exit codes.

As an example of the bug, run

```
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
```

on a standalone cluster which has a single Spark application. This will cause all executors to die, but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster (or if an executor exits with a non-zero exit code). This behavior is caused by a bug in how the Master handles the `ExecutorStateChanged` event: the current implementation calls `schedule()` only if the executor exited with a non-zero exit code, so a task which causes a JVM to unexpectedly exit "cleanly" will skip the `schedule()` call.

This patch addresses this by modifying the handling of `ExecutorStateChanged` to call `schedule()` unconditionally. This should be safe because it is always safe to call `schedule()`: extra `schedule()` calls can only affect performance and should not introduce correctness bugs.
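
The change is small but easy to reason about. Below is a minimal, self-contained Scala sketch (not the actual `Master.scala` handler, which also does executor and application bookkeeping; the names here are illustrative) showing why gating `schedule()` on a non-zero exit status misses clean exits:

```scala
// Minimal sketch only: models the scheduling decision, not Spark's real Master.
object ScheduleOnExitSketch {
  final case class ExecutorExited(exitStatus: Int)

  private var scheduleCalls = 0
  private def schedule(): Unit = scheduleCalls += 1 // stand-in for Master.schedule()

  // Old behavior (the bug): only reschedule when the executor exited abnormally,
  // so a clean exit (status 0) never triggers a replacement executor.
  def handleOld(event: ExecutorExited): Unit =
    if (event.exitStatus != 0) schedule()

  // New behavior (the fix): always call schedule(); an extra call is harmless.
  def handleNew(event: ExecutorExited): Unit = schedule()

  def main(args: Array[String]): Unit = {
    handleOld(ExecutorExited(exitStatus = 0))
    println(s"old handler, clean exit: schedule() called $scheduleCalls time(s)") // 0
    scheduleCalls = 0
    handleNew(ExecutorExited(exitStatus = 0))
    println(s"new handler, clean exit: schedule() called $scheduleCalls time(s)") // 1
  }
}
```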

How was this patch tested?

I added a regression test in `DistributedSuite`.
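
The sketch below conveys the idea of such a test (it is not the exact test added to `DistributedSuite`): it assumes Spark's `local-cluster` master, which Spark's own test suites use, plus plain ScalaTest, and the suite and test names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkException}
import org.scalatest.funsuite.AnyFunSuite

class CleanExecutorExitSketchSuite extends AnyFunSuite {

  test("executors that exit cleanly are replaced, so later work still runs") {
    val conf = new SparkConf()
      .setMaster("local-cluster[2, 1, 1024]") // 2 workers, 1 core / 1024 MB each
      .setAppName("SPARK-16925-regression-sketch")
    val sc = new SparkContext(conf)
    try {
      // Kill the executor JVM "cleanly" from inside a task; the job eventually
      // aborts after the task's retries are exhausted.
      intercept[SparkException] {
        sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
      }
      // With the fix, the Master calls schedule() on the clean exits and launches
      // replacement executors, so a follow-up job completes instead of hanging.
      assert(sc.parallelize(1 to 10, 2).count() == 10)
    } finally {
      sc.stop()
    }
  }
}
```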

@JoshRosen changed the title from "[SPARK-16925] Master should call schedule() after all executor exits, not only failures" to "[SPARK-16925] Master should call schedule() after all executor exit events, not only failures" on Aug 5, 2016
@zsxwing (Member) commented Aug 5, 2016

LGTM

@SparkQA commented Aug 5, 2016

Test build #63286 has finished for PR 14510 at commit c567a7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor, Author) commented

I'm going to merge this to master, branch-2.0, and branch-1.6. I have a follow-up patch to add configuration options for controlling the "remove application that has experienced too many back-to-back executor failures" code path, which I'll submit tomorrow.

asfgit pushed a commit that referenced this pull request Aug 7, 2016
[SPARK-16925] Master should call schedule() after all executor exit events, not only failures

## What changes were proposed in this pull request?

This patch fixes a bug in Spark's standalone Master which could cause applications to hang if tasks cause executors to exit with zero exit codes.

As an example of the bug, run

```
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
```

on a standalone cluster which has a single Spark application. This will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster (or if an executor exits with a non-zero exit code). This behavior is caused by a bug in how the Master handles the `ExecutorStateChanged` event: the current implementation calls `schedule()` only if the executor exited with a non-zero exit code, so a task which causes a JVM to unexpectedly exit "cleanly" will skip the `schedule()` call.

This patch addresses this by modifying the handling of `ExecutorStateChanged` to call `schedule()` unconditionally. This should be safe because it is always safe to call `schedule()`: extra `schedule()` calls can only affect performance and should not introduce correctness bugs.

## How was this patch tested?

I added a regression test in `DistributedSuite`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14510 from JoshRosen/SPARK-16925.

(cherry picked from commit 4f5f9b6)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit closed this in 4f5f9b6 Aug 7, 2016
@JoshRosen deleted the SPARK-16925 branch August 7, 2016 02:41
asfgit pushed a commit that referenced this pull request Aug 7, 2016
[SPARK-16925] Master should call schedule() after all executor exit events, not only failures

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14510 from JoshRosen/SPARK-16925.

(cherry picked from commit 4f5f9b6)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Aug 8, 2016
[SPARK-16925] Master should call schedule() after all executor exit events, not only failures

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#14510 from JoshRosen/SPARK-16925.

(cherry picked from commit 4f5f9b6)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
(cherry picked from commit c162886)