[SPARK-16925] Master should call schedule() after all executor exit events, not only failures #14510

Closed
JoshRosen wants to merge 2 commits into apache:master from JoshRosen:SPARK-16925

Conversation

@JoshRosen (Contributor) commented Aug 5, 2016

What changes were proposed in this pull request?

This patch fixes a bug in Spark's standalone Master which could cause applications to hang if tasks cause executors to exit with zero exit codes.

As an example of the bug, run

```
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
```

on a standalone cluster which has a single Spark application. This will cause all executors to die, but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster (or if an executor exits with a non-zero exit code). This behavior is caused by a bug in how the Master handles the `ExecutorStateChanged` event: the current implementation calls `schedule()` only if the executor exited with a non-zero exit code, so a task which causes a JVM to unexpectedly exit "cleanly" will skip the `schedule()` call.

This patch addresses this by modifying the handling of `ExecutorStateChanged` to call `schedule()` unconditionally. This should be safe because it is always safe to call `schedule()`: extra `schedule()` calls can only affect performance and should not introduce correctness bugs.
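
The change is small but easy to reason about. Below is a minimal, self-contained Scala sketch (not the actual `Master.scala` handler, which also does executor and application bookkeeping; the names here are illustrative) showing why gating `schedule()` on a non-zero exit status misses clean exits:

```scala
// Minimal sketch only: models the scheduling decision, not Spark's real Master.
object ScheduleOnExitSketch {
  final case class ExecutorExited(exitStatus: Int)

  private var scheduleCalls = 0
  private def schedule(): Unit = scheduleCalls += 1 // stand-in for Master.schedule()

  // Old behavior (the bug): only reschedule when the executor exited abnormally,
  // so a clean exit (status 0) never triggers a replacement executor.
  def handleOld(event: ExecutorExited): Unit =
    if (event.exitStatus != 0) schedule()

  // New behavior (the fix): always call schedule(); an extra call is harmless.
  def handleNew(event: ExecutorExited): Unit = schedule()

  def main(args: Array[String]): Unit = {
    handleOld(ExecutorExited(exitStatus = 0))
    println(s"old handler, clean exit: schedule() called $scheduleCalls time(s)") // 0
    scheduleCalls = 0
    handleNew(ExecutorExited(exitStatus = 0))
    println(s"new handler, clean exit: schedule() called $scheduleCalls time(s)") // 1
  }
}
```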

How was this patch tested?

I added a regression test in `DistributedSuite`.
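
The sketch below conveys the idea of such a test (it is not the exact test added to `DistributedSuite`): it assumes Spark's `local-cluster` master, which Spark's own test suites use, plus plain ScalaTest, and the suite and test names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkException}
import org.scalatest.funsuite.AnyFunSuite

class CleanExecutorExitSketchSuite extends AnyFunSuite {

  test("executors that exit cleanly are replaced, so later work still runs") {
    val conf = new SparkConf()
      .setMaster("local-cluster[2, 1, 1024]") // 2 workers, 1 core / 1024 MB each
      .setAppName("SPARK-16925-regression-sketch")
    val sc = new SparkContext(conf)
    try {
      // Kill the executor JVM "cleanly" from inside a task; the job eventually
      // aborts after the task's retries are exhausted.
      intercept[SparkException] {
        sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
      }
      // With the fix, the Master calls schedule() on the clean exits and launches
      // replacement executors, so a follow-up job completes instead of hanging.
      assert(sc.parallelize(1 to 10, 2).count() == 10)
    } finally {
      sc.stop()
    }
  }
}
```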

@JoshRosen changed the title from "[SPARK-16925] Master should call schedule() after all executor exits, not only failures" to "[SPARK-16925] Master should call schedule() after all executor exit events, not only failures" on Aug 5, 2016
@zsxwing (Member) commented Aug 5, 2016

LGTM

@SparkQA commented Aug 5, 2016

Test build #63286 has finished for PR 14510 at commit c567a7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor, Author) commented

I'm going to merge this to master, branch-2.0, and branch-1.6. I have a follow-up patch to add configuration options for controlling the "remove application that has experienced too many back-to-back executor failures" code path, which I'll submit tomorrow.

asfgit pushed a commit that referenced this pull request Aug 7, 2016
[SPARK-16925] Master should call schedule() after all executor exit events, not only failures

## What changes were proposed in this pull request?

This patch fixes a bug in Spark's standalone Master which could cause applications to hang if tasks cause executors to exit with zero exit codes.

As an example of the bug, run

```
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
```

on a standalone cluster which has a single Spark application. This will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster (or if an executor exits with a non-zero exit code). This behavior is caused by a bug in how the Master handles the `ExecutorStateChanged` event: the current implementation calls `schedule()` only if the executor exited with a non-zero exit code, so a task which causes a JVM to unexpectedly exit "cleanly" will skip the `schedule()` call.

This patch addresses this by modifying the handling of `ExecutorStateChanged` to call `schedule()` unconditionally. This should be safe because it is always safe to call `schedule()`: extra `schedule()` calls can only affect performance and should not introduce correctness bugs.

## How was this patch tested?

I added a regression test in `DistributedSuite`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14510 from JoshRosen/SPARK-16925.

(cherry picked from commit 4f5f9b6)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit closed this in 4f5f9b6 Aug 7, 2016
@JoshRosen deleted the SPARK-16925 branch August 7, 2016 02:41
asfgit pushed a commit that referenced this pull request Aug 7, 2016
[SPARK-16925] Master should call schedule() after all executor exit events, not only failures

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14510 from JoshRosen/SPARK-16925.

(cherry picked from commit 4f5f9b6)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Aug 8, 2016
[SPARK-16925] Master should call schedule() after all executor exit events, not only failures

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#14510 from JoshRosen/SPARK-16925.

(cherry picked from commit 4f5f9b6)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
(cherry picked from commit c162886)