Skip to content

[SPARK-17316][Core] Make CoarseGrainedSchedulerBackend.removeExecutor non-blocking #14882

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

Conversation

zsxwing
Copy link
Member

@zsxwing zsxwing commented Aug 30, 2016

What changes were proposed in this pull request?

StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint.

This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value).

How was this patch tested?

Jenkins unit tests.

@SparkQA
Copy link

SparkQA commented Aug 30, 2016

Test build #64668 has finished for PR 14882 at commit dac8aa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Aug 30, 2016

/cc @vanzin

@@ -407,7 +408,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
}

// Called by subclasses when notified of a lost worker
def removeExecutor(executorId: String, reason: ExecutorLossReason) {
protected def removeExecutor(executorId: String, reason: ExecutorLossReason) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to protected since it's called only by subclasses

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you're here can you fix this to have a proper signature (add return type)?

@zsxwing zsxwing changed the title [SPARK-17316] Don't block StandaloneSchedulerBackend.executorRemoved [SPARK-17316][Core] Don't block StandaloneSchedulerBackend.executorRemoved Aug 30, 2016
@@ -416,6 +417,12 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
}
}

protected def removeExecutorAsync(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we can't just make removeExecutor always use the async API?

It doesn't look like any of the call sites actually relies on it being synchronous.

@vanzin
Copy link
Contributor

vanzin commented Aug 30, 2016

If removeExecutor can be made asynchronous I'd prefer that, otherwise this LGTM.

Another crazy option, given that the RPC layer is reliable, is to use RpcEnv.send, since the return value here is not used for anything.

@zsxwing
Copy link
Member Author

zsxwing commented Aug 31, 2016

Yeah, agreed that it's safe to just use send. Updated.

@zsxwing zsxwing changed the title [SPARK-17316][Core] Don't block StandaloneSchedulerBackend.executorRemoved [SPARK-17316][Core] Make CoarseGrainedSchedulerBackend.removeExecutor non-blocking Aug 31, 2016
*/
protected def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
// Only log the failure since we don't care about the result.
driverEndpoint.ask(RemoveExecutor(executorId, reason)).onFailure { case t =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to use ask since there are some places using ask(RemoveExecutor).

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64683 has finished for PR 14882 at commit 0cfc282.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64681 has finished for PR 14882 at commit fa724dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64677 has finished for PR 14882 at commit 142de67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64679 has finished for PR 14882 at commit dcd15e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64680 has finished for PR 14882 at commit 50b831f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2016

Test build #64684 has finished for PR 14882 at commit 489dd8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Aug 31, 2016

LGTM. Merging to master / 2.0.

asfgit pushed a commit that referenced this pull request Aug 31, 2016
… non-blocking

## What changes were proposed in this pull request?

StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint.

This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value).

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14882 from zsxwing/SPARK-17316.

(cherry picked from commit 9bcb33c)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit asfgit closed this in 9bcb33c Aug 31, 2016
@zsxwing zsxwing deleted the SPARK-17316 branch August 31, 2016 18:01
asfgit pushed a commit that referenced this pull request Aug 31, 2016
## What changes were proposed in this pull request?

The master is broken because #14882 didn't run mesos tests.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14902 from zsxwing/hotfix.
asfgit pushed a commit that referenced this pull request Aug 31, 2016
## What changes were proposed in this pull request?

The master is broken because #14882 didn't run mesos tests.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14902 from zsxwing/hotfix.

(cherry picked from commit d375c8a)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 2, 2016
… non-blocking

## What changes were proposed in this pull request?

StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint.

This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value).

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14882 from zsxwing/SPARK-17316.
@zsxwing
Copy link
Member Author

zsxwing commented Sep 2, 2016

I just checkpicked this one into branch 1.6

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Sep 4, 2016
… non-blocking

## What changes were proposed in this pull request?

StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint.

This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value).

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#14882 from zsxwing/SPARK-17316.

(cherry picked from commit b84a92c)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants