-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-17316][Core] Make CoarseGrainedSchedulerBackend.removeExecutor non-blocking #14882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #64668 has finished for PR 14882 at commit
|
/cc @vanzin |
@@ -407,7 +408,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp | |||
} | |||
|
|||
// Called by subclasses when notified of a lost worker | |||
def removeExecutor(executorId: String, reason: ExecutorLossReason) { | |||
protected def removeExecutor(executorId: String, reason: ExecutorLossReason) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to protected
since it's called only by subclasses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you're here can you fix this to have a proper signature (add return type)?
@@ -416,6 +417,12 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp | |||
} | |||
} | |||
|
|||
protected def removeExecutorAsync( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why we can't just make removeExecutor
always use the async API?
It doesn't look like any of the call sites actually relies on it being synchronous.
If Another crazy option, given that the RPC layer is reliable, is to use |
Yeah, agreed that it's safe to just use |
*/ | ||
protected def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = { | ||
// Only log the failure since we don't care about the result. | ||
driverEndpoint.ask(RemoveExecutor(executorId, reason)).onFailure { case t => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to use ask
since there are some places using ask(RemoveExecutor)
.
Test build #64683 has finished for PR 14882 at commit
|
Test build #64681 has finished for PR 14882 at commit
|
Test build #64677 has finished for PR 14882 at commit
|
Test build #64679 has finished for PR 14882 at commit
|
Test build #64680 has finished for PR 14882 at commit
|
Test build #64684 has finished for PR 14882 at commit
|
LGTM. Merging to master / 2.0. |
… non-blocking ## What changes were proposed in this pull request? StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint. This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value). ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14882 from zsxwing/SPARK-17316. (cherry picked from commit 9bcb33c) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request? The master is broken because #14882 didn't run mesos tests. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14902 from zsxwing/hotfix. (cherry picked from commit d375c8a) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
… non-blocking ## What changes were proposed in this pull request? StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint. This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value). ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14882 from zsxwing/SPARK-17316.
I just checkpicked this one into branch 1.6 |
… non-blocking ## What changes were proposed in this pull request? StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint. This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value). ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#14882 from zsxwing/SPARK-17316. (cherry picked from commit b84a92c)
What changes were proposed in this pull request?
StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint.
This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value).
How was this patch tested?
Jenkins unit tests.