Skip to content

Commit

Permalink
[SPARK-36532][CORE] Fix deadlock in CoarseGrainedExecutorBackend.onDi…
Browse files Browse the repository at this point in the history
…sconnected to avoid executor shutdown hang

### What changes were proposed in this pull request?

Instead of exiting the executor within the RpcEnv's thread, exit the executor in a separate thread.

### Why are the changes needed?

The current exit way in `onDisconnected` can cause the deadlock, which has the exact same root cause with apache#12012:

* `onDisconnected` -> `System.exit` are called in sequence in the thread of `MessageLoop.threadpool`
* `System.exit` triggers shutdown hooks and `executor.stop` is one of the hooks.
* `executor.stop` stops the `Dispatcher`, which waits for the `MessageLoop.threadpool`  to shutdown further.
* Thus, the thread which runs `System.exit` waits for hooks to be done, but the `MessageLoop.threadpool` in the hook waits that thread to finish. Finally, this mutual dependence results in the deadlock.

### Does this PR introduce _any_ user-facing change?

Yes, the executor shutdown won't hang.

### How was this patch tested?

Pass existing tests.

Closes apache#33759 from Ngone51/fix-executor-shutdown-hang.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  • Loading branch information
Ngone51 authored and cloud-fan committed Aug 18, 2021
1 parent a1ecf83 commit 996551f
Showing 1 changed file with 12 additions and 7 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -202,11 +202,17 @@ private[spark] class CoarseGrainedExecutorBackend(
stopping.set(true)
new Thread("CoarseGrainedExecutorBackend-stop-executor") {
override def run(): Unit = {
// executor.stop() will call `SparkEnv.stop()` which waits until RpcEnv stops totally.
// However, if `executor.stop()` runs in some thread of RpcEnv, RpcEnv won't be able to
// stop until `executor.stop()` returns, which becomes a dead-lock (See SPARK-14180).
// Therefore, we put this line in a new thread.
executor.stop()
// `executor` can be null if there's any error in `CoarseGrainedExecutorBackend.onStart`
// or fail to create `Executor`.
if (executor == null) {
System.exit(1)
} else {
// executor.stop() will call `SparkEnv.stop()` which waits until RpcEnv stops totally.
// However, if `executor.stop()` runs in some thread of RpcEnv, RpcEnv won't be able to
// stop until `executor.stop()` returns, which becomes a dead-lock (See SPARK-14180).
// Therefore, we put this line in a new thread.
executor.stop()
}
}
}.start()

Expand Down Expand Up @@ -286,8 +292,7 @@ private[spark] class CoarseGrainedExecutorBackend(
if (notifyDriver && driver.nonEmpty) {
driver.get.send(RemoveExecutor(executorId, new ExecutorLossReason(reason)))
}

System.exit(code)
self.send(Shutdown)
} else {
logInfo("Skip exiting executor since it's been already asked to exit before.")
}
Expand Down

0 comments on commit 996551f

Please sign in to comment.