
[SPARK-23020][core] Fix race in SparkAppHandle cleanup, again. #20388


Closed
vanzin wants to merge 3 commits into apache:master from vanzin:SPARK-23020

Conversation

vanzin
Contributor

@vanzin vanzin commented Jan 24, 2018

Third time is the charm?

A race was still left over from the previous attempts. If the handle
closes the connection, the close() implementation would clean up state
that prevented the thread from waiting for the connection thread to
finish. That could trigger the race behind the test flakiness reported
in the bug.

The fix is to move the "wait for connection thread" code into a separate
close method that is used by the handle; that also simplifies the code
a bit and makes it easier to follow.

I included an unrelated, but correct, change to a YARN test so that
the YARN test suite is triggered when the PR is built.

Tested by inserting a sleep in the connection thread to mimic the race;
the test failed reliably with the sleep and passes now. (The sleep is not
included in the patch.) Also ran the YARN tests to make sure.
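
To illustrate the approach, here is a minimal sketch with made-up names (not the actual patch): close() only tears down the connection's own state, while a separate closeAndWait() is what the handle calls to also join the connection thread.

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative only: a connection whose close() does not join any thread,
// plus a closeAndWait() that the handle can call to wait for the
// connection thread to finish.
class ConnectionSketch implements Closeable, Runnable {

  private volatile Thread connectionThread;
  private volatile boolean closed;

  @Override
  public void run() {
    connectionThread = Thread.currentThread();
    // ... read messages from the socket until closed ...
  }

  @Override
  public void close() throws IOException {
    // Clean up connection state only; no waiting here.
    closed = true;
  }

  // Used by the handle: close the connection, then wait for the connection
  // thread to exit, unless the caller *is* the connection thread.
  void closeAndWait() throws IOException {
    close();
    Thread connThread = this.connectionThread;
    if (connThread != null && Thread.currentThread() != connThread) {
      try {
        connThread.join();
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```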

@vanzin
Contributor Author

vanzin commented Jan 24, 2018

@cloud-fan @sameeragarwal since you looked at this before.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86605 has finished for PR 20388 at commit fb14eaa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ServerConnection extends LauncherConnection

@vanzin
Contributor Author

vanzin commented Jan 25, 2018

seems unrelated. retest this please

  private List<Listener> listeners;
- private State state;
+ private AtomicReference<State> state;
Contributor

What's the rationale behind this? Previously we just made sure that all access to and modification of state was synchronized; are you changing it to AtomicReference for performance reasons?

Contributor Author

With the new code, synchronization would cause a deadlock.

  • the handle calls closeAndWait() inside a synchronized block, which joins the connection thread
  • the connection thread would then call setState() on the handle, which would also need that lock, and deadlock

Changing the state should be as thread-safe as before with the new code; a sketch of the idea follows below.
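
A rough sketch of handle-side state kept in an AtomicReference (the enum values and the setState/getState shapes here are assumptions based on this discussion, not the actual Spark code): the connection thread can update the state without taking the handle's monitor, so it cannot deadlock against a thread that holds the monitor while joining it.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative handle state, not the actual Spark code.
class HandleStateSketch {

  enum State {
    CONNECTED, RUNNING, FINISHED, FAILED, LOST;

    boolean isFinal() {
      return this == FINISHED || this == FAILED || this == LOST;
    }
  }

  // AtomicReference instead of synchronized getters/setters: setState() can be
  // called from the connection thread even while another thread holds the
  // handle's monitor (e.g. while it joins the connection thread).
  private final AtomicReference<State> state = new AtomicReference<>(State.CONNECTED);

  void setState(State s) {
    setState(s, false);
  }

  void setState(State s, boolean force) {
    if (force) {
      state.set(s);
      return;
    }
    // Only overwrite non-final states, matching the previous behavior.
    State current = state.get();
    while (!current.isFinal()) {
      if (state.compareAndSet(current, s)) {
        return;
      }
      current = state.get();
    }
  }

  State getState() {
    return state.get();
  }
}
```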

Contributor

makes sense

  }
- } else if (!currState.isFinal()) {
-   newState = State.LOST;
Contributor

why is this not needed anymore?

Contributor Author

This is done by dispose() already.

@@ -99,8 +100,6 @@ boolean isDisposed() {
   */
  synchronized void dispose() {
    if (!isDisposed()) {
      // Unregister first to make sure that the connection with the app has been really
      // terminated.
      server.unregister(this);
      if (!getState().isFinal()) {
Contributor

do we need this if? setState would do nothing if state is final.

Contributor Author

It's not necessary, but it makes it clear that the code only wants to override the state in that case without having to read the code for setState.

(Or I could call setState(LOST, false) explicitly, I guess.)

    close();

    Thread connThread = this.connectionThread;
    if (Thread.currentThread() != connThread) {
Contributor

Is this a safeguard, or can it really happen? I.e., can the connection thread call closeAndWait?

Contributor Author

More of a safeguard. I don't think it would happen in this version of the code, but better be safe.
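
As an aside on why the current-thread check matters: Thread.join() waits for the target thread to die, so a thread joining itself would wait forever. A tiny standalone demo (hypothetical, not Spark code) that uses a timeout so it actually returns:

```java
// Standalone demo of why the current-thread check matters; not Spark code.
public class SelfJoinDemo {
  public static void main(String[] args) throws InterruptedException {
    Thread self = Thread.currentThread();
    // join() waits for the target thread to terminate; waiting on ourselves
    // would block forever, so a timeout is used here just to demonstrate.
    self.join(1000);
    System.out.println("returned only because of the 1s timeout");
  }
}
```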

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86616 has finished for PR 20388 at commit fb14eaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ServerConnection extends LauncherConnection

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86656 has finished for PR 20388 at commit 358caff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jan 25, 2018

Same flaky test. retest this please

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86654 has finished for PR 20388 at commit b597705.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 26, 2018

Test build #86667 has finished for PR 20388 at commit 358caff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 26, 2018
[SPARK-23020][core] Fix race in SparkAppHandle cleanup, again.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #20388 from vanzin/SPARK-23020.

(cherry picked from commit 70a68b3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 70a68b3 Jan 26, 2018
@vanzin vanzin deleted the SPARK-23020 branch January 30, 2018 18:47