
[SPARK-23020][core] Fix race in SparkAppHandle cleanup, again. #20388


Closed
vanzin wants to merge 3 commits into apache:master from vanzin:SPARK-23020

Conversation

vanzin
Contributor

@vanzin vanzin commented Jan 24, 2018

Third time is the charm?

A race was still left over from the previous attempts. If the handle
closes the connection, the close() implementation would clean up state
that prevented the thread from waiting for the connection thread to
finish. That could trigger the race behind the test flakiness reported
in the bug.

The fix is to move the "wait for connection thread" code into a separate
close method that is used by the handle; that also simplifies the code
a bit and makes it easier to follow.

I included an unrelated, but correct, change to a YARN test so that
the YARN test suite is triggered when the PR is built.

Tested by inserting a sleep in the connection thread to mimic the race;
the test failed reliably with the sleep and passes now. (The sleep is not
included in the patch.) Also ran the YARN tests to make sure.
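
To illustrate the approach, here is a minimal sketch with made-up names (not the actual patch): close() only tears down the connection's own state, while a separate closeAndWait() is what the handle calls to also join the connection thread.

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative only: a connection whose close() does not join any thread,
// plus a closeAndWait() that the handle can call to wait for the
// connection thread to finish.
class ConnectionSketch implements Closeable, Runnable {

  private volatile Thread connectionThread;
  private volatile boolean closed;

  @Override
  public void run() {
    connectionThread = Thread.currentThread();
    // ... read messages from the socket until closed ...
  }

  @Override
  public void close() throws IOException {
    // Clean up connection state only; no waiting here.
    closed = true;
  }

  // Used by the handle: close the connection, then wait for the connection
  // thread to exit, unless the caller *is* the connection thread.
  void closeAndWait() throws IOException {
    close();
    Thread connThread = this.connectionThread;
    if (connThread != null && Thread.currentThread() != connThread) {
      try {
        connThread.join();
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```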

@vanzin
Contributor Author

vanzin commented Jan 24, 2018

@cloud-fan @sameeragarwal since you looked at this before.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86605 has finished for PR 20388 at commit fb14eaa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ServerConnection extends LauncherConnection

@vanzin
Contributor Author

vanzin commented Jan 25, 2018

seems unrelated. retest this please

  private List<Listener> listeners;
- private State state;
+ private AtomicReference<State> state;
Contributor

What's the rationale behind this? Previously we just made sure that all access to and modification of state was synchronized; are you changing it to AtomicReference for performance reasons?

Contributor Author

With the new code, synchronization would cause a deadlock.

  • the handle calls closeAndWait() inside a synchronized block, which joins the connection thread
  • the connection thread would then call setState() on the handle, which would also need that lock, and deadlock

Changing the state should be as thread-safe as before with the new code; a sketch of the idea follows below.
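
A rough sketch of handle-side state kept in an AtomicReference (the enum values and the setState/getState shapes here are assumptions based on this discussion, not the actual Spark code): the connection thread can update the state without taking the handle's monitor, so it cannot deadlock against a thread that holds the monitor while joining it.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative handle state, not the actual Spark code.
class HandleStateSketch {

  enum State {
    CONNECTED, RUNNING, FINISHED, FAILED, LOST;

    boolean isFinal() {
      return this == FINISHED || this == FAILED || this == LOST;
    }
  }

  // AtomicReference instead of synchronized getters/setters: setState() can be
  // called from the connection thread even while another thread holds the
  // handle's monitor (e.g. while it joins the connection thread).
  private final AtomicReference<State> state = new AtomicReference<>(State.CONNECTED);

  void setState(State s) {
    setState(s, false);
  }

  void setState(State s, boolean force) {
    if (force) {
      state.set(s);
      return;
    }
    // Only overwrite non-final states, matching the previous behavior.
    State current = state.get();
    while (!current.isFinal()) {
      if (state.compareAndSet(current, s)) {
        return;
      }
      current = state.get();
    }
  }

  State getState() {
    return state.get();
  }
}
```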

Contributor

makes sense

  }
- } else if (!currState.isFinal()) {
-   newState = State.LOST;
Contributor

why is this not needed anymore?

Contributor Author

This is done by dispose() already.

@@ -99,8 +100,6 @@ boolean isDisposed() {
   */
  synchronized void dispose() {
    if (!isDisposed()) {
      // Unregister first to make sure that the connection with the app has been really
      // terminated.
      server.unregister(this);
      if (!getState().isFinal()) {
Contributor

do we need this if? setState would do nothing if state is final.

Contributor Author

It's not necessary, but it makes it clear that the code only wants to override the state in that case without having to read the code for setState.

(Or I could call setState(LOST, false) explicitly, I guess.)

    close();

    Thread connThread = this.connectionThread;
    if (Thread.currentThread() != connThread) {
Contributor

Is this a safeguard, or can it really happen? I.e., can the connection thread call closeAndWait?

Contributor Author

More of a safeguard. I don't think it would happen in this version of the code, but better be safe.
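
As an aside on why the current-thread check matters: Thread.join() waits for the target thread to die, so a thread joining itself would wait forever. A tiny standalone demo (hypothetical, not Spark code) that uses a timeout so it actually returns:

```java
// Standalone demo of why the current-thread check matters; not Spark code.
public class SelfJoinDemo {
  public static void main(String[] args) throws InterruptedException {
    Thread self = Thread.currentThread();
    // join() waits for the target thread to terminate; waiting on ourselves
    // would block forever, so a timeout is used here just to demonstrate.
    self.join(1000);
    System.out.println("returned only because of the 1s timeout");
  }
}
```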

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86616 has finished for PR 20388 at commit fb14eaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ServerConnection extends LauncherConnection

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86656 has finished for PR 20388 at commit 358caff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jan 25, 2018

Same flaky test. retest this please

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86654 has finished for PR 20388 at commit b597705.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 26, 2018

Test build #86667 has finished for PR 20388 at commit 358caff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 26, 2018
[SPARK-23020][core] Fix race in SparkAppHandle cleanup, again.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #20388 from vanzin/SPARK-23020.

(cherry picked from commit 70a68b3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 70a68b3 Jan 26, 2018
@vanzin vanzin deleted the SPARK-23020 branch January 30, 2018 18:47