-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-23020][core] Fix race in SparkAppHandle cleanup, again. #20388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Third time is the charm? There was still a race that was left in previous attempts. If the handle closes the connection, the close() implementation would clean up state that would prevent the thread from waiting on the connection thread to finish. That could cause the race causing the test flakiness reported in the bug. The fix is to move the "wait for connection thread" code to a separate close method that is used by the handle; that also simplifies the code a bit and makes it also easier to follow. I included an unrelated, but correct, change to a YARN test so that it triggers when the PR is built. Tested by inserting a sleep in the connection thread to mimic the race; test failed reliably with the sleep, passes now. (Sleep not included in the patch.) Also ran YARN tests to make sure.
@cloud-fan @sameeragarwal since you looked at this before. |
Test build #86605 has finished for PR 20388 at commit
|
seems unrelated. retest this please |
private List<Listener> listeners; | ||
private State state; | ||
private AtomicReference<State> state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale behind this? Previously we just make sure all access and modification to state is synchronized, are you changing it to AtomicReference
for performance reasons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new code, synchronization would cause a deadlock.
- handle calls
closeAndWait()
inside synchronized block which joins connection thread - connection thread would call
setState()
on the handle and cause a deadlock
Changing the state should be as thread-safe as before with the new code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
} | ||
} else if (!currState.isFinal()) { | ||
newState = State.LOST; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not needed anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done by dispose()
already.
@@ -99,8 +100,6 @@ boolean isDisposed() { | |||
*/ | |||
synchronized void dispose() { | |||
if (!isDisposed()) { | |||
// Unregister first to make sure that the connection with the app has been really | |||
// terminated. | |||
server.unregister(this); | |||
if (!getState().isFinal()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this if
? setState
would do nothing if state is final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessary, but it makes it clear that the code only wants to override the state in that case without having to read the code for setState
.
(Or I could call setState(LOST, false)
explicitly, I guess.)
close(); | ||
|
||
Thread connThread = this.connectionThread; | ||
if (Thread.currentThread() != connThread) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a safeguard or it can really happen? i.e. the connection thread calls closeAndWait
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More of a safeguard. I don't think it would happen in this version of the code, but better be safe.
Test build #86616 has finished for PR 20388 at commit
|
Test build #86656 has finished for PR 20388 at commit
|
Same flaky test. retest this please |
Test build #86654 has finished for PR 20388 at commit
|
Test build #86667 has finished for PR 20388 at commit
|
thanks, merging to master/2.3! |
Third time is the charm? There was still a race that was left in previous attempts. If the handle closes the connection, the close() implementation would clean up state that would prevent the thread from waiting on the connection thread to finish. That could cause the race causing the test flakiness reported in the bug. The fix is to move the "wait for connection thread" code to a separate close method that is used by the handle; that also simplifies the code a bit and makes it also easier to follow. I included an unrelated, but correct, change to a YARN test so that it triggers when the PR is built. Tested by inserting a sleep in the connection thread to mimic the race; test failed reliably with the sleep, passes now. (Sleep not included in the patch.) Also ran YARN tests to make sure. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20388 from vanzin/SPARK-23020. (cherry picked from commit 70a68b3) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Third time is the charm?
There was still a race that was left in previous attempts. If the handle
closes the connection, the close() implementation would clean up state
that would prevent the thread from waiting on the connection thread to
finish. That could cause the race causing the test flakiness reported
in the bug.
The fix is to move the "wait for connection thread" code to a separate
close method that is used by the handle; that also simplifies the code
a bit and makes it also easier to follow.
I included an unrelated, but correct, change to a YARN test so that
it triggers when the PR is built.
Tested by inserting a sleep in the connection thread to mimic the race;
test failed reliably with the sleep, passes now. (Sleep not included in
the patch.) Also ran YARN tests to make sure.