-
Notifications
You must be signed in to change notification settings - Fork 28.5k
[SPARK-23020][CORE] Fix races in launcher code, test. #20297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -217,6 +217,33 @@ void unregister(AbstractAppHandle handle) { | |
break; | ||
} | ||
} | ||
|
||
// If there is a live connection for this handle, we need to wait for it to finish before | ||
// returning, otherwise there might be a race between the connection thread processing | ||
// buffered data and the handle cleaning up after itself, leading to potentially the wrong | ||
// state being reported for the handle. | ||
ServerConnection conn = null; | ||
synchronized (clients) { | ||
for (ServerConnection c : clients) { | ||
if (c.handle == handle) { | ||
conn = c; | ||
break; | ||
} | ||
} | ||
} | ||
|
||
if (conn != null) { | ||
synchronized (conn) { | ||
if (conn.isOpen()) { | ||
try { | ||
conn.wait(); | ||
} catch (InterruptedException ie) { | ||
// Ignore. | ||
} | ||
} | ||
} | ||
} | ||
|
||
unref(); | ||
} | ||
|
||
|
@@ -288,7 +315,7 @@ private String createSecret() { | |
private class ServerConnection extends LauncherConnection { | ||
|
||
private TimerTask timeout; | ||
private AbstractAppHandle handle; | ||
volatile AbstractAppHandle handle; | ||
|
||
ServerConnection(Socket socket, TimerTask timeout) throws IOException { | ||
super(socket); | ||
|
@@ -313,7 +340,7 @@ protected void handle(Message msg) throws IOException { | |
} else { | ||
if (handle == null) { | ||
throw new IllegalArgumentException("Expected hello, got: " + | ||
msg != null ? msg.getClass().getName() : null); | ||
msg != null ? msg.getClass().getName() : null); | ||
} | ||
if (msg instanceof SetAppId) { | ||
SetAppId set = (SetAppId) msg; | ||
|
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException { | |
timeout.cancel(); | ||
} | ||
close(); | ||
if (handle != null) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the fix for the new race described in the summary (the code is moved here from below). This changes behavior slightly: the handle now waits for the child process / thread to finish before disposing itself, whereas before that would happen as soon as the connection with the child process / thread was closed. |
||
handle.dispose(); | ||
} | ||
} finally { | ||
timeoutTimer.purge(); | ||
} | ||
} | ||
|
||
@Override | ||
public void close() throws IOException { | ||
if (!isOpen()) { | ||
return; | ||
} | ||
|
||
synchronized (clients) { | ||
clients.remove(this); | ||
} | ||
super.close(); | ||
if (handle != null) { | ||
if (!handle.getState().isFinal()) { | ||
LOG.log(Level.WARNING, "Lost connection to spark application."); | ||
handle.setState(SparkAppHandle.State.LOST); | ||
} | ||
handle.disconnect(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why we don't disconnect now? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See #20297 (review) |
||
|
||
synchronized (this) { | ||
super.close(); | ||
notifyAll(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See L239. |
||
} | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we should set the state to
KILLED
once thekill
method is called? Even the code below fails(throw exception), the state should still beKILLED
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1,I have the same question in last review. We should figure it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of the calls below should raise exceptions. Even the socket close is wrapped in a try..catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even the order doesn't matter, I think it's more conventional to set the state at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the comment I made in the previous PR applies. Closing the socket / killing the child can have other implications (like changing the state) and it's easier to reason about what happens if the desired state change happens first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok makes sense.