[SPARK-23020][core] Fix races in launcher code, test. #20223
Conversation
The race in the code is because the handle might update its state to the wrong state if the connection handling thread is still processing incoming data; so the handle needs to wait for the connection to finish up before checking the final state.

The race in the test is because when waiting for a handle to reach a final state, the waitFor() method needs to wait until all handle state is updated (which also includes waiting for the connection thread above to finish). Otherwise, waitFor() may return too early, which would cause a bunch of different races (like the listener not being yet notified of the state change, or being in the middle of being notified, or the handle not being properly disposed and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of potential races that could make a handle end up in the wrong state when being killed.

Tested by running the existing unit tests a lot (and not seeing the errors I was seeing before).
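As a rough illustration of the fix being described (hypothetical class and method names, not the actual launcher classes), waiting for a final state amounts to joining the connection thread before reading the state, so pending incoming data cannot change it afterwards:

```java
// Hypothetical sketch: the handle joins the connection thread before
// reporting its final state, so data still being processed by that
// thread cannot flip the state after waitForFinalState() returns.
class SketchHandle {
    private final Thread connectionThread;
    private volatile String state = "RUNNING";

    SketchHandle(Thread connectionThread) {
        this.connectionThread = connectionThread;
    }

    void setState(String s) {
        state = s;
    }

    String waitForFinalState() throws InterruptedException {
        // Wait for the connection to finish handling incoming data
        // before trusting the state as final.
        connectionThread.join();
        return state;
    }
}
```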
Test build #85929 has finished for PR 20223 at commit
Test build #85937 has finished for PR 20223 at commit
```diff
@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
     return connection;
   }

-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
     return disposed;
```
can we simply mark `disposed` as `transient`?
The `synchronized` is not protecting the variable, but all the actions that happen around the code that sets the variable. So this method returning guarantees that either all the disposal tasks have run or none have.

That being said, `ServerConnection.close` is not really holding the handle lock as it should, so I might have to make some changes here.

These are not really contended locks. In the worst case there will be 2 threads looking at them, so trying to come up with a finer-grained locking scheme here sounds like more trouble than it's worth.
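A minimal sketch of the point above (hypothetical class, not the actual launcher code): because every disposal task runs while holding the handle lock, a `synchronized` `isDisposed()` returning `true` implies all of those tasks completed, and `false` implies none started.

```java
// Hypothetical sketch: the lock makes the disposal work atomic with
// the flag, so observers never see a half-disposed handle.
class SketchDisposable {
    private boolean disposed;           // guarded by "this"
    private boolean resourcesReleased;  // stands in for the disposal tasks

    synchronized boolean isDisposed() {
        return disposed;
    }

    synchronized void dispose() {
        if (!disposed) {
            resourcesReleased = true;   // disposal work, atomic with the flag
            disposed = true;
        }
    }

    synchronized boolean consistent() {
        // Either all the disposal tasks have run, or none have.
        return disposed == resourcesReleased;
    }
}
```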
```java
    if (!closed) {
      synchronized (this) {
```
what's wrong with this? It's a classic "double-checked locking" in Java.
The sub-class needs coarser-grained locking so this wasn't really doing anything useful.
```diff
@@ -137,7 +139,9 @@ public void testInProcessLauncher() throws Exception {
     // Here DAGScheduler is stopped, while SparkContext.clearActiveContext may not be called yet.
     // Wait for a reasonable amount of time to avoid creating two active SparkContext in JVM.
```
why is 500 ms not a reasonable waiting time anymore?
Here Vanzin is waiting up to 5 seconds for the active SparkContext to be cleared. It is a better solution.
It's not that 500ms is not reasonable, it's that this code will return more quickly when the context shuts down in under 500ms, have a longer grace period in case it ever takes more than 500ms, and fail the test here if the context does not shut down.
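The waiting strategy described above can be sketched as follows (hypothetical helper, not Spark's test utility): poll a condition up to a deadline, returning as soon as it holds instead of always sleeping a fixed 500 ms, and letting the caller fail the test on timeout.

```java
import java.util.function.BooleanSupplier;

// Sketch of a bounded poll: fast when the condition holds early,
// with a longer grace period than a fixed sleep when it does not.
class Eventually {
    static boolean waitFor(BooleanSupplier cond, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!cond.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;  // caller can fail the test here
            }
            Thread.sleep(10);  // back off briefly before re-checking
        }
        return true;
    }
}
```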
There are a lot of `synchronized` blocks, and some of them seem unnecessary.
```java
    return disposed;
  }

  synchronized void markDisposed() {
```
why do we need `synchronized` here?
```diff
@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
     return connection;
   }

-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
```
why do we need `synchronized` here?
```diff
@@ -71,15 +71,16 @@ public void stop() {
   @Override
   public synchronized void disconnect() {
     if (!disposed) {
```
`if (!isDisposed())`?
Test build #85984 has finished for PR 20223 at commit
Test build #85986 has finished for PR 20223 at commit
```java
    if (childProc.isAlive()) {
      childProc.destroyForcibly();
    if (!isDisposed()) {
      setState(State.KILLED);
```
Why put `setState` first? It should be OK, but in the previous code some of the `setState` calls happen at the end.
I have the same question, but it should be OK as we always synchronize when touching the state.
I changed this because the old disconnect code, at least, might change the handle's state. It's easier to put this call first and not have to reason about whether that will happen.
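The ordering argument can be illustrated with a hypothetical sketch (not the actual launcher code): set the final state before running the disconnect path, so whatever that path does to the state cannot override `KILLED`.

```java
// Hypothetical sketch: kill() sets KILLED first, then disconnects.
// Once disposal has happened, further setState calls are ignored,
// so the final state cannot be overwritten by the disconnect path.
class SketchKill {
    private String state = "RUNNING";
    private boolean disposed;

    synchronized void setState(String s) {
        if (!disposed) {
            state = s;
        }
    }

    synchronized void kill() {
        setState("KILLED");  // first, so later steps cannot override it
        disconnect();        // might touch state in the old code
    }

    private void disconnect() {
        disposed = true;     // disposal: state changes are frozen from here
    }

    synchronized String state() { return state; }
}
```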
Retest this please.
Test build #86142 has finished for PR 20223 at commit
LGTM

merging to master/2.3. Thanks!
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #20223 from vanzin/SPARK-23020.
(cherry picked from commit 66217da)
Signed-off-by: Sameer Agarwal <sameerag@apache.org>
This patch unfortunately broke
We didn't run the yarn test with this PR due to https://issues.apache.org/jira/browse/SPARK-10300. This indicates that it might be a bad idea to skip the yarn tests when we don't change yarn code: in this case, we touched code in spark core and caused a failure in the yarn test.
At this point in time it might be a good idea to revert SPARK-10300, since the YARN tests don't really add that much to the test run time compared to the others anymore...