[SPARK-23020][CORE] Fix races in launcher code, test. #20297

Closed
vanzin wants to merge 2 commits into apache:master from vanzin:SPARK-23020

Conversation

vanzin
Contributor

@vanzin vanzin commented Jan 17, 2018

The race in the code is because the handle might update
its state to the wrong state if the connection handling
thread is still processing incoming data; so the handle
needs to wait for the connection to finish up before
checking the final state.
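
As a minimal sketch (illustrative names only, not the actual launcher classes), the race and the fix look roughly like this:

```java
// Hypothetical sketch of the race; these names are illustrative, not
// Spark's actual launcher classes.
final class HandleSketch {
  private volatile String state = "RUNNING";

  // Runs on the connection thread as messages arrive from the child.
  void onMessage(String newState) {
    state = newState; // e.g. "FINISHED"
  }

  // Runs on the caller's thread. Without joining the connection thread,
  // this can read a stale value while a final-state message is still
  // being processed, and then record the wrong terminal state.
  String finalState(Thread connectionThread) throws InterruptedException {
    connectionThread.join(); // the fix: let the connection drain first
    return state;
  }
}
```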

The race in the test is because when waiting for a handle
to reach a final state, the waitFor() method needs to wait
until all handle state is updated (which also includes
waiting for the connection thread above to finish).
Otherwise, waitFor() may return too early, which would cause
a bunch of different races (like the listener not being yet
notified of the state change, or being in the middle of
being notified, or the handle not being properly disposed
and causing postChecks() to assert).
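
In sketch form (assumed names and a plain monitor wait, not the suite's actual waitFor() implementation), the test-side fix amounts to blocking until disposal:

```java
// Hypothetical sketch: waitFor() returns only once the handle is fully
// disposed, not merely once a final state value is visible.
final class WaitSketch {
  private boolean disposed = false;

  synchronized void dispose() {
    disposed = true;
    notifyAll(); // wake anyone blocked in waitFor()
  }

  synchronized void waitFor(long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    // Waiting on disposal, rather than on the state alone, guarantees that
    // listeners have been notified and cleanup has run before we return.
    while (!disposed) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        throw new IllegalStateException("handle not disposed in time");
      }
      wait(remaining);
    }
  }
}
```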

On top of that I found, by code inspection, a couple of
potential races that could make a handle end up in the
wrong state when being killed.

The original version of this fix introduced the flipped
version of the first race described above; the connection
closing might override the handle state before the
handle might have a chance to do cleanup. The fix there
is to only dispose of the handle from the connection
when there is an error, and let the handle dispose
itself in the normal case.
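
In sketch form (hypothetical names, not the patch itself), that split of responsibilities looks like this:

```java
// Hypothetical sketch: the connection disposes the handle only on error;
// on a clean close the handle disposes itself after its own cleanup.
final class ConnectionSketch {
  interface Handle { // stand-in for the app handle
    void setLost();
    void dispose();
  }

  private final Handle handle;

  ConnectionSketch(Handle handle) {
    this.handle = handle;
  }

  void onClosed(boolean error) {
    if (error) {
      // Abnormal termination: no final state will arrive, so the
      // connection must mark the app lost and dispose the handle.
      handle.setLost();
      handle.dispose();
    }
    // Normal case: do nothing here. Disposing from this thread could
    // override the handle's state before its own cleanup finishes,
    // which is exactly the flipped race described above.
  }
}
```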

The fix also surfaced a bug in YarnClusterSuite: the code was
checking for a file in the classpath that was not expected to
be there in client mode. Because of the issues above, the error
was not propagating correctly and the (buggy) test was
incorrectly passing.

Tested by running the existing unit tests a lot (and not
seeing the errors I was seeing before).

```diff
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException {
         timeout.cancel();
       }
       close();
+      if (handle != null) {
```
Contributor Author

@vanzin vanzin Jan 17, 2018

This is the fix for the new race described in the summary (the code is moved here from below).

This changes behavior slightly: the handle now waits for the child process / thread to finish before disposing itself, whereas before that would happen as soon as the connection with the child process / thread was closed.
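
Roughly, as a sketch with illustrative names rather than the patch's own code:

```java
// Hypothetical sketch of the new ordering: dispose only after the child
// has exited, not as soon as the connection to it closes.
final class MonitorSketch {
  void monitor(Process childProc, Runnable dispose) throws InterruptedException {
    childProc.waitFor(); // wait for the child process / thread to finish...
    dispose.run();       // ...and only then let the handle dispose itself
  }
}
```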

```diff
@@ -381,7 +381,9 @@ private object YarnClusterDriver extends Logging with Matchers {

     // Verify that the config archive is correctly placed in the classpath of all containers.
     val confFile = "/" + Client.SPARK_CONF_FILE
-    assert(getClass().getResource(confFile) != null)
+    if (conf.getOption(SparkLauncher.DEPLOY_MODE) == Some("cluster")) {
```
Contributor Author

This is the bug in the YARN tests that the fix uncovered.

@vanzin
Contributor Author

vanzin commented Jan 17, 2018

@sameeragarwal @cloud-fan @gengliangwang since you guys looked at the previous PR.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86289 has finished for PR 20297 at commit 8bde21a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jan 18, 2018

retest this please

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86351 has finished for PR 20297 at commit 8bde21a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
-    if (childProc.isAlive()) {
-      childProc.destroyForcibly();
+    if (!isDisposed()) {
+      setState(State.KILLED);
```
Contributor

So we should set the state to KILLED as soon as the kill method is called? Even if the code below fails (throws an exception), should the state still be KILLED?

Member

+1, I had the same question in the last review. We should figure it out.

Contributor Author

None of the calls below should raise exceptions. Even the socket close is wrapped in a try..catch.

Contributor

Even if the order doesn't matter, I think it's more conventional to set the state at the end.

Contributor Author

Then the comment I made in the previous PR applies. Closing the socket / killing the child can have other implications (like changing the state) and it's easier to reason about what happens if the desired state change happens first.
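
A sketch of that ordering argument (illustrative names, not the actual kill() implementation):

```java
// Hypothetical sketch: setting KILLED before the side-effectful cleanup
// means nothing that cleanup triggers can overwrite the terminal state.
final class KillSketch {
  enum State { RUNNING, KILLED }

  private volatile State state = State.RUNNING;
  private Process childProc;

  synchronized void kill() {
    state = State.KILLED; // the desired state change happens first
    // Closing the socket / killing the child can have other implications
    // (like changing the state), so they run after the state is settled.
    if (childProc != null && childProc.isAlive()) {
      childProc.destroyForcibly();
    }
    childProc = null;
  }
}
```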

Contributor

ok makes sense.

```diff
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws IOException {
   }

   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
```
Contributor

Do we still need to change this method?

Contributor Author

We never needed to change it, but the extra code wasn't doing anything useful, so I chose the simpler version.

```java
  LOG.log(Level.WARNING, "Lost connection to spark application.");
  handle.setState(SparkAppHandle.State.LOST);
}
handle.disconnect();
```
Contributor

Why don't we disconnect now?

Contributor Author


```java
synchronized (this) {
  super.close();
  notifyAll();
```
Contributor

Why notifyAll()? Who might be waiting?

Contributor Author

See L239.
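
For context, a minimal sketch of the wait/notify pairing being referenced (illustrative names):

```java
// Hypothetical sketch: close() flips a flag and calls notifyAll() so any
// thread blocked in a monitor wait() loop re-checks its exit condition.
final class CloseSketch {
  private boolean closed = false;

  synchronized void close() {
    closed = true;
    notifyAll(); // wakes threads blocked in awaitClosed()
  }

  synchronized void awaitClosed() throws InterruptedException {
    while (!closed) {
      wait();
    }
  }
}
```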

```diff
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws IOException {
   }

   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
     if (!closed) {
```
Member

`closed` => `isOpen`?

@sameeragarwal
Member

I've kicked off a bunch of test builds in parallel to further rule out any flakiness on Jenkins. Thanks!

@SparkQA

SparkQA commented Jan 19, 2018

Test build #4061 has finished for PR 20297 at commit 8bde21a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2018

Test build #4060 has finished for PR 20297 at commit 8bde21a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2018

Test build #4062 has finished for PR 20297 at commit 8bde21a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jan 19, 2018

I kicked off an extra couple of builds aside from the one that should auto-trigger.

@SparkQA

SparkQA commented Jan 19, 2018

Test build #86395 has finished for PR 20297 at commit 95bac27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2018

Test build #4066 has finished for PR 20297 at commit 95bac27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2018

Test build #4067 has finished for PR 20297 at commit 95bac27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Jan 20, 2018

retest this please

@SparkQA

SparkQA commented Jan 20, 2018

Test build #86405 has finished for PR 20297 at commit 95bac27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2018

Test build #4068 has finished for PR 20297 at commit 95bac27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

Member

@gengliangwang gengliangwang left a comment

LGTM

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 22, 2018

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #20297 from vanzin/SPARK-23020.

(cherry picked from commit ec22897)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in ec22897 Jan 22, 2018
@vanzin vanzin deleted the SPARK-23020 branch January 22, 2018 21:37
asfgit pushed a commit that referenced this pull request Jan 24, 2018
## What changes were proposed in this pull request?

This is a follow-up of #20297, which broke lint-java checks.
This PR fixes the lint-java issues.

```
[ERROR] src/test/java/org/apache/spark/launcher/BaseSuite.java:[21,8] (imports) UnusedImports: Unused import - java.util.concurrent.TimeUnit.
[ERROR] src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java:[27,8] (imports) UnusedImports: Unused import - java.util.concurrent.TimeUnit.
```

## How was this patch tested?

Checked manually in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20376 from ueshin/issues/SPARK-23020/fup1.

(cherry picked from commit 8c273b4)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 24, 2018