Skip to content

[SPARK-3632] ConnectionManager can run out of receive threads with authentication on #2484

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

tgravescs
Copy link
Contributor

If you turn authentication on and you are using a lot of executors. There is a chance that all the of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out.

To fix it, I got rid of the wait/notify and use a single outbox but only send security messages from it until authentication has completed.

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

QA tests have started for PR 2484 at commit 081b765.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2484 at commit 081b765.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor Author

@rxin I think you had looked at the connectionManager, any comments?

@rxin
Copy link
Contributor

rxin commented Sep 25, 2014

Hey sorry really busy this week. I will take a look at this next week. Added to my todo list.

@tgravescs
Copy link
Contributor Author

Thanks @rxin


val message = if (securityMgr.isAuthenticationEnabled() && !isSaslComplete()) {
// only send security messages until sasl is complete
def getFirstSecurityMessage(pos: Int): Option[Message] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add @tailrec to this?

@rxin
Copy link
Contributor

rxin commented Sep 27, 2014

This is kind of scary change since I don't fully understand the mechanics of auth. Let me contact you offline about this ..

@tgravescs
Copy link
Contributor Author

Thanks @rxin. I updated based on your comments.

@SparkQA
Copy link

SparkQA commented Sep 30, 2014

QA tests have started for PR 2484 at commit d6d4175.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 30, 2014

QA tests have finished for PR 2484 at commit d6d4175.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21039/

val message = if (securityMgr.isAuthenticationEnabled() && !isSaslComplete()) {
// only send security messages until sasl is complete
@tailrec
def getFirstSecurityMessage(pos: Int): Option[Message] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is probably more clear if you just write a while loop.

@rxin
Copy link
Contributor

rxin commented Sep 30, 2014

This change LGTM now I understand how it works.

@SparkQA
Copy link

SparkQA commented Oct 2, 2014

QA tests have started for PR 2484 at commit b6bc80b.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 2, 2014

QA tests have finished for PR 2484 at commit b6bc80b.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21216/

@SparkQA
Copy link

SparkQA commented Oct 2, 2014

QA tests have started for PR 2484 at commit a0a961d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 2, 2014

QA tests have finished for PR 2484 at commit a0a961d.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21217/

@rxin
Copy link
Contributor

rxin commented Oct 2, 2014

LGTM. Merging in master. Thanks.

@rxin
Copy link
Contributor

rxin commented Oct 2, 2014

Should we backport this into 1.1.1?

@asfgit asfgit closed this in 127e97b Oct 2, 2014
@tgravescs
Copy link
Contributor Author

Yes. If it doesn't merge cleanly let me know.

@rxin
Copy link
Contributor

rxin commented Oct 3, 2014

Unfortunately it doesn't merge. How large of a cluster does we need to get this issue to happen?

@tgravescs
Copy link
Contributor Author

I've seen it with 200 executors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants