-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-3632] ConnectionManager can run out of receive threads with authentication on #2484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
QA tests have started for PR 2484 at commit
|
QA tests have finished for PR 2484 at commit
|
@rxin I think you had looked at the connectionManager, any comments? |
Hey sorry really busy this week. I will take a look at this next week. Added to my todo list. |
Thanks @rxin |
|
||
val message = if (securityMgr.isAuthenticationEnabled() && !isSaslComplete()) { | ||
// only send security messages until sasl is complete | ||
def getFirstSecurityMessage(pos: Int): Option[Message] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add @tailrec to this?
This is kind of scary change since I don't fully understand the mechanics of auth. Let me contact you offline about this .. |
Thanks @rxin. I updated based on your comments. |
QA tests have started for PR 2484 at commit
|
QA tests have finished for PR 2484 at commit
|
Test PASSed. |
val message = if (securityMgr.isAuthenticationEnabled() && !isSaslComplete()) { | ||
// only send security messages until sasl is complete | ||
@tailrec | ||
def getFirstSecurityMessage(pos: Int): Option[Message] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is probably more clear if you just write a while loop.
This change LGTM now I understand how it works. |
QA tests have started for PR 2484 at commit
|
QA tests have finished for PR 2484 at commit
|
Test FAILed. |
QA tests have started for PR 2484 at commit
|
QA tests have finished for PR 2484 at commit
|
Test PASSed. |
LGTM. Merging in master. Thanks. |
Should we backport this into 1.1.1? |
Yes. If it doesn't merge cleanly let me know. |
Unfortunately it doesn't merge. How large of a cluster does we need to get this issue to happen? |
I've seen it with 200 executors. |
If you turn authentication on and you are using a lot of executors. There is a chance that all the of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out.
To fix it, I got rid of the wait/notify and use a single outbox but only send security messages from it until authentication has completed.