-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-45375][CORE] Mark connection as timedOut in TransportClient.close #43162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we should backport this?
|
@srowen I don't know about spark OSS guidelines for when something should be backported or not. It's been in the code this way for many years and no one's complained so I think we can be fine with a backport. It's only with SSL support that this starts to become a problem, and since that functionality isn't being backported I assume we are fine? |
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in #9853. ### Why are the changes needed? Without this change, some of the new tests added in #42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in #42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in #9853. ### Why are the changes needed? Without this change, some of the new tests added in #42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in #42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in #9853. ### Why are the changes needed? Without this change, some of the new tests added in #42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in #42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
|
Merged to master, 3.5, 3.4, 3.3 |
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache#9853. ### Why are the changes needed? Without this change, some of the new tests added in apache#42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in apache#42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> (cherry picked from commit 85bf705)
### What changes were proposed in this pull request? Importing details from apache/spark#43162: -- This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache/spark#9853. -- ### Why are the changes needed? We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there. (I am yet to file the overall TLS support jira yet, but this is enabling work). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #2400 from mridulm/add-SPARK-45375. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request? Importing details from apache/spark#43162: -- This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache/spark#9853. -- ### Why are the changes needed? We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there. (I am yet to file the overall TLS support jira yet, but this is enabling work). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #2400 from mridulm/add-SPARK-45375. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: SteNicholas <programgeek@163.com> (cherry picked from commit 21d5698) Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request? Importing details from apache/spark#43162: -- This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache/spark#9853. -- ### Why are the changes needed? We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there. (I am yet to file the overall TLS support jira yet, but this is enabling work). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes apache#2400 from mridulm/add-SPARK-45375. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: SteNicholas <programgeek@163.com> (cherry picked from commit 21d5698) Signed-off-by: SteNicholas <programgeek@163.com>
Importing details from apache/spark#43162: -- This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache/spark#9853. -- We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there. (I am yet to file the overall TLS support jira yet, but this is enabling work). No Unit tests Closes apache#2400 from mridulm/add-SPARK-45375. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: SteNicholas <programgeek@163.com> (cherry picked from commit 21d5698) Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache#9853. ### Why are the changes needed? Without this change, some of the new tests added in apache#42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in apache#42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
What changes were proposed in this pull request?
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.
This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.
Looking at the history of the code I believe this was an oversight in #9853.
Why are the changes needed?
Without this change, some of the new tests added in #42685 would fail
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail
Was this patch authored or co-authored using generative AI tooling?
No