-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-11865] [network] Avoid returning inactive client in TransportClientFactory. #9853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…lientFactory. There's a very narrow race here where it would be possible for the timeout handler to close a channel after the client factory verified that the channel was still active. This change makes sure the client is marked as being recently in use so that the timeout handler does not close it until a new timeout cycle elapses.
Test build #46377 has finished for PR 9853 at commit
|
Another mysterious failure where there are no failures... retest this please |
Test build #46431 has finished for PR 9853 at commit
|
Since |
Since ctx.close() is asynchronous, this ensures that threads checking for the client being alive get the right result.
@zsxwing true. Fixed. |
LGTM pending tests. |
Test build #46541 has finished for PR 9853 at commit
|
Merging to master / 1.6. |
asfgit
pushed a commit
that referenced
this pull request
Nov 23, 2015
…ientFactory. There's a very narrow race here where it would be possible for the timeout handler to close a channel after the client factory verified that the channel was still active. This change makes sure the client is marked as being recently in use so that the timeout handler does not close it until a new timeout cycle elapses. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9853 from vanzin/SPARK-11865. (cherry picked from commit 7cfa4c6) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
mridulm
pushed a commit
that referenced
this pull request
Sep 28, 2023
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in #9853. ### Why are the changes needed? Without this change, some of the new tests added in #42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in #42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
mridulm
pushed a commit
that referenced
this pull request
Sep 28, 2023
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in #9853. ### Why are the changes needed? Without this change, some of the new tests added in #42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in #42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
mridulm
pushed a commit
that referenced
this pull request
Sep 28, 2023
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in #9853. ### Why are the changes needed? Without this change, some of the new tests added in #42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in #42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
mridulm
pushed a commit
that referenced
this pull request
Sep 28, 2023
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in #9853. ### Why are the changes needed? Without this change, some of the new tests added in #42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in #42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
viirya
pushed a commit
to viirya/spark-1
that referenced
this pull request
Oct 19, 2023
### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache#9853. ### Why are the changes needed? Without this change, some of the new tests added in apache#42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in apache#42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88fea) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> (cherry picked from commit 85bf705)
SteNicholas
pushed a commit
to apache/celeborn
that referenced
this pull request
Mar 20, 2024
### What changes were proposed in this pull request? Importing details from apache/spark#43162: -- This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache/spark#9853. -- ### Why are the changes needed? We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there. (I am yet to file the overall TLS support jira yet, but this is enabling work). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #2400 from mridulm/add-SPARK-45375. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: SteNicholas <programgeek@163.com>
SteNicholas
pushed a commit
to apache/celeborn
that referenced
this pull request
Mar 20, 2024
### What changes were proposed in this pull request? Importing details from apache/spark#43162: -- This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in apache/spark#9853. -- ### Why are the changes needed? We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there. (I am yet to file the overall TLS support jira yet, but this is enabling work). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #2400 from mridulm/add-SPARK-45375. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: SteNicholas <programgeek@163.com> (cherry picked from commit 21d5698) Signed-off-by: SteNicholas <programgeek@163.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There's a very narrow race here where it would be possible for the timeout handler
to close a channel after the client factory verified that the channel was still
active. This change makes sure the client is marked as being recently in use so
that the timeout handler does not close it until a new timeout cycle elapses.