
[SPARK-11865] [network] Avoid returning inactive client in TransportClientFactory. #9853


Closed · vanzin wants to merge 2 commits into master from vanzin:SPARK-11865

Conversation

@vanzin (Contributor) commented Nov 20, 2015

There's a very narrow race here where it would be possible for the timeout handler
to close a channel after the client factory verified that the channel was still
active. This change makes sure the client is marked as being recently in use so
that the timeout handler does not close it until a new timeout cycle elapses.
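A minimal sketch of the idea in Java (class and method names here are illustrative assumptions, not Spark's actual API): touch the cached client's last-use timestamp before handing it out, so the idle-timeout sweep treats it as active for at least one more full cycle.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; names are assumptions, not Spark's actual classes.
class PooledClient {
  private volatile boolean active = true;
  private final AtomicLong lastUsedNanos = new AtomicLong(System.nanoTime());

  boolean isActive() { return active; }

  // Record a use so the idle-timeout sweep treats this client as busy
  // for at least one more full timeout cycle.
  void markRecentlyUsed() { lastUsedNanos.set(System.nanoTime()); }

  long idleNanos() { return System.nanoTime() - lastUsedNanos.get(); }

  void close() { active = false; }
}

class ClientPoolSketch {
  private final ConcurrentHashMap<String, PooledClient> pool = new ConcurrentHashMap<>();
  private final long idleTimeoutNanos;

  ClientPoolSketch(long idleTimeoutNanos) { this.idleTimeoutNanos = idleTimeoutNanos; }

  PooledClient getClient(String remoteAddress) {
    PooledClient cached = pool.get(remoteAddress);
    if (cached != null && cached.isActive()) {
      // The fix: mark the client as recently used *before* returning it,
      // so a concurrently running timeout sweep sees a fresh timestamp.
      cached.markRecentlyUsed();
      return cached;
    }
    PooledClient fresh = new PooledClient();
    pool.put(remoteAddress, fresh);
    return fresh;
  }

  // Timeout handler: close clients that have been idle for a full cycle.
  void sweepIdleClients() {
    pool.forEach((addr, client) -> {
      if (client.idleNanos() > idleTimeoutNanos) {
        client.close();
        pool.remove(addr, client);
      }
    });
  }
}
```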

…lientFactory.

There's a very narrow race here where it would be possible for the timeout handler
to close a channel after the client factory verified that the channel was still
active. This change makes sure the client is marked as being recently in use so
that the timeout handler does not close it until a new timeout cycle elapses.
@SparkQA commented Nov 20, 2015

Test build #46377 has finished for PR 9853 at commit f836391.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Nov 20, 2015

Another mysterious failure where there are no failures... retest this please

@SparkQA commented Nov 20, 2015

Test build #46431 has finished for PR 9853 at commit f836391.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Nov 23, 2015

/cc @rxin @zsxwing; we should probably get this into 1.6.

@zsxwing (Member) commented Nov 23, 2015

Since ctx.close() is asynchronous, this one doesn't fix the race totally. Right?

Since ctx.close() is asynchronous, this ensures that threads checking
for the client being alive get the right result.
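A hedged sketch of that change (again with illustrative names, not Spark's actual code): since the channel close is asynchronous, flip the liveness flag synchronously before initiating it, so threads that check the client in the meantime already see it as dead.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; not Spark's actual classes or method names.
class TimedOutClient {
  private final AtomicBoolean alive = new AtomicBoolean(true);

  boolean isAlive() { return alive.get(); }

  // Called by the idle-timeout handler. The underlying channel close
  // (e.g. ctx.close() in Netty) completes asynchronously, so mark the
  // client dead first; threads racing with the close then observe
  // isAlive() == false instead of a half-closed client.
  void timeOut(Runnable initiateAsyncClose) {
    alive.set(false);
    initiateAsyncClose.run();
  }
}
```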
@vanzin (Contributor, Author) commented Nov 23, 2015

@zsxwing true. Fixed.

@zsxwing (Member) commented Nov 23, 2015

LGTM pending tests.

@SparkQA commented Nov 23, 2015

Test build #46541 has finished for PR 9853 at commit 83188ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Nov 23, 2015

Merging to master / 1.6.

asfgit pushed a commit that referenced this pull request Nov 23, 2015
…ientFactory.

There's a very narrow race here where it would be possible for the timeout handler
to close a channel after the client factory verified that the channel was still
active. This change makes sure the client is marked as being recently in use so
that the timeout handler does not close it until a new timeout cycle elapses.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9853 from vanzin/SPARK-11865.

(cherry picked from commit 7cfa4c6)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit closed this in 7cfa4c6 on Nov 23, 2015
@vanzin deleted the SPARK-11865 branch on November 23, 2015 at 21:55
mridulm pushed a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in #9853.
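As a rough illustration of this follow-up (reusing the hypothetical PooledClient/pool names from the sketch earlier on this page, not the actual patch): after touching the cached client, re-check that it is still active, and fall back to a fresh connection if a close raced in.

```java
// Illustrative sketch only; builds on the hypothetical PooledClient above.
PooledClient getClient(String remoteAddress) {
  PooledClient cached = pool.get(remoteAddress);
  if (cached != null) {
    // Touch first so the idle-timeout sweep backs off for another cycle...
    cached.markRecentlyUsed();
    // ...then re-check: if a close was already in flight, do not hand out
    // a connection that is about to become unusable.
    if (cached.isActive()) {
      return cached;
    }
    pool.remove(remoteAddress, cached);
  }
  PooledClient fresh = new PooledClient();
  pool.put(remoteAddress, fresh);
  return fresh;
}
```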

### Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
mridulm pushed a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in #9853.

### Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache#9853.

### Why are the changes needed?

Without this change, some of the new tests added in apache#42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in apache#42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
(cherry picked from commit 85bf705)
SteNicholas pushed a commit to apache/celeborn that referenced this pull request Mar 20, 2024
### What changes were proposed in this pull request?

Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

### Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>