Skip to content

Conversation

@hasnain-db
Copy link
Contributor

@hasnain-db hasnain-db commented Sep 28, 2023

What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in #9853.

Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Sep 28, 2023
Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we should backport this?

@hasnain-db
Copy link
Contributor Author

@srowen I don't know about spark OSS guidelines for when something should be backported or not. It's been in the code this way for many years and no one's complained so I think we can be fine with a backport. It's only with SSL support that this starts to become a problem, and since that functionality isn't being backported I assume we are fine?

@mridulm mridulm closed this in 2a88fea Sep 28, 2023
mridulm pushed a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in #9853.

### Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
mridulm pushed a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in #9853.

### Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
mridulm pushed a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in #9853.

### Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
@mridulm
Copy link
Contributor

mridulm commented Sep 28, 2023

Merged to master, 3.5, 3.4, 3.3
Thanks for fixing this @hasnain-db !
Thanks for the reviews @JoshRosen, @srowen :-)

@hasnain-db hasnain-db changed the title [SPARK-44937][CORE] Mark connection as timedOut in TransportClient.close [SPARK-45375][CORE] Mark connection as timedOut in TransportClient.close Sep 28, 2023
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache#9853.

### Why are the changes needed?

Without this change, some of the new tests added in apache#42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in apache#42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
(cherry picked from commit 85bf705)
SteNicholas pushed a commit to apache/celeborn that referenced this pull request Mar 20, 2024
### What changes were proposed in this pull request?

Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

### Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
SteNicholas pushed a commit to apache/celeborn that referenced this pull request Mar 20, 2024
### What changes were proposed in this pull request?

Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

### Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
(cherry picked from commit 21d5698)
Signed-off-by: SteNicholas <programgeek@163.com>
cfmcgrady pushed a commit to cfmcgrady/incubator-celeborn that referenced this pull request Aug 21, 2025
### What changes were proposed in this pull request?

Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

### Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes apache#2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
(cherry picked from commit 21d5698)
Signed-off-by: SteNicholas <programgeek@163.com>
cfmcgrady pushed a commit to cfmcgrady/incubator-celeborn that referenced this pull request Aug 21, 2025
Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

No

Unit tests

Closes apache#2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
(cherry picked from commit 21d5698)
Signed-off-by: SteNicholas <programgeek@163.com>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache#9853.

### Why are the changes needed?

Without this change, some of the new tests added in apache#42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in apache#42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants