
[SPARK-11865] [network] Avoid returning inactive client in TransportClientFactory. #9853


Closed · vanzin wants to merge 2 commits into master from vanzin:SPARK-11865

Conversation

@vanzin (Contributor) commented Nov 20, 2015

There's a very narrow race here where it would be possible for the timeout handler
to close a channel after the client factory verified that the channel was still
active. This change makes sure the client is marked as being recently in use so
that the timeout handler does not close it until a new timeout cycle elapses.
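A minimal sketch of the idea in Java (class and method names here are illustrative assumptions, not Spark's actual API): touch the cached client's last-use timestamp before handing it out, so the idle-timeout sweep treats it as active for at least one more full cycle.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; names are assumptions, not Spark's actual classes.
class PooledClient {
  private volatile boolean active = true;
  private final AtomicLong lastUsedNanos = new AtomicLong(System.nanoTime());

  boolean isActive() { return active; }

  // Record a use so the idle-timeout sweep treats this client as busy
  // for at least one more full timeout cycle.
  void markRecentlyUsed() { lastUsedNanos.set(System.nanoTime()); }

  long idleNanos() { return System.nanoTime() - lastUsedNanos.get(); }

  void close() { active = false; }
}

class ClientPoolSketch {
  private final ConcurrentHashMap<String, PooledClient> pool = new ConcurrentHashMap<>();
  private final long idleTimeoutNanos;

  ClientPoolSketch(long idleTimeoutNanos) { this.idleTimeoutNanos = idleTimeoutNanos; }

  PooledClient getClient(String remoteAddress) {
    PooledClient cached = pool.get(remoteAddress);
    if (cached != null && cached.isActive()) {
      // The fix: mark the client as recently used *before* returning it,
      // so a concurrently running timeout sweep sees a fresh timestamp.
      cached.markRecentlyUsed();
      return cached;
    }
    PooledClient fresh = new PooledClient();
    pool.put(remoteAddress, fresh);
    return fresh;
  }

  // Timeout handler: close clients that have been idle for a full cycle.
  void sweepIdleClients() {
    pool.forEach((addr, client) -> {
      if (client.idleNanos() > idleTimeoutNanos) {
        client.close();
        pool.remove(addr, client);
      }
    });
  }
}
```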

…lientFactory.

There's a very narrow race here where it would be possible for the timeout handler
to close a channel after the client factory verified that the channel was still
active. This change makes sure the client is marked as being recently in use so
that the timeout handler does not close it until a new timeout cycle elapses.
@SparkQA commented Nov 20, 2015

Test build #46377 has finished for PR 9853 at commit f836391.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Nov 20, 2015

Another mysterious failure where there are no failures... retest this please

@SparkQA commented Nov 20, 2015

Test build #46431 has finished for PR 9853 at commit f836391.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Nov 23, 2015

/cc @rxin @zsxwing; we should probably get this into 1.6.

@zsxwing (Member) commented Nov 23, 2015

Since ctx.close() is asynchronous, this one doesn't fix the race totally. Right?

Since ctx.close() is asynchronous, this ensures that threads checking
for the client being alive get the right result.
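A hedged sketch of that change (again with illustrative names, not Spark's actual code): since the channel close is asynchronous, flip the liveness flag synchronously before initiating it, so threads that check the client in the meantime already see it as dead.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; not Spark's actual classes or method names.
class TimedOutClient {
  private final AtomicBoolean alive = new AtomicBoolean(true);

  boolean isAlive() { return alive.get(); }

  // Called by the idle-timeout handler. The underlying channel close
  // (e.g. ctx.close() in Netty) completes asynchronously, so mark the
  // client dead first; threads racing with the close then observe
  // isAlive() == false instead of a half-closed client.
  void timeOut(Runnable initiateAsyncClose) {
    alive.set(false);
    initiateAsyncClose.run();
  }
}
```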
@vanzin (Contributor, Author) commented Nov 23, 2015

@zsxwing true. Fixed.

@zsxwing (Member) commented Nov 23, 2015

LGTM pending tests.

@SparkQA commented Nov 23, 2015

Test build #46541 has finished for PR 9853 at commit 83188ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Nov 23, 2015

Merging to master / 1.6.

asfgit pushed a commit that referenced this pull request Nov 23, 2015
…ientFactory.

There's a very narrow race here where it would be possible for the timeout handler
to close a channel after the client factory verified that the channel was still
active. This change makes sure the client is marked as being recently in use so
that the timeout handler does not close it until a new timeout cycle elapses.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9853 from vanzin/SPARK-11865.

(cherry picked from commit 7cfa4c6)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit closed this in 7cfa4c6 on Nov 23, 2015
@vanzin deleted the SPARK-11865 branch on November 23, 2015 at 21:55
mridulm pushed a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in #9853.
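As a rough illustration of this follow-up (reusing the hypothetical PooledClient/pool names from the sketch earlier on this page, not the actual patch): after touching the cached client, re-check that it is still active, and fall back to a fresh connection if a close raced in.

```java
// Illustrative sketch only; builds on the hypothetical PooledClient above.
PooledClient getClient(String remoteAddress) {
  PooledClient cached = pool.get(remoteAddress);
  if (cached != null) {
    // Touch first so the idle-timeout sweep backs off for another cycle...
    cached.markRecentlyUsed();
    // ...then re-check: if a close was already in flight, do not hand out
    // a connection that is about to become unusable.
    if (cached.isActive()) {
      return cached;
    }
    pool.remove(remoteAddress, cached);
  }
  PooledClient fresh = new PooledClient();
  pool.put(remoteAddress, fresh);
  return fresh;
}
```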

### Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
mridulm pushed a commit that referenced this pull request Sep 28, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in #9853.

### Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in #42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache#9853.

### Why are the changes needed?

Without this change, some of the new tests added in apache#42685 would fail

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI.
Without this change, some of the new tests added in apache#42685 fail

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a88fea)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
(cherry picked from commit 85bf705)
SteNicholas pushed a commit to apache/celeborn that referenced this pull request Mar 20, 2024
### What changes were proposed in this pull request?

Importing details from apache/spark#43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in apache/spark#9853.

--

### Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I am yet to file the overall TLS support jira yet, but this is enabling work).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>