Skip to content

[SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR) #4944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

JoshRosen
Copy link
Contributor

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See SPARK-6209 for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases.

This PR is an extended version of #4935 and adds a regression test.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28378 has started for PR 4944 at commit 961c284.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28378 has finished for PR 4944 at commit 961c284.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28378/
Test PASSed.

@JoshRosen
Copy link
Contributor Author

@mccheah
Copy link
Contributor

mccheah commented Mar 11, 2015

This looks like the right thing to do. Thanks for submitting this!

@ash211 @mingyukim care to take a look?

connection.setConnectTimeout(httpUrlConnectionTimeoutMillis)
connection.setReadTimeout(httpUrlConnectionTimeoutMillis)
}
connection.connect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too familiar with this API - but can this connection object ever leak even if the returned input-stream is closed? Or, can an exception occur before returning from this method and the connection leak that way? I'm guessing these aren't possible but just wanted to educate myself as to how these scenarios won't happen =)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the HttpURLConnection Javadoc:

Each HttpURLConnection instance is used to make a single request but the underlying network connection to the HTTP server may be transparently shared by other instances. Calling the close() methods on the InputStream or OutputStream of an HttpURLConnection after a request may free network resources associated with this instance but has no effect on any shared persistent connection. Calling the disconnect() method may close the underlying socket if a persistent connection is otherwise idle at that time.

This can cause problems in practice: https://scotte.github.io/2015/01/httpurlconnection-socket-leak/.

Actually, that blog post reminds me that I should probably call getErrorStream and read that stream in the next error-handling block prior to calling disconnect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On closer inspection, it looks like you need to consume the errorStream if you want the underlying connection to be eligible for re-use, but we may not necessarily care about that here: http://docs.oracle.com/javase/6/docs/technotes/guides/net/http-keepalive.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about if an exception throws before we return the input stream? In that case the input stream is not closed but perhaps the connection is still open?

I'm basically asking, should we have a catch block for Throwable in this method after we instantiate the connection to close the connection if we run into an Exception before returning the input stream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the connect() call is unlikely to fail in ways that leak resources, but suppose it wouldn't hurt to add another try-catch block just to be safe. I can do this tomorrow.

@SparkQA
Copy link

SparkQA commented Mar 15, 2015

Test build #28633 has started for PR 4944 at commit e0e3c25.

  • This patch merges cleanly.

@JoshRosen
Copy link
Contributor Author

I've updated this with a new commit that adds another layer of try-catch blocks to handle errors when reading the status code. I also made a slight change to how we handle 404 errors: we now attempt to close the error stream and only call disconnect() as a last-resort in case the error stream cleanup fails. This allows us to continue to benefit from HTTP keep-alive / persistent connections even when errors occur (I verified this behavior using Wireshark).

@SparkQA
Copy link

SparkQA commented Mar 15, 2015

Test build #28633 has finished for PR 4944 at commit e0e3c25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28633/
Test PASSed.

@JoshRosen
Copy link
Contributor Author

I think this should be good to merge, but I'm going to leave it open for a little while longer in order to allow others to comment.

@JoshRosen
Copy link
Contributor Author

Ping @andrewor14 (as requested).

@mccheah
Copy link
Contributor

mccheah commented Mar 24, 2015

Yeah it looks okay to me but I also would feel more comfortable if a second core committer took a look.

@andrewor14
Copy link
Contributor

LGTM

@andrewor14
Copy link
Contributor

This goes into master and 1.3.

asfgit pushed a commit that referenced this pull request Mar 24, 2015
…g to load classes (master branch PR)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

This PR is an extended version of #4935 and adds a regression test.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits:

e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream
961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail
7ee2261 [Josh Rosen] Add a failing regression test
e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader

(cherry picked from commit 7215aa7)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in 7215aa7 Mar 24, 2015
@JoshRosen
Copy link
Contributor Author

Thanks for the review. I'll open a separate PR for branch-1.2 (1.2.2).

JoshRosen added a commit to JoshRosen/spark that referenced this pull request Mar 24, 2015
…g to load classes (master branch PR)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

This PR is an extended version of apache#4935 and adds a regression test.

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits:

e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream
961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail
7ee2261 [Josh Rosen] Add a failing regression test
e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader

(cherry picked from commit 7215aa7)
Signed-off-by: Andrew Or <andrew@databricks.com>

Conflicts:
	repl/pom.xml
	repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala
@JoshRosen
Copy link
Contributor Author

I've opened #5174 for branch-1.2 (1.2.2).

asfgit pushed a commit that referenced this pull request Apr 5, 2015
…g to load classes (branch-1.2)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

(See #4944 for the corresponding PR for 1.3/1.4).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5174 from JoshRosen/executorclassloaderleak-branch-1.2 and squashes the following commits:

16e38fe [Josh Rosen] [SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR)
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Apr 15, 2015
…g to load classes (branch-1.2)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

(See apache#4944 for the corresponding PR for 1.3/1.4).

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#5174 from JoshRosen/executorclassloaderleak-branch-1.2 and squashes the following commits:

16e38fe [Josh Rosen] [SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR)
dongjoon-hyun pushed a commit that referenced this pull request Mar 4, 2025
…imeoutMillis` from `ExecutorClassLoader`

### What changes were proposed in this pull request?
This pr aims to remove unused `private[executor] var httpUrlConnectionTimeoutMillis` from `ExecutorClassLoader`.

The definition of `httpUrlConnectionTimeoutMillis` was introduced in SPARK-6209 (#4944) and was used by the function `getClassFileInputStreamFromHttpServer`. After SPARK-23538(#20723), the function `getClassFileInputStreamFromHttpServer` was removed from `ExecutorClassLoader`, and `httpUrlConnectionTimeoutMillis` is no longer used either.

### Why are the changes needed?
Code cleanup

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Action

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50152 from LuciferYang/minor-ExecutorClassLoader-var.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
…imeoutMillis` from `ExecutorClassLoader`

### What changes were proposed in this pull request?
This pr aims to remove unused `private[executor] var httpUrlConnectionTimeoutMillis` from `ExecutorClassLoader`.

The definition of `httpUrlConnectionTimeoutMillis` was introduced in SPARK-6209 (apache#4944) and was used by the function `getClassFileInputStreamFromHttpServer`. After SPARK-23538(apache#20723), the function `getClassFileInputStreamFromHttpServer` was removed from `ExecutorClassLoader`, and `httpUrlConnectionTimeoutMillis` is no longer used either.

### Why are the changes needed?
Code cleanup

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Action

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#50152 from LuciferYang/minor-ExecutorClassLoader-var.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
anoopj pushed a commit to anoopj/spark that referenced this pull request Mar 15, 2025
…imeoutMillis` from `ExecutorClassLoader`

### What changes were proposed in this pull request?
This pr aims to remove unused `private[executor] var httpUrlConnectionTimeoutMillis` from `ExecutorClassLoader`.

The definition of `httpUrlConnectionTimeoutMillis` was introduced in SPARK-6209 (apache#4944) and was used by the function `getClassFileInputStreamFromHttpServer`. After SPARK-23538(apache#20723), the function `getClassFileInputStreamFromHttpServer` was removed from `ExecutorClassLoader`, and `httpUrlConnectionTimeoutMillis` is no longer used either.

### Why are the changes needed?
Code cleanup

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Action

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#50152 from LuciferYang/minor-ExecutorClassLoader-var.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants