-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR) #4944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR) #4944
Conversation
Test build #28378 has started for PR 4944 at commit
|
Test build #28378 has finished for PR 4944 at commit
|
Test PASSed. |
This looks like the right thing to do. Thanks for submitting this! @ash211 @mingyukim care to take a look? |
connection.setConnectTimeout(httpUrlConnectionTimeoutMillis) | ||
connection.setReadTimeout(httpUrlConnectionTimeoutMillis) | ||
} | ||
connection.connect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too familiar with this API - but can this connection object ever leak even if the returned input-stream is closed? Or, can an exception occur before returning from this method and the connection leak that way? I'm guessing these aren't possible but just wanted to educate myself as to how these scenarios won't happen =)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the HttpURLConnection Javadoc:
Each HttpURLConnection instance is used to make a single request but the underlying network connection to the HTTP server may be transparently shared by other instances. Calling the close() methods on the InputStream or OutputStream of an HttpURLConnection after a request may free network resources associated with this instance but has no effect on any shared persistent connection. Calling the disconnect() method may close the underlying socket if a persistent connection is otherwise idle at that time.
This can cause problems in practice: https://scotte.github.io/2015/01/httpurlconnection-socket-leak/.
Actually, that blog post reminds me that I should probably call getErrorStream
and read that stream in the next error-handling block prior to calling disconnect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On closer inspection, it looks like you need to consume the errorStream if you want the underlying connection to be eligible for re-use, but we may not necessarily care about that here: http://docs.oracle.com/javase/6/docs/technotes/guides/net/http-keepalive.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about if an exception throws before we return the input stream? In that case the input stream is not closed but perhaps the connection is still open?
I'm basically asking, should we have a catch block for Throwable in this method after we instantiate the connection to close the connection if we run into an Exception before returning the input stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the connect()
call is unlikely to fail in ways that leak resources, but suppose it wouldn't hurt to add another try-catch
block just to be safe. I can do this tomorrow.
Test build #28633 has started for PR 4944 at commit
|
I've updated this with a new commit that adds another layer of |
Test build #28633 has finished for PR 4944 at commit
|
Test PASSed. |
I think this should be good to merge, but I'm going to leave it open for a little while longer in order to allow others to comment. |
Ping @andrewor14 (as requested). |
Yeah it looks okay to me but I also would feel more comfortable if a second core committer took a look. |
LGTM |
This goes into master and 1.3. |
…g to load classes (master branch PR) ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction. This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases. This PR is an extended version of #4935 and adds a regression test. Author: Josh Rosen <joshrosen@databricks.com> Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits: e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream 961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail 7ee2261 [Josh Rosen] Add a failing regression test e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader (cherry picked from commit 7215aa7) Signed-off-by: Andrew Or <andrew@databricks.com>
Thanks for the review. I'll open a separate PR for |
…g to load classes (master branch PR) ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction. This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases. This PR is an extended version of apache#4935 and adds a regression test. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits: e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream 961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail 7ee2261 [Josh Rosen] Add a failing regression test e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader (cherry picked from commit 7215aa7) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: repl/pom.xml repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala
I've opened #5174 for |
…g to load classes (branch-1.2) ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction. This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases. (See #4944 for the corresponding PR for 1.3/1.4). Author: Josh Rosen <joshrosen@databricks.com> Closes #5174 from JoshRosen/executorclassloaderleak-branch-1.2 and squashes the following commits: 16e38fe [Josh Rosen] [SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR)
…g to load classes (branch-1.2) ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction. This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases. (See apache#4944 for the corresponding PR for 1.3/1.4). Author: Josh Rosen <joshrosen@databricks.com> Closes apache#5174 from JoshRosen/executorclassloaderleak-branch-1.2 and squashes the following commits: 16e38fe [Josh Rosen] [SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR)
…imeoutMillis` from `ExecutorClassLoader` ### What changes were proposed in this pull request? This pr aims to remove unused `private[executor] var httpUrlConnectionTimeoutMillis` from `ExecutorClassLoader`. The definition of `httpUrlConnectionTimeoutMillis` was introduced in SPARK-6209 (#4944) and was used by the function `getClassFileInputStreamFromHttpServer`. After SPARK-23538(#20723), the function `getClassFileInputStreamFromHttpServer` was removed from `ExecutorClassLoader`, and `httpUrlConnectionTimeoutMillis` is no longer used either. ### Why are the changes needed? Code cleanup ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Action ### Was this patch authored or co-authored using generative AI tooling? No Closes #50152 from LuciferYang/minor-ExecutorClassLoader-var. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…imeoutMillis` from `ExecutorClassLoader` ### What changes were proposed in this pull request? This pr aims to remove unused `private[executor] var httpUrlConnectionTimeoutMillis` from `ExecutorClassLoader`. The definition of `httpUrlConnectionTimeoutMillis` was introduced in SPARK-6209 (apache#4944) and was used by the function `getClassFileInputStreamFromHttpServer`. After SPARK-23538(apache#20723), the function `getClassFileInputStreamFromHttpServer` was removed from `ExecutorClassLoader`, and `httpUrlConnectionTimeoutMillis` is no longer used either. ### Why are the changes needed? Code cleanup ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Action ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#50152 from LuciferYang/minor-ExecutorClassLoader-var. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…imeoutMillis` from `ExecutorClassLoader` ### What changes were proposed in this pull request? This pr aims to remove unused `private[executor] var httpUrlConnectionTimeoutMillis` from `ExecutorClassLoader`. The definition of `httpUrlConnectionTimeoutMillis` was introduced in SPARK-6209 (apache#4944) and was used by the function `getClassFileInputStreamFromHttpServer`. After SPARK-23538(apache#20723), the function `getClassFileInputStreamFromHttpServer` was removed from `ExecutorClassLoader`, and `httpUrlConnectionTimeoutMillis` is no longer used either. ### Why are the changes needed? Code cleanup ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Action ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#50152 from LuciferYang/minor-ExecutorClassLoader-var. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. See SPARK-6209 for more details, including a bug reproduction.
This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases.
This PR is an extended version of #4935 and adds a regression test.