ConnectionPool cleanup is eager and results in unnecessary connection churn and (potentially) port exhaustion #4354

Closed
nicktrav opened this issue Nov 1, 2018 · 8 comments

@nicktrav

nicktrav commented Nov 1, 2018

Currently, OkHttp reaps "idle" connections (i.e. connections not in use) if either of the following is true:

  • a connection has not been used in a period greater than the idle timeout, or
  • the number of idle connections is greater than some threshold
      if (longestIdleDurationNs >= this.keepAliveDurationNs
          || idleConnectionCount > this.maxIdleConnections) {
        // We've found a connection to evict. Remove it from the list, then close it below (outside
        // of the synchronized block).
        connections.remove(longestIdleConnection);
      } 

This policy favors keeping the number of idle connections at or below the maxIdleConnections value.

We've run into situations internally where multiple threads (roughly 60) use a single OkHttpClient instance to issue concurrent requests to an upstream, yet many of the connections (and their underlying TCP sockets) are closed very soon after their initial use. Over time the host can run out of ephemeral ports, since closed sockets sit in TIME_WAIT for a period during which their ports cannot be reused, and eventually connect() syscalls start to fail.

One workaround is to set the maxIdleConnections value high enough that connections can be reused in high request-rate environments. However, those connections will then not be closed if the request rate falls.
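For reference, here's a minimal sketch of that workaround (the class name and the pool size / keep-alive values are purely illustrative, not recommendations):

    import java.util.concurrent.TimeUnit;
    import okhttp3.ConnectionPool;
    import okhttp3.OkHttpClient;

    public final class LargePoolClient {
      public static OkHttpClient create() {
        // Illustrative values only: maxIdleConnections raised well above the
        // expected number of concurrent requests; keep-alive left at 5 minutes.
        ConnectionPool pool = new ConnectionPool(128, 5, TimeUnit.MINUTES);
        return new OkHttpClient.Builder()
            .connectionPool(pool)
            .build();
      }
    }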

Maybe this is by design? I'd posit that the principle of least surprise would have one assume that a connection is only closed if it has been idle for some extended period of time. Indeed that has caught a few of us by surprise.

An alternative we'd like to propose is that connections are closed only if an eligible connection exists (one idle past the keep-alive duration) and the maxIdleConnections value has been exceeded, i.e. change the OR to an AND in the code above. This would allow the connection pool to grow with the load during bursts, and to shrink when the connections are not being utilized.
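Concretely, the proposed check would look something like the following (a sketch against the snippet above, not a tested patch):

    if (longestIdleDurationNs >= this.keepAliveDurationNs
        && idleConnectionCount > this.maxIdleConnections) {
      // Proposed behavior: only evict when this connection has sat idle past the
      // keep-alive duration AND the pool is over its idle-connection limit.
      connections.remove(longestIdleConnection);
    }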

Here's a small reproducer we put together where we observe the port exhaustion issue:
https://github.com/nicktrav/okhttp/commits/nickt.time-wait-issue

This has also been mentioned here, but the comments didn't seem to point at a resolution:
https://stackoverflow.com/questions/41011287/why-okhttp-doesnt-reuse-its-connections

@nicktrav
Author

nicktrav commented Nov 1, 2018

cc: @rpeng

@swankjesse
Member

We really need to close connections that have exceeded the keep-alive duration. We evict these connections not to conserve resources but to prevent later failures caused by using stale connections.

I think the right fix here is to dramatically increase maxIdleConnections; perhaps to 1,000 or more. Closing non-stale connections is merely about conserving memory, and memory is cheap! Probably ~64 KiB per socket, depending on system buffer sizes.

And we should also change the default value in OkHttp. Perhaps a default of 64? That would be 4 MiB of memory if my socket math is right, which seems like the right order of magnitude.

@swankjesse
Member

Next steps:

  • Measure on Android and JVM how much memory an idle socket needs.
  • Set a target idle connection count based on that per-socket memory requirement. Probably between 1 and 8 MiB is the right range (see the sketch after this list).
  • Document this change and announce it loudly. Provide guidance for users on what they should use if not our new defaults.
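As a rough illustration of that sizing step (the 4 MiB budget and ~64 KiB per-socket cost below are the assumptions from this thread, not measurements):

    public final class PoolSizing {
      public static void main(String[] args) {
        // Back-of-envelope sketch: derive an idle-connection count from a memory
        // budget and a per-socket cost; real numbers would come from the
        // measurements described above.
        long memoryBudgetBytes = 4L * 1024 * 1024;   // assumed budget: 4 MiB
        long perSocketBytes = 64L * 1024;            // assumed cost: ~64 KiB per idle socket
        long targetIdleConnections = memoryBudgetBytes / perSocketBytes;
        System.out.println(targetIdleConnections);   // prints 64
      }
    }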

@swankjesse
Member

swankjesse commented Nov 2, 2018

One secondary factor: each socket has a cost on the server as well. Growing our pools will grow the memory used by servers!

@nicktrav
Author

nicktrav commented Nov 2, 2018

I definitely agree with you for the most part on your proposals!

One part I’d push back on a little is:

Closing non-stale connections is merely about conserving memory, and memory is cheap!

We're seeing that's not really the case for our clients, as a) it eats up all of the ephemeral ports on the server the client is running on, given we're opening tons of sockets and then closing them immediately (the ports can't be reused while they're in TIME_WAIT), and b) there's a non-trivial CPU cost for new sockets that have to negotiate TLS when using HTTPS.

Maybe the docs just need to be a little clearer? It took a while to work out why the sockets were being closed immediately, and that was a little surprising, given our experience with other servers that tend to close a connection only when idle connections > max value AND the connection has been idle for a while.

For now, at least, we can get around this on our end by bumping up the max idle connection count, as you suggest.

@rpeng

rpeng commented Nov 2, 2018

@swankjesse I think the defaults here are reasonable for mobile clients, i.e. I doubt an Android app would really require more than 5 simultaneous connections anywhere.

However, for server-side usage there are far fewer memory and battery-life constraints to worry about. It might be a good idea to have different sane defaults for separate use cases (i.e. client side vs. server side, similar to how Java has -server and -client modes).

In reality, every user is going to have a somewhat different use case, and would likely need to do some profiling to come up with reasonable values for idle pool size and keep-alive time.

@swankjesse swankjesse added this to the 3.12 milestone Nov 4, 2018
@swankjesse swankjesse added enhancement Feature not a bug documentation Documentation task labels Nov 4, 2018
@swankjesse
Member

Punting to the next release ’cause there are no code changes here.

@yschimke
Collaborator

Closing as effectively a dupe of #4530
