
Higher memory usage in pre-0.12 version using HTTP2 compared to h2 branch #1491

Closed
pimeys opened this issue Apr 18, 2018 · 13 comments
Labels
A-client (Area: client.) C-bug (Category: bug. Something is wrong. This is bad!)

Comments

@pimeys

pimeys commented Apr 18, 2018

We've been using the h2 branch since its early stages, and a couple of days ago we upgraded to the pre-0.12 master. Something in the master branch uses more RAM than the old version, possibly the new pool. Our use case is many concurrent open connections, kept alive and using http2.

Here's an image showing the regression:

[image: hyper_ram — RAM usage graph]

The deployment was on 16.4.2018, around 11am local time. The noticeable growth also happens when we hit the keep-alive timeout and then get lots of requests through a new connection, so the culprit might be how the dead connections are disposed of in the pool.
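
For reference, the setup is roughly of this shape (a simplified sketch, not our actual code; the builder calls assume the hyper 0.12 pre-release API, and hyper-tls is assumed for the TLS connector):

```rust
extern crate hyper;
extern crate hyper_tls;

use std::time::Duration;

use hyper::client::HttpConnector;
use hyper::Client;
use hyper_tls::HttpsConnector;

// Builds a client that keeps HTTP/2 connections pooled and alive so they
// can be reused across requests to the same origin.
fn build_client() -> Client<HttpsConnector<HttpConnector>> {
    let https = HttpsConnector::new(4).expect("TLS connector setup failed");
    Client::builder()
        // Speak only HTTP/2 to the upstream (api.push.apple.com).
        .http2_only(true)
        // Keep idle connections in the pool for up to 10 minutes.
        .keep_alive(true)
        .keep_alive_timeout(Duration::from_secs(600))
        .build(https)
}
```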

@seanmonstar
Member

Are these http1 or http2 connections?

@pimeys
Author

pimeys commented Apr 18, 2018

Http2. I checked what was updated in the deployment:

hyper went from the h2 branch to 0.12.0-pre.0 (HEAD), and tokio-core 0.1.16 was replaced with tokio 0.1.5. So the changes in the deployment are quite minimal, which is why the main culprit I can think of is hyper's new Pool implementation.

We'll monitor how it looks after a couple of days to see whether it settles at some level of memory usage or keeps growing.

@seanmonstar
Member

If at all possible, logs from hyper::client::pool and hyper::proto::h2 would probably also help.
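
One way to capture just those targets (a sketch that assumes hyper's log output is collected with the env_logger crate; the binary name and log file are placeholders):

```rust
// Run with:
//   RUST_LOG=hyper::client::pool=trace,hyper::proto::h2=trace ./your-app 2>&1 | tee hyper.log
extern crate env_logger;

fn main() {
    // Picks up the RUST_LOG filter above from the environment and prints
    // matching log records to stderr.
    env_logger::init();
    // ... start the client / application here ...
}
```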

@pimeys
Author

pimeys commented Apr 18, 2018

OK, so I have this running and logging with tee for some hours with production traffic on a test machine.

@pimeys
Author

pimeys commented Apr 18, 2018

First logs; if you need more trace points, please let me know. I could run this longer, but right now the production instances have freed some memory (about a 10% drop), so at least they're not going to OOM. Something just seems to use more RAM than before...

With long keep-alive timeout, reusing connections:

TRACE 2018-04-18T18:59:05Z: hyper::client::pool: checkout waiting for idle connection: ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T18:59:05Z: hyper::client::connect: Http::connect(https://api.push.apple.com/3/device/###########)
DEBUG 2018-04-18T18:59:05Z: hyper::client::dns: resolving host="api.push.apple.com", port=443
DEBUG 2018-04-18T18:59:05Z: hyper::client::connect: connecting to 17.188.130.151:443
TRACE 2018-04-18T18:59:06Z: hyper::client::pool: Pool::put ("https://api.push.apple.com", Http2)
DEBUG 2018-04-18T18:59:06Z: hyper::client::pool: pooling idle connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T18:59:06Z: hyper::proto::h2: send body chunk: 404B
TRACE 2018-04-18T18:59:06Z: hyper::proto::h2: send body eos
TRACE 2018-04-18T18:59:06Z: hyper::client::pool: Pool::put; existing idle HTTP/2 connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T18:59:42Z: hyper::client::pool: take? ("https://api.push.apple.com", Http2): expiration = Some(Duration { secs: 600, nanos: 0 })
DEBUG 2018-04-18T18:59:42Z: hyper::client::pool: reuse idle connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T18:59:42Z: hyper::proto::h2: send body chunk: 660B
TRACE 2018-04-18T18:59:42Z: hyper::proto::h2: send body eos
TRACE 2018-04-18T18:59:42Z: hyper::client::pool: Pool::put; existing idle HTTP/2 connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:00:13Z: hyper::client::pool: take? ("https://api.push.apple.com", Http2): expiration = Some(Duration { secs: 600, nanos: 0 })
DEBUG 2018-04-18T19:00:13Z: hyper::client::pool: reuse idle connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:00:13Z: hyper::proto::h2: send body chunk: 480B
TRACE 2018-04-18T19:00:13Z: hyper::proto::h2: send body eos
TRACE 2018-04-18T19:00:13Z: hyper::client::pool: Pool::put; existing idle HTTP/2 connection for ("https://api.push.apple.com", Http2)

With short keep-alive timeout, creating new connections:

TRACE 2018-04-18T19:06:44Z: hyper::client::pool: checkout waiting for idle connection: ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:06:44Z: hyper::client::connect: Http::connect(https://api.push.apple.com/3/device/############)
DEBUG 2018-04-18T19:06:44Z: hyper::client::dns: resolving host="api.push.apple.com", port=443
DEBUG 2018-04-18T19:06:44Z: hyper::client::connect: connecting to 17.188.138.24:443
TRACE 2018-04-18T19:06:45Z: hyper::client::pool: Pool::put ("https://api.push.apple.com", Http2)
DEBUG 2018-04-18T19:06:45Z: hyper::client::pool: pooling idle connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:06:45Z: hyper::proto::h2: send body chunk: 440B
TRACE 2018-04-18T19:06:45Z: hyper::proto::h2: send body eos
TRACE 2018-04-18T19:06:45Z: hyper::client::pool: Pool::put; existing idle HTTP/2 connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:07:04Z: hyper::proto::h2::client: client tx dropped
TRACE 2018-04-18T19:07:24Z: hyper::client::pool: checkout waiting for idle connection: ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:07:24Z: hyper::client::connect: Http::connect(https://api.push.apple.com/3/device/##############)
DEBUG 2018-04-18T19:07:24Z: hyper::client::dns: resolving host="api.push.apple.com", port=443
DEBUG 2018-04-18T19:07:24Z: hyper::client::connect: connecting to 17.188.138.24:443
TRACE 2018-04-18T19:07:25Z: hyper::client::pool: Pool::put ("https://api.push.apple.com", Http2)
DEBUG 2018-04-18T19:07:25Z: hyper::client::pool: pooling idle connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:07:25Z: hyper::proto::h2: send body chunk: 404B
TRACE 2018-04-18T19:07:25Z: hyper::proto::h2: send body eos
TRACE 2018-04-18T19:07:25Z: hyper::client::pool: Pool::put; existing idle HTTP/2 connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:07:42Z: hyper::client::pool: take? ("https://api.push.apple.com", Http2): expiration = Some(Duration { secs: 10, nanos: 0 })
TRACE 2018-04-18T19:07:42Z: hyper::client::pool: remove unacceptable pooled connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:07:42Z: hyper::client::pool: checkout waiting for idle connection: ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:07:42Z: hyper::proto::h2::client: client tx dropped
TRACE 2018-04-18T19:07:42Z: hyper::client::connect: Http::connect(https://api.push.apple.com/3/device/#######################)
DEBUG 2018-04-18T19:07:42Z: hyper::client::dns: resolving host="api.push.apple.com", port=443
DEBUG 2018-04-18T19:07:42Z: hyper::client::connect: connecting to 17.188.138.24:443
TRACE 2018-04-18T19:07:42Z: hyper::client::pool: Pool::put ("https://api.push.apple.com", Http2)
DEBUG 2018-04-18T19:07:42Z: hyper::client::pool: pooling idle connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:07:42Z: hyper::proto::h2: send body chunk: 440B
TRACE 2018-04-18T19:07:42Z: hyper::proto::h2: send body eos
TRACE 2018-04-18T19:07:43Z: hyper::client::pool: Pool::put; existing idle HTTP/2 connection for ("https://api.push.apple.com", Http2)
TRACE 2018-04-18T19:07:54Z: hyper::proto::h2::client: client tx dropped
TRACE 2018-04-18T19:07:56Z: hyper::client::pool: pool closed, canceling idle interval

@seanmonstar
Member

The only thing I can think of is that something is holding on to a reference to the h2 streams type, perhaps stuck in a future that never polls as finished. Then eventually Apple closes the connection, and that triggers the internal h2 streams type to clean itself up.
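
Roughly, the shape of that hypothesis looks like the sketch below (the names are made up for illustration, and the closure stands in for a future that is never polled to completion; these are not hyper/h2's real types):

```rust
use std::sync::Arc;

// `Streams` is a stand-in for per-connection h2 stream state.
struct Streams {
    _buffers: Vec<u8>,
}

fn main() {
    let streams = Arc::new(Streams { _buffers: vec![0u8; 1024 * 1024] });

    // Some pending work captured a clone of the handle.
    let held = Arc::clone(&streams);
    let stalled_task = move || drop(held);

    drop(streams);
    // The ~1 MiB buffer is still allocated here: `stalled_task` owns a
    // strong reference through its captured clone.

    drop(stalled_task);
    // Only when the task itself is torn down (analogous to the remote
    // closing the connection) is the memory actually freed.
}
```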

I've pushed some minor cleanup to master, along with some clearer logs, if you want to take a look.

@seanmonstar added the C-bug (Category: bug. Something is wrong. This is bad!) and A-client (Area: client.) labels on Apr 18, 2018
@pimeys
Author

pimeys commented Apr 20, 2018

So some notes on the systems we're running with Hyper:

  • The one discussed here uses h2 connections, one per customer. We use long keep-alive timeouts, and basically the system grabs more memory without freeing it when we reuse an idle connection that needs to be reopened.
  • We also run a system with Hyper 0.11, using only h1 connections. There we hold just one pool and all customers reuse the same connections. We keep them alive, but don't see the same memory issues as with h2. My guess, though, is that the smaller number of connections and the constant generic traffic through them keeps them up all the time, unlike the h2 ones, which can be idle the whole night and then get more traffic in the morning, reopening them and causing the issue.

I can try Hyper 0.12 in our h1 system, but I really doubt I'll see higher memory usage.

@pimeys
Author

pimeys commented Apr 20, 2018

Oh, one more observation:

When we get a jump in RAM usage, it happens when a dead connection gets a big spike of requests. The memory leak is about 60–100 MB, which is much more than a couple of dead connections would account for (there should be about 6 per customer). I'd say something is not cleaned up when we race for the new connection in the pool, but I'm not sure.

@seanmonstar
Member

it happens when a dead connection gets a big spike of requests

What determines that it is dead? That the remote has closed the connection, but the pool hadn't yet cleared out the handle? Or that it reached its expiration and was waiting for the idle interval to clean it up?

@pimeys
Author

pimeys commented Apr 21, 2018

What determines that it is dead? That the remote has closed the connection, but the pool hadn't yet cleared out the handle? Or that it reached its expiration and was waiting for the idle interval to clean it up?

No, I mean that we have a keep-alive timeout of 10 minutes, and some customers don't use their connection overnight and then blast a massive number of requests, so the pools open new connections and we end up with hundreds of thousands of concurrent requests polling for new connections.

But like I said on IRC, I don't think this is really an issue that should block 0.12. We don't OOM, the memory is freed eventually, and the difference from the 0.11 h2 branch is roughly doubled RAM usage, which isn't nice, but we're talking about some hundreds of megabytes in our case.

@pimeys
Author

pimeys commented Apr 22, 2018

Monitoring the latest 0.12 over the weekend: the memory usage is higher, but it doesn't really grow above a certain limit. We also notice drops every now and then, evening it out. The graph starts from a deployment and ends at the current timestamp.

[image: pattern — RAM usage graph]

@pimeys
Author

pimeys commented May 2, 2018

The RAM usage had already been reduced by several earlier commits, but this one made the biggest difference, and now the usage actually gets lower the longer the system runs. I consider this fixed now.

@pimeys closed this as completed on May 2, 2018
@seanmonstar
Member

For reference, it was noticed that Weak::new() allocates space for the thing that the related Arc<T> would hold, even though Weak::new() can never upgrade to a strong reference by design. The fix in hyper was to create a single Weak::new() per pool, and clone that instead of creating Weak::new()s for every single waiter.

I've opened a PR to libstd to fix this easy-to-miss issue.
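
Roughly, the pattern looks like the sketch below (the type and method names are made up for illustration; hyper's actual pool internals are different):

```rust
use std::sync::{Arc, Weak};

// Stand-in for the pool's shared state.
struct PoolInner {
    // idle connections, waiters, ...
}

struct Pool {
    inner: Arc<PoolInner>,
    // Created once per pool. At the time, every call to `Weak::new()`
    // allocated the backing storage an `Arc<T>` would use, even though such
    // a Weak can never be upgraded, so sharing one instance avoids an
    // allocation per waiter.
    empty_waiter: Weak<PoolInner>,
}

impl Pool {
    fn new() -> Pool {
        Pool {
            inner: Arc::new(PoolInner {}),
            empty_waiter: Weak::new(),
        }
    }

    // Before the fix: each waiter called `Weak::new()` itself, allocating
    // every time. After: clone the single pre-made Weak, which only bumps
    // a reference count.
    fn waiter_placeholder(&self) -> Weak<PoolInner> {
        self.empty_waiter.clone()
    }
}

fn main() {
    let pool = Pool::new();
    let placeholder = pool.waiter_placeholder();
    // A Weak created with `Weak::new()` (or a clone of it) never upgrades.
    assert!(placeholder.upgrade().is_none());
    let _still_alive = Arc::clone(&pool.inner);
}
```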
