
http_listener leaks memory #1823

Closed
johnrengelman opened this issue Sep 28, 2016 · 16 comments

Labels
bug unexpected problem or unintended behavior
Milestone
1.1.0

Comments

@johnrengelman
Contributor

I'm using the http_listener input that's currently on the master branch to run a middle tier of metric collection agents. These telegraf instances are not running any other input plugins. I'm sending a good-sized load to the machine, ~1400 metrics/second; the metric buffer holds < 10 metrics whenever it reports, and I'm sending batches of 1000 out to influx. Observing the memory profile for this instance, though, it grows steadily from ~300 MB at startup to over 6.5 GB before the process is killed for out-of-memory errors.

@johnrengelman
Contributor Author

Not sure whether this is actually directly related to http_listener, though. I've tried playing with both metric_batch_size and metric_buffer_limit, with neither having much of an effect.
I also tried adjusting interval without much help, but when I scale up the number of instances in my middle tier the memory increase slows/stops, so it's definitely related to the number of inbound metrics.

@sparrc sparrc added this to the 1.1.0 milestone Sep 28, 2016
@sparrc
Contributor

sparrc commented Sep 28, 2016

took a quick look at the code and it could be that we are not closing the request body here: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/http_listener/http_listener.go#L114
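For illustration, a minimal sketch of the kind of change being suggested, written as a stand-alone handler rather than the plugin's actual code (the handler name, /write route, and :8186 port are placeholders):

    package main

    import (
        "io/ioutil"
        "log"
        "net/http"
    )

    // writeHandler is a stand-in for the plugin's ServeHTTP method.
    func writeHandler(res http.ResponseWriter, req *http.Request) {
        // Proposed addition: explicitly close the request body when done.
        defer req.Body.Close()

        body, err := ioutil.ReadAll(req.Body)
        if err != nil {
            http.Error(res, "could not read request body", http.StatusBadRequest)
            return
        }
        _ = body // parse the metrics and hand them to the accumulator here
        res.WriteHeader(http.StatusNoContent)
    }

    func main() {
        http.HandleFunc("/write", writeHandler)
        log.Fatal(http.ListenAndServe(":8186", nil))
    }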

@sparrc sparrc added the bug unexpected problem or unintended behavior label Sep 28, 2016
@johnrengelman
Contributor Author

I'll try patching that locally. will report back in a few.

@johnrengelman
Contributor Author

Hmmm, defer req.Body.Close() doesn't appear to have an effect. According to the documentation, it seems that explicitly closing the Body is only necessary for the response body?

https://golang.org/pkg/net/http/#Request

        // Body is the request's body.
        //
        // For client requests a nil body means the request has no
        // body, such as a GET request. The HTTP Client's Transport
        // is responsible for calling the Close method.
        //
        // For server requests the Request Body is always non-nil
        // but will return EOF immediately when no body is present.
        // The Server will close the request body. The ServeHTTP
        // Handler does not need to.

@sparrc
Contributor

sparrc commented Sep 28, 2016

Yes, that looks correct; probably not that, in that case.

@sparrc
Contributor

sparrc commented Sep 28, 2016

@johnrengelman can you provide any more details about the load on the system? What size batches are you using? How many HTTP requests/s? Any example data that you are writing?

@johnrengelman
Contributor Author

johnrengelman commented Sep 29, 2016

We're using the defaults, though we have tried tweaking things up and down without much change.
batch size = 1000
buffer size = 10000
interval = 10s

Here's the logging from the instance:

2016/09/29 00:10:40 Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 43000. Total dropped metrics: 0.
2016/09/29 00:10:42 Output [influxdb] wrote batch of 1000 metrics in 1.644209561s
2016/09/29 00:10:46 Output [influxdb] wrote batch of 1000 metrics in 3.965049885s
2016/09/29 00:10:47 Output [influxdb] wrote batch of 1000 metrics in 1.616036773s
2016/09/29 00:10:47 Output [influxdb] wrote batch of 1000 metrics in 292.091572ms
2016/09/29 00:10:48 Output [influxdb] wrote batch of 1000 metrics in 250.314547ms
2016/09/29 00:10:48 Output [influxdb] wrote batch of 1000 metrics in 248.396463ms
2016/09/29 00:10:48 Output [influxdb] wrote batch of 1000 metrics in 241.970001ms
2016/09/29 00:10:48 Output [influxdb] wrote batch of 1000 metrics in 237.192836ms
2016/09/29 00:10:49 Output [influxdb] wrote batch of 1000 metrics in 244.940211ms
2016/09/29 00:10:49 Output [influxdb] wrote batch of 1000 metrics in 236.301278ms
2016/09/29 00:10:49 Output [influxdb] wrote batch of 1000 metrics in 258.613349ms
2016/09/29 00:10:49 Output [influxdb] wrote batch of 1000 metrics in 233.552628ms
2016/09/29 00:10:50 Output [influxdb] wrote batch of 1000 metrics in 396.665383ms
2016/09/29 00:10:50 Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 56000. Total dropped metrics: 0.
2016/09/29 00:10:50 Output [influxdb] wrote batch of 1000 metrics in 674.296975ms
2016/09/29 00:10:52 Output [influxdb] wrote batch of 1000 metrics in 1.351249435s
2016/09/29 00:10:53 Output [influxdb] wrote batch of 1000 metrics in 1.378980632s
2016/09/29 00:10:54 Output [influxdb] wrote batch of 1000 metrics in 791.52784ms
2016/09/29 00:10:54 Output [influxdb] wrote batch of 1000 metrics in 438.960893ms
2016/09/29 00:10:55 Output [influxdb] wrote batch of 1000 metrics in 286.716465ms
2016/09/29 00:10:55 Output [influxdb] wrote batch of 1000 metrics in 239.88435ms
2016/09/29 00:10:55 Output [influxdb] wrote batch of 1000 metrics in 239.875282ms
2016/09/29 00:10:55 Output [influxdb] wrote batch of 1000 metrics in 243.32352ms
2016/09/29 00:10:56 Output [influxdb] wrote batch of 1000 metrics in 262.194692ms
2016/09/29 00:10:56 Output [influxdb] wrote batch of 1000 metrics in 300.382425ms
2016/09/29 00:10:56 Output [influxdb] wrote batch of 1000 metrics in 266.239452ms
2016/09/29 00:10:57 Output [influxdb] wrote batch of 1000 metrics in 423.428468ms
2016/09/29 00:10:57 Output [influxdb] wrote batch of 1000 metrics in 267.4939ms
2016/09/29 00:10:57 Output [influxdb] wrote batch of 1000 metrics in 240.087714ms
2016/09/29 00:10:58 Output [influxdb] wrote batch of 1000 metrics in 385.728349ms
2016/09/29 00:10:58 Output [influxdb] wrote batch of 1000 metrics in 366.220023ms
2016/09/29 00:10:58 Output [influxdb] wrote batch of 1000 metrics in 306.152998ms
2016/09/29 00:10:59 Output [influxdb] wrote batch of 1000 metrics in 304.323952ms
2016/09/29 00:10:59 Output [influxdb] wrote batch of 1000 metrics in 381.284492ms
2016/09/29 00:11:00 Output [influxdb] wrote batch of 1000 metrics in 550.000749ms
2016/09/29 00:11:00 Output [influxdb] wrote batch of 1000 metrics in 349.460967ms
2016/09/29 00:11:00 Output [influxdb] buffer fullness: 4 / 10000 metrics. Total gathered metrics: 78004. Total dropped metrics: 0.
2016/09/29 00:11:00 Output [influxdb] wrote batch of 4 metrics in 137.776193ms

According to the load balancer stats, we're peaking around 30 req/s into the instance.

@johnrengelman
Contributor Author

it appears the leak is related to adding the metrics to the accumulator.
When I remove this line:

t.acc.AddFields(m.Name(), m.Fields(), m.Tags(), m.Time())
then the memory does not climb.

@johnrengelman
Contributor Author

Well, what the heck. I went back to the current master code after testing the change above, deployed that, and now I'm not having issues :-/
I'm going to keep watching this. It doesn't appear that I'm getting the load I was earlier, so I'll be curious what happens tomorrow.

sparrc added a commit that referenced this issue Sep 29, 2016
this will prevent potentially very large allocations due to a very large
chunk size sent from a client.

might fix all or part of #1823
@sparrc
Contributor

sparrc commented Sep 29, 2016

@johnrengelman

@kostasb and I have both tried to reproduce the issue using "normal" batch sizes without success. We have tried with different http batch sizes and write speeds, ranging from 10,000 points/s with 10 point batches up to 80,000 points/s with 5000 point batches.

One thing we have found though is that the http_listener is currently reading in and allocating a byte buffer for the entire http request. This means that it could create very large allocations (and memory usage) if you are writing to it with very large batch sizes. I have pushed up a change to chunk the incoming http request body using a bufio.Scanner: #1826
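For illustration, a rough sketch of that chunked-read approach as a stand-alone handler, assuming a placeholder parseLine function in place of the real line-protocol parser and accumulator (the route, port, and 64 KB cap are illustrative, not the actual #1826 code):

    package main

    import (
        "bufio"
        "log"
        "net/http"
    )

    // parseLine is a placeholder for the line-protocol parser; the real
    // plugin would parse each line and add the metrics to the accumulator.
    func parseLine(line []byte) {
        log.Printf("would parse %d bytes", len(line))
    }

    // writeHandler reads the request body in bounded chunks instead of
    // buffering the whole thing, so one oversized write cannot force a
    // huge allocation.
    func writeHandler(res http.ResponseWriter, req *http.Request) {
        scanner := bufio.NewScanner(req.Body)
        // Cap the per-line buffer; anything longer is rejected rather
        // than allocated.
        scanner.Buffer(make([]byte, 64*1024), 64*1024)
        for scanner.Scan() {
            parseLine(scanner.Bytes())
        }
        if err := scanner.Err(); err != nil {
            http.Error(res, err.Error(), http.StatusBadRequest)
            return
        }
        res.WriteHeader(http.StatusNoContent)
    }

    func main() {
        http.HandleFunc("/write", writeHandler)
        log.Fatal(http.ListenAndServe(":8186", nil))
    }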

@johnrengelman
Contributor Author

Thanks @sparrc. Yeah, the behavior is odd. Like I said, I was testing a few custom builds of telegraf last night, and when I went back to the original the problem stopped exhibiting itself. There are some interesting ELB metrics (the ELBs front my middle tier) that I'm still wrapping my head around.

I think the change for chunking http requests will be very helpful. I do think we have some fairly large metric reports from instances (especially instances that are part of our container cluster, since they report metrics for all containers running on the instance).

Thanks for the additional testing. Feel free to close this issue if you want; I'll keep monitoring my system, and if I see this behavior again perhaps I can capture more data.

@sparrc sparrc closed this as completed in 78ced6b Sep 29, 2016
@sparrc
Contributor

sparrc commented Sep 29, 2016

OK, marking this as closed by #1826, as there was definitely potential there for large batches of writes to OOM the machine.

Feel free to open a new issue if you are able to reproduce after #1826

jackzampolin pushed a commit that referenced this issue Oct 7, 2016
this will prevent potentially very large allocations due to a very large
chunk size sent from a client.

fixes #1823
@supershal

We also faced the same issue last week; however, we were running code that does not include the #1826 fix. We are using http_listener as a relay, routing traffic from around 500 servers with a batch size of 1000. The telegraf instances were going OOM after a while and crashing.
CPU and memory usage climbed steadily after a short while, and the logs had a lot of request timeout errors from reading the body:

log.Printf("Problem reading request: [%s], Error: %s\n", string(body), err)

After some digging I found that ioutil.ReadAll(req.Body) starts with a 512-byte buffer to read the payload; it seems the GC was not reclaiming the memory from those reads fast enough.
This issue can be reproduced by reducing the read timeout of the HttpListener.

You will have to change the code where it clamps the minimum read timeout to 10s:

    if t.ReadTimeout.Duration < time.Second {
        t.ReadTimeout.Duration = time.Second * 10
    }
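For illustration, one rough way to exercise that timeout path once the clamp is lowered: open a raw connection to the listener, advertise a large Content-Length, and drip the body slowly so the server's read deadline expires mid-request. The localhost:8186 address, payload, and sleep interval below are assumptions for the sketch, not taken from our setup:

    package main

    import (
        "fmt"
        "log"
        "net"
        "time"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:8186")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Advertise a body far larger than what will actually arrive
        // before the lowered read timeout fires.
        fmt.Fprint(conn, "POST /write HTTP/1.1\r\nHost: localhost\r\nContent-Length: 1000000\r\n\r\n")
        for i := 0; i < 100; i++ {
            // Write tiny fragments with pauses longer than the read timeout.
            fmt.Fprintf(conn, "cpu,host=slow value=%d %d\n", i, time.Now().UnixNano())
            time.Sleep(500 * time.Millisecond)
        }
    }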

I just updated my fork with #1826 and I will be testing the fix today. If the problem persists, I will open a new issue.
Also, I tried to reproduce it with a unit test, but the test was just hanging. I'm trying to figure out how to write a unit test for HTTP timeouts.

PS: does anyone know why timeout errors are ignored in stoppableListener, and what the impact of ignoring them would be?
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/http_listener/stoppableListener/listener.go#L51

@johnrengelman
Contributor Author

@supershal we would see that read timeout error message as well in our logs. I think you're on to something there.

@sparrc
Contributor

sparrc commented Oct 10, 2016

nice sleuthing @supershal 👍 , let us know what you find

@supershal

@sparrc @johnrengelman I tested #1826 with a large batch size and ran into the same metric parsing issue as #1856. I think the parsing issues were a result of that bugfix.

I have submitted #1892.
This fix handles incoming requests the same way as influxdb's HTTP write handler: it reads the Content-Length header, preallocates a buffer of that size, and reads all the data before parsing it with the scanner. We have run some stress tests on the fix and have been running the fixed version for two days without any issue. The telegraf instances do not seem to leak memory, memory usage remains low, all metrics are parsed properly, and there is no data drop.
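For illustration, a rough sketch of that Content-Length approach as a stand-alone handler (the names, the size cap, and the port are placeholders, not the actual #1892 code):

    package main

    import (
        "io"
        "log"
        "net/http"
    )

    // Illustrative cap so a client cannot force an arbitrarily large
    // allocation just by advertising a huge Content-Length.
    const maxBodySize = 64 * 1024 * 1024

    func writeHandler(res http.ResponseWriter, req *http.Request) {
        if req.ContentLength <= 0 || req.ContentLength > maxBodySize {
            http.Error(res, "missing or oversized Content-Length", http.StatusBadRequest)
            return
        }
        // Preallocate once, read the whole body, then hand it to the parser.
        buf := make([]byte, req.ContentLength)
        if _, err := io.ReadFull(req.Body, buf); err != nil {
            http.Error(res, "could not read request body", http.StatusBadRequest)
            return
        }
        _ = buf // parse the batch with the line-protocol parser here
        res.WriteHeader(http.StatusNoContent)
    }

    func main() {
        http.HandleFunc("/write", writeHandler)
        log.Fatal(http.ListenAndServe(":8186", nil))
    }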

You might have to tune the plugin's read timeout depending on the incoming load.

Since the plugin is used for batching the incoming metrics, you have to make sure all incoming metrics have timestamps with nanosecond precision. I spent most of my time figuring out "data loss" that was caused by timestamp precision: some of my instances were running an older telegraf (0.13) and sending timestamps with second precision, while the latest telegraf version, which I am using for the relay, ignores the precision setting and only sends precision as ns.
This made a lot of metrics appear to be dropped from influxdb. We spent too much time finding this root cause. :)
