http_listener leaks memory #1823
I'm using the http_listener input that's currently on the master branch to run a middle tier of metric collection agents. These telegraf instances are not running any other input plugins. I'm sending a good-sized load to the machine, ~1400 metrics/second; the metric buffer will have < 10 metrics any time it reports, and I'm sending batches of size 1000 out to influx. Observing the memory profile for this instance, though, it grows steadily from ~300 MB at startup to over 6.5 GB before the process is killed for out-of-memory errors.
Not sure if this is actually directly associated to
Took a quick look at the code and it could be that we are not closing the request body here: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/http_listener/http_listener.go#L114
I'll try patching that locally. Will report back in a few.
Hmmm, https://golang.org/pkg/net/http/#Request
Yes, that looks correct, so it's probably not that, in that case.
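For context, a rough sketch (not the actual plugin code; the handler name and endpoint are placeholders) of the pattern under discussion — the handler reads the whole body, and per the linked net/http docs the Server closes the request body itself, so an explicit Close in the handler isn't required:

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
)

// handleWrite is a simplified stand-in for the http_listener handler.
// It reads the entire request body into memory, which is the pattern
// examined later in this thread as a source of large allocations.
func handleWrite(w http.ResponseWriter, r *http.Request) {
	body, err := ioutil.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "unable to read request body", http.StatusBadRequest)
		return
	}
	_ = body // the real plugin parses line protocol here and adds metrics to the accumulator

	// No explicit r.Body.Close(): the net/http Server closes the request
	// body for the handler, as the documentation linked above states.
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/write", handleWrite)
	log.Fatal(http.ListenAndServe(":8186", nil))
}
```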
@johnrengelman can you provide any more details about the load on the system? What size batches are you using? How many HTTP requests/s? Any example data that you are writing?
We're using the defaults, though we have tried tweaking things up and down without much change. Here's the logging from the instance:
According to the load balancer stats, we're peaking around 30 req/s into the instance.
It appears the leak is related to adding the metrics to the accumulator.
Well, what the heck. Now I went back to the current master code after testing the change above, deployed that, and I'm not having issues :-/
This will prevent potentially very large allocations due to a very large chunk size sent from a client. Might fix all or part of #1823
@kostasb and I have both tried to reproduce the issue using "normal" batch sizes without success. We have tried with different HTTP batch sizes and write speeds, ranging from 10,000 points/s with 10-point batches up to 80,000 points/s with 5000-point batches. One thing we have found, though, is that the http_listener is currently reading in and allocating a byte buffer for the entire HTTP request. This means that it could create very large allocations (and memory usage) if you are writing to it with very large batch sizes. I have pushed up a change to chunk the incoming HTTP request body.
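As a rough illustration of that idea (a sketch assuming line-protocol input, not the code from the actual pull request; names and limits are made up), the body can be consumed in bounded chunks instead of one ioutil.ReadAll call:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
)

// maxLineSize caps how much memory a single line (one line-protocol point)
// may consume; the value is illustrative, not telegraf's actual limit.
const maxLineSize = 64 * 1024

// consumeInChunks reads the body line by line with a bounded buffer instead
// of buffering the entire request, so one huge batch cannot force one huge
// allocation.
func consumeInChunks(body io.Reader, handle func([]byte)) error {
	scanner := bufio.NewScanner(body)
	scanner.Buffer(make([]byte, 4096), maxLineSize)
	for scanner.Scan() {
		handle(scanner.Bytes())
	}
	return scanner.Err()
}

func main() {
	// Stand-in for an incoming HTTP request body containing two points.
	body := strings.NewReader("cpu,host=a usage=1 1472000000000000000\ncpu,host=b usage=2 1472000000000000000\n")
	err := consumeInChunks(body, func(line []byte) {
		fmt.Printf("parsed point: %s\n", line)
	})
	if err != nil {
		log.Fatal(err)
	}
}
```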
Thanks @sparrc. Yeah, the behavior is odd. Like I said, I was testing a few custom builds of telegraf last night, and when I went back to the original, the problem wasn't exhibiting itself anymore. There are some interesting ELB metrics (which are fronting my middle tier) that I'm still wrapping my head around to understand. I think the change for batching HTTP requests will be very helpful. I do think we have some fairly large metric reports from instances (especially instances that are part of our container cluster, so they are reporting metrics for all containers running on the instance). Thanks for the additional testing. Feel free to close this issue if you want, and I'll keep monitoring my system; if I see this behavior again, perhaps I can capture more data.
This will prevent potentially very large allocations due to a very large chunk size sent from a client. Fixes #1823
We also faced the same issue last week; however, we were running code that does not include the #1826 fix. We are using http_listener as a relay and routing traffic from around 500 servers with a batch size of 1000. The telegraf instances were going OOM after a while and crashing.
After some digging I found that ioutil.ReadAll(req.Body) uses a 512-byte buffer to read the payload. It seems like the GC was not collecting memory back fast enough from reading the bytes. You will have to change the code where it sets the minimum read timeout to 10s. I just updated my fork with #1826 and I will be testing the fix today. If the problem persists, I will open a new issue. PS: does anyone know why timeout errors are ignored in stoppableListener, and what the impact of ignoring them would be?
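For the timeout side of that comment, a minimal sketch of where such a read timeout lives in standard net/http; how the plugin actually wires this up, and whether the 10s value is configurable there, isn't shown in this thread:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/write", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	})

	srv := &http.Server{
		Addr: ":8186",
		// If large batches regularly take longer than this to arrive, slow
		// clients get cut off mid-read and can surface as the ignored
		// timeout errors mentioned above.
		ReadTimeout:  10 * time.Second,
		WriteTimeout: 10 * time.Second,
		Handler:      mux,
	}
	log.Fatal(srv.ListenAndServe())
}
```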
@supershal we would see that read timeout error message as well in our logs. I think you're on to something there.
Nice sleuthing @supershal 👍, let us know what you find.
@sparrc @johnrengelman I tested #1826 with a large batch size. I ran into the same metric parsing issue as #1856; I think the parsing issues were a result of the bugfix. I have submitted #1892. You might have to tune the plugin's read timeout depending on incoming load. Since the plugin is used for batching the incoming metrics, you have to make sure all incoming metrics have timestamps with nanosecond precision. I spent most of the time figuring out "data loss" caused by timestamp precision. Some of my instances were running an older telegraf (0.13) and sending timestamps with second precision. However, the latest telegraf version, which I am using for the relay, ignores the incoming precision and only sends a single precision.
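A small illustration of the precision mismatch described above (measurement, tags, and values are made up):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now()

	// Line protocol with a nanosecond-precision timestamp, which is what
	// the relay setup in this thread effectively expects.
	fmt.Printf("cpu,host=web01 usage=0.64 %d\n", now.UnixNano())

	// The same point with only second precision, as an older agent might
	// send it. If a receiver interprets this value as nanoseconds, the
	// point lands near the Unix epoch, which is one way such a mismatch
	// can look like data loss.
	fmt.Printf("cpu,host=web01 usage=0.64 %d\n", now.Unix())
}
```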