"Metric buffer overflow" with telegraf 1.19.1, not on version 1.17.0. #9531
Comments
Can you please post the exact error message on v1.19.1? Furthermore, does your setup work with v1.18.3?
We get messages like this: "Metric buffer overflow; 2584 metrics have been dropped". We tried increasing the buffer limit to 20000, and even that didn't help. The batch size is currently set to 1000 and the interval is set to 10 seconds. Do you recommend upgrading to v1.18.3 to see if that helps? We haven't tried v1.18.3 yet.
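For readers unfamiliar with the message quoted above, here is a minimal sketch of how a bounded buffer ends up reporting dropped metrics. It assumes that a full buffer discards its oldest entries to make room for new ones. The `boundedBuffer` type and its fields are hypothetical names made up for illustration, not Telegraf's actual implementation.

```go
package main

import "fmt"

// boundedBuffer is a hypothetical fixed-capacity FIFO used for illustration.
// It is NOT Telegraf's buffer type.
type boundedBuffer struct {
	items   []string // buffered metrics, oldest first
	limit   int      // plays the role of a buffer limit setting
	dropped int      // how many metrics were discarded because the buffer was full
}

// add appends a metric. If the buffer is already full, the oldest metric
// is discarded first and counted as dropped (assumed overflow policy).
func (b *boundedBuffer) add(m string) {
	if len(b.items) == b.limit {
		b.items = b.items[1:]
		b.dropped++
	}
	b.items = append(b.items, m)
}

func main() {
	b := &boundedBuffer{limit: 3}
	for i := 1; i <= 5; i++ {
		b.add(fmt.Sprintf("m%d", i))
	}
	fmt.Printf("dropped=%d buffered=%v\n", b.dropped, b.items)
}
```

In this toy model, raising the limit only delays the first drop. If metrics keep arriving faster than they are written out, the buffer eventually fills regardless of its size, which would be consistent with the larger limit not helping.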
Those things usually only happen if an output cannot flush the metrics to the sink. Do you see any output errors in the logs? What outputs did you configure? Let's first see if we can track down the problem with v1.19 and if nothing helps we try v1.18 as a last resort. ;-)
I have also seen "overflow"/dropped-metrics log messages. Only one server isn't flushing its buffers.
We are seeing the same issue on 1.19.2. We have had some success by setting the flush_interval less than the interval, i.e.
There used to be a warning about that, but it seems to be obsolete.
Based on @whyjp's debug logs, I'm guessing we have the same issue, and reducing the flush interval works because, when we had flush_interval at 10s, our metric_batch_size of 10000 was governing the writes and couldn't keep up with the input. Reducing the flush interval means that we are probably now flushing all metrics every 5s instead of flushing metric_batch_size metrics as fast as we can. So, upping the metric_batch_size might be another way to go.
@arjunkchr, @whyjp, @dindurthy: I currently cannot reproduce your setup without further information...
Now some other things have changed on that server (not related to telegraf).
We are seeing a similar issue where the buffer isn't being flushed properly after upgrading to telegraf 1.19.1 when there is a high metric throughput relative to |
@gnjack ok we now have
However, this is a huuuuge chunk of code, so it would be interesting to see if the last 1.18 release works for you...
I think the buffer flush behaviour under heavy load (full batches always available before the flush interval passes) would be fixed by changing this line from Line 811 in 95ef674
This would mean an early flush would flush all available batches, instead of just a single batch, just like the other flush triggers (interval / shutdown).
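To illustrate the difference described above, here is a toy simulation (not Telegraf's code) that compares writing a single batch per "batch ready" trigger with writing every full batch. The function `simulate`, its parameters, and the one-trigger-per-tick model are assumptions made for this sketch. It also ignores the interval and shutdown flushes entirely.

```go
package main

import "fmt"

// simulate is a hypothetical model: every tick, perTick metrics arrive in a
// buffer capped at limit, and then one "batch ready" trigger fires if at least
// one full batch is buffered. With flushAll == false the trigger writes a
// single batch; with flushAll == true it writes every full batch.
// It returns how many metrics were dropped because the buffer was full.
func simulate(ticks, perTick, batchSize, limit int, flushAll bool) (dropped int) {
	buffered := 0
	for i := 0; i < ticks; i++ {
		buffered += perTick
		if buffered > limit {
			// Overflow: everything above the limit is lost.
			dropped += buffered - limit
			buffered = limit
		}
		if buffered >= batchSize {
			if flushAll {
				buffered %= batchSize // write all full batches, keep the remainder
			} else {
				buffered -= batchSize // write exactly one batch
			}
		}
	}
	return dropped
}

func main() {
	// Assumed load: 2500 metrics per tick, batch size 1000, buffer limit 20000.
	fmt.Println("single batch per trigger, dropped:", simulate(20, 2500, 1000, 20000, false))
	fmt.Println("all batches per trigger,  dropped:", simulate(20, 2500, 1000, 20000, true))
}
```

Under these assumptions, the single-batch variant accumulates a backlog whenever more than one batch arrives per trigger and eventually overflows, while the flush-all variant never holds more than one partial batch after a trigger.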
Solved by #9800
Recently, we upgraded telegraf from version 1.17.0 to 1.19.1 and increased the buffer limit for telegraf.
Telegraf has the wavefront output plugin configured.
On v1.17.0, we do not see "Metric buffer overflow" and the metrics are reported to Wavefront correctly.
On v1.19.1, the metrics are not showing up in Wavefront correctly and we continue to see "Metric buffer overflow" being reported by telegraf.