"Metric buffer overflow" with telegraf 1.19.1, not on version 1.17.0. #9531

Closed
arjunkchr opened this issue Jul 22, 2021 · 14 comments · Fixed by #9800
Labels: area/agent, bug

@arjunkchr

Recently, we upgraded telegraf from version 1.17.0 to 1.19.1 and increased the metric buffer limit for telegraf.

Telegraf has wavefront output plugin configured.

On v1.17.0, we do not see "Metric buffer overflow" and the metrics are reported to wavefront correctly.

On v1.19.1, the metrics are not showing up in Wavefront correctly, and we continue to see "Metric buffer overflow" being reported by telegraf.

@srebhan
Member

srebhan commented Jul 23, 2021

Can you please post the exact error message you see on v1.19.1? Furthermore, does your setup work with v1.18.3?

@srebhan srebhan self-assigned this Jul 23, 2021
@arjunkchr
Author

We get messages like these "Metric buffer overflow; 2584 metrics have been dropped".

We tried increasing the buffer limit to 20000 and even that didn't help. The batch size is currently set to 1000 and the interval is set to 10 seconds.

Do you recommend upgrading to v1.18.3 to see if that helps? Haven't tried with v1.18.3 yet.
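
For reference, the [agent] settings described above correspond to roughly the following telegraf.conf snippet (only the three values mentioned here are from our setup; everything else is left at its default):

[agent]
  interval = "10s"
  metric_batch_size = 1000
  metric_buffer_limit = 20000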

@srebhan
Member

srebhan commented Jul 23, 2021

Those things usually only happen if an output cannot flush the metrics to the sink. Do you see any output errors in the logs? What outputs did you configure?

Let's first see if we can track down the problem with v1.19, and if nothing helps we can try v1.18 as a last resort. ;-)

@whyjp

whyjp commented Aug 9, 2021

I have also seen the overflow / dropped-metrics logs.

Only on one server; that server isn't flushing its buffers.

2021-08-06T10:49:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 503.3862ms
2021-08-06T10:49:00+02:00 D! [outputs.influxdb] Buffer fullness: 3836 / 10000 metrics
2021-08-06T10:49:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 492.1919ms
2021-08-06T10:49:01+02:00 D! [outputs.influxdb] Buffer fullness: 2836 / 10000 metrics
2021-08-06T10:50:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 470.3776ms
2021-08-06T10:50:00+02:00 D! [outputs.influxdb] Buffer fullness: 6680 / 10000 metrics
2021-08-06T10:50:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 454.9001ms
2021-08-06T10:50:00+02:00 D! [outputs.influxdb] Buffer fullness: 5680 / 10000 metrics
2021-08-06T10:51:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 766.4008ms
2021-08-06T10:51:00+02:00 D! [outputs.influxdb] Buffer fullness: 9522 / 10000 metrics
2021-08-06T10:51:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 501.5634ms
2021-08-06T10:51:01+02:00 D! [outputs.influxdb] Buffer fullness: 8522 / 10000 metrics
2021-08-06T10:52:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 522.7729ms
2021-08-06T10:52:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:52:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2363 metrics have been dropped
2021-08-06T10:52:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 752.8037ms
2021-08-06T10:52:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:53:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 515.9004ms
2021-08-06T10:53:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:53:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2841 metrics have been dropped
2021-08-06T10:53:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 738.9494ms
2021-08-06T10:53:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:54:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 530.0422ms
2021-08-06T10:54:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:54:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2840 metrics have been dropped
2021-08-06T10:54:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 774.6095ms
2021-08-06T10:54:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:55:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 503.1104ms
2021-08-06T10:55:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:55:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2841 metrics have been dropped
2021-08-06T10:55:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 727.7792ms
2021-08-06T10:55:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:56:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 549.9128ms
2021-08-06T10:56:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:56:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2841 metrics have been dropped
2021-08-06T10:56:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 740.3805ms
2021-08-06T10:56:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:57:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 511.0906ms
2021-08-06T10:57:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics

The buffer is not being flushed down, so it overflows once it is full.


@dindurthy

We are seeing the same issue on 1.19.2. We have had some success by setting flush_interval to less than interval, e.g.

interval = "10s"
flush_interval = "5s"

There used to be a warning about doing that, but it seems to be obsolete now.

Based on @whyjp's debug logs, I'm guessing we have the same issue. Reducing the flush interval works for us because, with flush_interval at 10s, our 10000 metric_batch_size was governing the writes and couldn't keep up with the input. With the flush interval at 5s we are probably now flushing all metrics every 5s instead of flushing one metric_batch_size at a time as fast as we can. So upping metric_batch_size might be another way to go.
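
For reference, the agent settings we are running with now look roughly like this (the exact values are just our setup, not a recommendation):

[agent]
  interval = "10s"
  flush_interval = "5s"
  metric_batch_size = 10000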

@srebhan
Member

srebhan commented Sep 6, 2021

@arjunkchr, @whyjp, @dindurthy :
Can anyone please try to find out the latest working version (and the first erroneous version), so we can narrow down which change causes the issue?

I currently cannot reproduce your setup without further information...

@whyjp

whyjp commented Sep 6, 2021

> @arjunkchr, @whyjp, @dindurthy :
> Can anyone please try to find out the latest working version (and the first erroneous version), so we can narrow down which change causes the issue?
>
> I currently cannot reproduce your setup without further information...

Some other things have changed on that server since then (not related to telegraf), and the logs from after the event occurred are no longer available, so I cannot track it any further.
Unfortunately, I would like to find the cause and fix it, but I can't.

@gnjack
Contributor

gnjack commented Sep 6, 2021

We are seeing a similar issue after upgrading to telegraf 1.19.1: the buffer isn't being flushed properly when there is high metric throughput relative to metric_batch_size / flush_interval. It might be the same cause:
#9726

@srebhan
Member

srebhan commented Sep 9, 2021

@gnjack ok we now have

  • last known good 1.17.0
  • first known bad 1.19.1

However, this is a huuuuge chunk of code, so it would be interesting to know whether the latest 1.18 release works for you...

@gnjack
Contributor

gnjack commented Sep 9, 2021

We've been trying out 1.18.3 for the past day and it's looking good so far - no buffers staying full:
[screenshot: buffer fullness metrics after upgrading to 1.18.3, no buffers staying full]

I'll be rolling 1.18.3 out to some more environments today, I'll update if we see any issues.

@gnjack
Contributor

gnjack commented Sep 9, 2021

I think the buffer flush behaviour under heavy load (full batches always available before the flush interval passes) would be fixed by changing this line from output.WriteBatch to output.Write.

logError(a.flushOnce(output, ticker, output.WriteBatch))

This would mean an early flush would flush all available batches, instead of just a single batch, just like the other flush triggers (interval / shutdown).
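
To make the reasoning concrete, here is a minimal, self-contained toy model (not telegraf's actual agent code; the buffer type, method names, and the per-interval arrival rate are assumptions for this sketch). It shows why flushing only one batch per trigger lets the backlog grow once metrics arrive faster than one metric_batch_size per flush interval, while draining all full batches keeps up:

package main

import "fmt"

// Toy model of an output's metric buffer with metric_batch_size = 1000.
const batchSize = 1000

type buffer struct {
    metrics int // metrics currently held in the buffer
}

// writeOneBatch flushes at most a single batch per trigger,
// mirroring the single-batch early flush described above.
func (b *buffer) writeOneBatch() {
    if b.metrics >= batchSize {
        b.metrics -= batchSize
    }
}

// drainFullBatches keeps flushing until less than one full batch remains,
// mirroring what flushing all available batches would amount to.
func (b *buffer) drainFullBatches() {
    for b.metrics >= batchSize {
        b.metrics -= batchSize
    }
}

func main() {
    // Assume roughly 4800 metrics arrive per flush interval, about the
    // arrival rate implied by the debug logs earlier in this issue.
    const arrivalsPerInterval = 4800
    single, drain := &buffer{}, &buffer{}
    for i := 1; i <= 10; i++ {
        single.metrics += arrivalsPerInterval
        drain.metrics += arrivalsPerInterval
        single.writeOneBatch()   // only 1000 metrics leave, backlog grows by 3800 per interval
        drain.drainFullBatches() // all full batches leave, backlog stays below 1000
        fmt.Printf("interval %2d: single-batch backlog=%6d  drain-all backlog=%4d\n",
            i, single.metrics, drain.metrics)
    }
    // In the real agent the single-batch backlog would eventually hit
    // metric_buffer_limit, at which point newly arrived metrics are dropped --
    // the "Metric buffer overflow" warning shown above.
}

With the drain-all behaviour the backlog never exceeds one batch, which also matches why reducing flush_interval or raising metric_batch_size works around the problem.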

@Hipska
Contributor

Hipska commented Sep 23, 2021

@gnjack that's correct and has also been confirmed by @powersj; he will provide a patch soon.

@Hipska Hipska added the area/agent and bug labels and removed the area/wavefront label Sep 23, 2021
@powersj
Contributor

powersj commented Sep 23, 2021

@gnjack if possible, could you attempt to use the Telegraf artifacts in my PR?

@Hipska Hipska linked a pull request Oct 1, 2021 that will close this issue
@Hipska
Contributor

Hipska commented Oct 1, 2021

Solved by #9800
