"Metric buffer overflow" with telegraf 1.19.1, not on version 1.17.0. #9531

Closed
arjunkchr opened this issue Jul 22, 2021 · 14 comments · Fixed by #9800
Labels: area/agent, bug

@arjunkchr

Recently, we upgraded telegraf from version 1.17.0 to 1.19.1 and increased the metric buffer limit for telegraf.

Telegraf has wavefront output plugin configured.

On v1.17.0, we do not see "Metric buffer overflow" and the metrics are reported to wavefront correctly.

On v1.19.1, the metrics are not showing up in Wavefront correctly, and we continue to see "Metric buffer overflow" being reported by telegraf.

@srebhan
Member

srebhan commented Jul 23, 2021

Can you please post the exact error message you see on v1.19.1? Furthermore, does your setup work with v1.18.3?

@srebhan srebhan self-assigned this Jul 23, 2021
@arjunkchr
Author

We get messages like these "Metric buffer overflow; 2584 metrics have been dropped".

We tried increasing the buffer limit to 20000 and even that didn't help. The batch size is currently set to 1000 and the interval is set to 10 seconds.

Do you recommend upgrading to v1.18.3 to see if that helps? Haven't tried with v1.18.3 yet.
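
For reference, the [agent] settings described above correspond to roughly the following telegraf.conf snippet (only the three values mentioned here are from our setup; everything else is left at its default):

[agent]
  interval = "10s"
  metric_batch_size = 1000
  metric_buffer_limit = 20000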

@srebhan
Member

srebhan commented Jul 23, 2021

Those things usually only happen if an output cannot flush the metrics to the sink. Do you see any output errors in the logs? What outputs did you configure?

Let's first see if we can track down the problem with v1.19, and if nothing helps we can try v1.18 as a last resort. ;-)

@whyjp

whyjp commented Aug 9, 2021

I have also seen the overflow / dropped-metrics logs.

Only on one server; that server isn't flushing its buffers.

2021-08-06T10:49:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 503.3862ms
2021-08-06T10:49:00+02:00 D! [outputs.influxdb] Buffer fullness: 3836 / 10000 metrics
2021-08-06T10:49:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 492.1919ms
2021-08-06T10:49:01+02:00 D! [outputs.influxdb] Buffer fullness: 2836 / 10000 metrics
2021-08-06T10:50:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 470.3776ms
2021-08-06T10:50:00+02:00 D! [outputs.influxdb] Buffer fullness: 6680 / 10000 metrics
2021-08-06T10:50:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 454.9001ms
2021-08-06T10:50:00+02:00 D! [outputs.influxdb] Buffer fullness: 5680 / 10000 metrics
2021-08-06T10:51:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 766.4008ms
2021-08-06T10:51:00+02:00 D! [outputs.influxdb] Buffer fullness: 9522 / 10000 metrics
2021-08-06T10:51:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 501.5634ms
2021-08-06T10:51:01+02:00 D! [outputs.influxdb] Buffer fullness: 8522 / 10000 metrics
2021-08-06T10:52:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 522.7729ms
2021-08-06T10:52:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:52:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2363 metrics have been dropped
2021-08-06T10:52:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 752.8037ms
2021-08-06T10:52:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:53:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 515.9004ms
2021-08-06T10:53:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:53:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2841 metrics have been dropped
2021-08-06T10:53:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 738.9494ms
2021-08-06T10:53:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:54:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 530.0422ms
2021-08-06T10:54:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:54:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2840 metrics have been dropped
2021-08-06T10:54:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 774.6095ms
2021-08-06T10:54:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:55:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 503.1104ms
2021-08-06T10:55:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:55:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2841 metrics have been dropped
2021-08-06T10:55:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 727.7792ms
2021-08-06T10:55:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:56:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 549.9128ms
2021-08-06T10:56:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2021-08-06T10:56:00+02:00 W! [outputs.influxdb] Metric buffer overflow; 2841 metrics have been dropped
2021-08-06T10:56:01+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 740.3805ms
2021-08-06T10:56:01+02:00 D! [outputs.influxdb] Buffer fullness: 9000 / 10000 metrics
2021-08-06T10:57:00+02:00 D! [outputs.influxdb] Wrote batch of 1000 metrics in 511.0906ms
2021-08-06T10:57:00+02:00 D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics

The buffer is not being flushed down, so it overflows once it is full.


@dindurthy

We are seeing the same issue on 1.19.2. We have had some success by setting flush_interval to less than interval, e.g.

interval = "10s"
flush_interval = "5s"

There used to be a warning about doing that, but it seems to be obsolete now.

Based on @whyjp's debug logs, I'm guessing we have the same issue. Reducing the flush interval works for us because, with flush_interval at 10s, our 10000 metric_batch_size was governing the writes and couldn't keep up with the input. With the flush interval at 5s we are probably now flushing all metrics every 5s instead of flushing one metric_batch_size at a time as fast as we can. So upping metric_batch_size might be another way to go.
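
For reference, the agent settings we are running with now look roughly like this (the exact values are just our setup, not a recommendation):

[agent]
  interval = "10s"
  flush_interval = "5s"
  metric_batch_size = 10000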

@srebhan
Member

srebhan commented Sep 6, 2021

@arjunkchr, @whyjp, @dindurthy :
Can anyone please try to find out the latest working version (and the first erroneous version), so we can narrow down which change causes the issue?

I currently cannot reproduce your setup without further information...

@whyjp

whyjp commented Sep 6, 2021

> @arjunkchr, @whyjp, @dindurthy :
> Can anyone please try to find out the latest working version (and the first erroneous version), so we can narrow down which change causes the issue?
>
> I currently cannot reproduce your setup without further information...

Some other things have changed on that server since then (not related to telegraf), and the logs from after the event occurred are no longer available, so I cannot track it any further.
Unfortunately, I would like to find the cause and fix it, but I can't.

@gnjack
Contributor

gnjack commented Sep 6, 2021

We are seeing a similar issue after upgrading to telegraf 1.19.1: the buffer isn't being flushed properly when there is high metric throughput relative to metric_batch_size / flush_interval. It might be the same cause:
#9726

@srebhan
Member

srebhan commented Sep 9, 2021

@gnjack ok we now have

  • last known good 1.17.0
  • first known bad 1.19.1

However, this is a huuuuge chunk of code, so it would be interesting to know whether the latest 1.18 release works for you...

@gnjack
Contributor

gnjack commented Sep 9, 2021

We've been trying out 1.18.3 for the past day and it's looking good so far - no buffers staying full:
[screenshot: buffer fullness metrics after upgrading to 1.18.3, no buffers staying full]

I'll be rolling 1.18.3 out to some more environments today, I'll update if we see any issues.

@gnjack
Contributor

gnjack commented Sep 9, 2021

I think the buffer flush behaviour under heavy load (full batches always available before the flush interval passes) would be fixed by changing this line from output.WriteBatch to output.Write.

logError(a.flushOnce(output, ticker, output.WriteBatch))

This would mean an early flush would flush all available batches, instead of just a single batch, just like the other flush triggers (interval / shutdown).
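
To make the reasoning concrete, here is a minimal, self-contained toy model (not telegraf's actual agent code; the buffer type, method names, and the per-interval arrival rate are assumptions for this sketch). It shows why flushing only one batch per trigger lets the backlog grow once metrics arrive faster than one metric_batch_size per flush interval, while draining all full batches keeps up:

package main

import "fmt"

// Toy model of an output's metric buffer with metric_batch_size = 1000.
const batchSize = 1000

type buffer struct {
    metrics int // metrics currently held in the buffer
}

// writeOneBatch flushes at most a single batch per trigger,
// mirroring the single-batch early flush described above.
func (b *buffer) writeOneBatch() {
    if b.metrics >= batchSize {
        b.metrics -= batchSize
    }
}

// drainFullBatches keeps flushing until less than one full batch remains,
// mirroring what flushing all available batches would amount to.
func (b *buffer) drainFullBatches() {
    for b.metrics >= batchSize {
        b.metrics -= batchSize
    }
}

func main() {
    // Assume roughly 4800 metrics arrive per flush interval, about the
    // arrival rate implied by the debug logs earlier in this issue.
    const arrivalsPerInterval = 4800
    single, drain := &buffer{}, &buffer{}
    for i := 1; i <= 10; i++ {
        single.metrics += arrivalsPerInterval
        drain.metrics += arrivalsPerInterval
        single.writeOneBatch()   // only 1000 metrics leave, backlog grows by 3800 per interval
        drain.drainFullBatches() // all full batches leave, backlog stays below 1000
        fmt.Printf("interval %2d: single-batch backlog=%6d  drain-all backlog=%4d\n",
            i, single.metrics, drain.metrics)
    }
    // In the real agent the single-batch backlog would eventually hit
    // metric_buffer_limit, at which point newly arrived metrics are dropped --
    // the "Metric buffer overflow" warning shown above.
}

With the drain-all behaviour the backlog never exceeds one batch, which also matches why reducing flush_interval or raising metric_batch_size works around the problem.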

@Hipska
Contributor

Hipska commented Sep 23, 2021

@gnjack that's correct and has also been confirmed by @powersj; he will provide a patch soon.

@Hipska Hipska added the area/agent and bug labels and removed the area/wavefront label Sep 23, 2021
@powersj
Contributor

powersj commented Sep 23, 2021

@gnjack if possible, could you attempt to use the Telegraf artifacts in my PR?

@Hipska Hipska linked a pull request Oct 1, 2021 that will close this issue
@Hipska
Contributor

Hipska commented Oct 1, 2021

Solved by #9800
