-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(agent): add flushBatch method #11615
Conversation
When Telegraf processes a very large amount of data, such that the batch size is constantly hit, then Telegraf may be sending batches constantly. At the same time, the flush interval could be hit. During the sending of a batch, the ticker for the flush interval can go to zero, and as a result, the batch will produce a warning about not sending within the interval time. Sending batches should not send messages to the user that the interval expired as they are sepearte paths. Previously, influxdata#8953 attempted to reset the ticker during the sending of a batch. This introduced an issue where the buffer was never fully flushed. fixes: influxdata#8941
First, to ensure we do not re-introduce the previous buffer never flushing issue (i.e. #9799, #9726, #9531) Use the following script to generate consistent metrics and telegraf configuration: #!/bin/bash
# Echo out 1000 line format metrics with nano-timestamp precision
for (( i=0; i<1000; ++i)); do
echo "metrics data=$i $(date +%s%N)"
done [agent]
precision = "1ns"
interval = "1s"
flush_interval = "10s"
metric_batch_size = 1500
metric_buffer_limit = 50000
round_interval = true
debug = true
quiet = false
omit_hostname = true
[[inputs.exec]]
commands = [
"bash /home/$USER/echo_metrics.sh",
]
data_format = "influx"
[[outputs.influxdb_v2]]
urls = ["http://localhost:8086"]
token = "my-token"
organization = "my-org"
bucket = "my-bucket"
2022-08-04T16:23:02Z I! Starting Telegraf 1.24.0-6cc8c5d5
2022-08-04T16:23:02Z I! Loaded inputs: exec
2022-08-04T16:23:02Z I! Loaded aggregators:
2022-08-04T16:23:02Z I! Loaded processors:
2022-08-04T16:23:02Z I! Loaded outputs: influxdb_v2
2022-08-04T16:23:02Z I! Tags enabled:
2022-08-04T16:23:02Z I! [agent] Config: Interval:1s, Quiet:false, Hostname:"", Flush Interval:10s
2022-08-04T16:23:02Z D! [agent] Initializing plugins
2022-08-04T16:23:02Z D! [agent] Connecting outputs
2022-08-04T16:23:02Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-08-04T16:23:02Z D! [agent] Successfully connected to outputs.influxdb_v2
2022-08-04T16:23:02Z D! [agent] Starting service inputs
2022-08-04T16:23:04Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 33.64747ms
2022-08-04T16:23:04Z D! [outputs.influxdb_v2] Buffer fullness: 500 / 50000 metrics
2022-08-04T16:23:05Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.057059ms
2022-08-04T16:23:05Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-08-04T16:23:07Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 10.066429ms
2022-08-04T16:23:07Z D! [outputs.influxdb_v2] Buffer fullness: 500 / 50000 metrics
2022-08-04T16:23:08Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 10.174138ms
2022-08-04T16:23:08Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-08-04T16:23:10Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:10Z D! [outputs.influxdb_v2] Buffer fullness: 2000 / 50000 metrics
2022-08-04T16:23:10Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:11Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:11Z D! [outputs.influxdb_v2] Buffer fullness: 3000 / 50000 metrics
2022-08-04T16:23:11Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:12Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:12Z D! [outputs.influxdb_v2] Buffer fullness: 3000 / 50000 metrics
2022-08-04T16:23:12Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:13Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:13Z D! [outputs.influxdb_v2] Buffer fullness: 5000 / 50000 metrics
2022-08-04T16:23:13Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:14Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:14Z D! [outputs.influxdb_v2] Buffer fullness: 6000 / 50000 metrics
2022-08-04T16:23:14Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:16Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 32.681509ms
2022-08-04T16:23:16Z D! [outputs.influxdb_v2] Buffer fullness: 6500 / 50000 metrics
2022-08-04T16:23:17Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.216695ms
2022-08-04T16:23:17Z D! [outputs.influxdb_v2] Buffer fullness: 6000 / 50000 metrics
2022-08-04T16:23:19Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.275306ms
2022-08-04T16:23:19Z D! [outputs.influxdb_v2] Buffer fullness: 6500 / 50000 metrics
2022-08-04T16:23:20Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.042065ms
2022-08-04T16:23:20Z D! [outputs.influxdb_v2] Buffer fullness: 6000 / 50000 metrics
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.203356ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 4.973343ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 5.513991ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 5.067002ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1000 metrics in 4.012874ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-08-04T16:23:23Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.249476ms
2022-08-04T16:23:23Z D! [outputs.influxdb_v2] Buffer fullness: 500 / 50000 metrics
2022-08-04T16:23:24Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 10.042724ms
2022-08-04T16:23:24Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics |
A second test case is to verify that batches are not printing out any messages about the ticker expiring during a flush. Using the following Telegraf config, which without this fix will produce the message every couple flush intervals or more: [agent]
metric_batch_size = 120000
metric_buffer_limit = 1200000
flush_interval = "3s"
quiet = false
debug = false
[[outputs.file]]
files = ["/tmp/metrics.out"]
data_format = "influx"
rotation_max_size = "1GB"
[[inputs.execd]]
command = ["/bin/bash","-c","while true ; do echo 'test,bu=metrics value=1' ; done"]
data_format = "influx" Without this PR: 2022-08-04T16:13:31Z I! Starting Telegraf 1.24.0-c55ab71a
2022-08-04T16:13:31Z I! Loaded inputs: execd
2022-08-04T16:13:31Z I! Loaded aggregators:
2022-08-04T16:13:31Z I! Loaded processors:
2022-08-04T16:13:31Z I! Loaded outputs: file
2022-08-04T16:13:31Z I! Tags enabled: host=ryzen
2022-08-04T16:13:31Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ryzen", Flush Interval:3s
2022-08-04T16:13:31Z I! [inputs.execd] Starting process: /bin/bash [-c while true ; do echo 'test,bu=metrics value=1' ; done]
2022-08-04T16:13:34Z W! [agent] ["outputs.file"] did not complete within its flush interval
2022-08-04T16:13:37Z W! [agent] ["outputs.file"] did not complete within its flush interval
2022-08-04T16:13:40Z W! [agent] ["outputs.file"] did not complete within its flush interval
2022-08-04T16:13:43Z W! [agent] ["outputs.file"] did not complete within its flush interval With this PR: 2022-08-04T16:14:16Z I! Starting Telegraf 1.24.0-6cc8c5d5
2022-08-04T16:14:16Z I! Loaded inputs: execd
2022-08-04T16:14:16Z I! Loaded aggregators:
2022-08-04T16:14:16Z I! Loaded processors:
2022-08-04T16:14:16Z I! Loaded outputs: file
2022-08-04T16:14:16Z I! Tags enabled: host=ryzen
2022-08-04T16:14:16Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ryzen", Flush Interval:3s
2022-08-04T16:14:16Z I! [inputs.execd] Starting process: /bin/bash [-c while true ; do echo 'test,bu=metrics value=1' ; done]
<no further output for 40+ mins> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Download PR build artifacts for linux_amd64.tar.gz, darwin_amd64.tar.gz, and windows_amd64.zip. 🥳 This pull request decreases the Telegraf binary size by -1.17 % for linux amd64 (new size: 148.4 MB, nightly size 150.2 MB) 📦 Click here to get additional PR build artifactsArtifact URLs |
When Telegraf processes a very large amount of data, such that the
batch size is constantly hit, then Telegraf may be sending batches
constantly. At the same time, the flush interval could be hit. During
the sending of a batch, the ticker for the flush interval can go to
zero, and as a result, the batch will produce a warning about not
sending within the interval time.
Sending batches should not send messages to the user that the interval
expired as they are separate paths.
Previously, #8953 attempted to reset the ticker while sending a
batch. This introduced an issue where the buffer was never fully
flushed.
fixes: #8941