Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(agent): add flushBatch method #11615

Merged
merged 3 commits into from
Aug 17, 2022
Merged

fix(agent): add flushBatch method #11615

merged 3 commits into from
Aug 17, 2022

Conversation

powersj
Copy link
Contributor

@powersj powersj commented Aug 3, 2022

When Telegraf processes a very large amount of data, such that the
batch size is constantly hit, then Telegraf may be sending batches
constantly. At the same time, the flush interval could be hit. During
the sending of a batch, the ticker for the flush interval can go to
zero, and as a result, the batch will produce a warning about not
sending within the interval time.

Sending batches should not send messages to the user that the interval
expired as they are separate paths.

Previously, #8953 attempted to reset the ticker while sending a
batch. This introduced an issue where the buffer was never fully
flushed.

fixes: #8941

When Telegraf processes a very large amount of data, such that the
batch size is constantly hit, then Telegraf may be sending batches
constantly. At the same time, the flush interval could be hit. During
the sending of a batch, the ticker for the flush interval can go to
zero, and as a result, the batch will produce a warning about not
sending within the interval time.

Sending batches should not send messages to the user that the interval
expired as they are sepearte paths.

Previously, influxdata#8953 attempted to reset the ticker during the sending of a
batch. This introduced an issue where the buffer was never fully
flushed.

fixes: influxdata#8941
@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Aug 3, 2022
@powersj
Copy link
Contributor Author

powersj commented Aug 4, 2022

First, to ensure we do not re-introduce the previous buffer never flushing issue (i.e. #9799, #9726, #9531)

Use the following script to generate consistent metrics and telegraf configuration:

#!/bin/bash
# Echo out 1000 line format metrics with nano-timestamp precision

for (( i=0; i<1000; ++i)); do
    echo "metrics data=$i $(date +%s%N)"
done
[agent]
  precision = "1ns"
  interval = "1s"
  flush_interval = "10s"
  metric_batch_size = 1500
  metric_buffer_limit = 50000
  round_interval = true
  debug = true
  quiet = false
  omit_hostname = true

[[inputs.exec]]
  commands = [
    "bash /home/$USER/echo_metrics.sh",
  ]
  data_format = "influx"

[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "my-token"
  organization = "my-org"
  bucket = "my-bucket"
  1. After letting Telegraf run for 5-10 seconds, kill the connection to InfluxDB (e.g. stopping/starting the database (e.g. sudo systemctl stop influxdb; sleep 1; sudo systemctl start influxdb)
  2. Observe the buffer fullness increase to ~10,000-15,000
  3. Once InfluxDB is back up, observe if the buffer value ever decreases
2022-08-04T16:23:02Z I! Starting Telegraf 1.24.0-6cc8c5d5
2022-08-04T16:23:02Z I! Loaded inputs: exec
2022-08-04T16:23:02Z I! Loaded aggregators: 
2022-08-04T16:23:02Z I! Loaded processors: 
2022-08-04T16:23:02Z I! Loaded outputs: influxdb_v2
2022-08-04T16:23:02Z I! Tags enabled: 
2022-08-04T16:23:02Z I! [agent] Config: Interval:1s, Quiet:false, Hostname:"", Flush Interval:10s
2022-08-04T16:23:02Z D! [agent] Initializing plugins
2022-08-04T16:23:02Z D! [agent] Connecting outputs
2022-08-04T16:23:02Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-08-04T16:23:02Z D! [agent] Successfully connected to outputs.influxdb_v2
2022-08-04T16:23:02Z D! [agent] Starting service inputs
2022-08-04T16:23:04Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 33.64747ms
2022-08-04T16:23:04Z D! [outputs.influxdb_v2] Buffer fullness: 500 / 50000 metrics
2022-08-04T16:23:05Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.057059ms
2022-08-04T16:23:05Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-08-04T16:23:07Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 10.066429ms
2022-08-04T16:23:07Z D! [outputs.influxdb_v2] Buffer fullness: 500 / 50000 metrics
2022-08-04T16:23:08Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 10.174138ms
2022-08-04T16:23:08Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-08-04T16:23:10Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:10Z D! [outputs.influxdb_v2] Buffer fullness: 2000 / 50000 metrics
2022-08-04T16:23:10Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:11Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:11Z D! [outputs.influxdb_v2] Buffer fullness: 3000 / 50000 metrics
2022-08-04T16:23:11Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:12Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:12Z D! [outputs.influxdb_v2] Buffer fullness: 3000 / 50000 metrics
2022-08-04T16:23:12Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:13Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:13Z D! [outputs.influxdb_v2] Buffer fullness: 5000 / 50000 metrics
2022-08-04T16:23:13Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:14Z E! [outputs.influxdb_v2] When writing to [http://localhost:8086]: Post "http://localhost:8086/api/v2/write?bucket=my-bucket&org=my-org": dial tcp [::1]:8086: connect: connection refused
2022-08-04T16:23:14Z D! [outputs.influxdb_v2] Buffer fullness: 6000 / 50000 metrics
2022-08-04T16:23:14Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2022-08-04T16:23:16Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 32.681509ms
2022-08-04T16:23:16Z D! [outputs.influxdb_v2] Buffer fullness: 6500 / 50000 metrics
2022-08-04T16:23:17Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.216695ms
2022-08-04T16:23:17Z D! [outputs.influxdb_v2] Buffer fullness: 6000 / 50000 metrics
2022-08-04T16:23:19Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.275306ms
2022-08-04T16:23:19Z D! [outputs.influxdb_v2] Buffer fullness: 6500 / 50000 metrics
2022-08-04T16:23:20Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.042065ms
2022-08-04T16:23:20Z D! [outputs.influxdb_v2] Buffer fullness: 6000 / 50000 metrics
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.203356ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 4.973343ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 5.513991ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 5.067002ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Wrote batch of 1000 metrics in 4.012874ms
2022-08-04T16:23:22Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-08-04T16:23:23Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 9.249476ms
2022-08-04T16:23:23Z D! [outputs.influxdb_v2] Buffer fullness: 500 / 50000 metrics
2022-08-04T16:23:24Z D! [outputs.influxdb_v2] Wrote batch of 1500 metrics in 10.042724ms
2022-08-04T16:23:24Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics

@powersj
Copy link
Contributor Author

powersj commented Aug 4, 2022

A second test case is to verify that batches are not printing out any messages about the ticker expiring during a flush.

Using the following Telegraf config, which without this fix will produce the message every couple flush intervals or more:

[agent]
  metric_batch_size = 120000
  metric_buffer_limit = 1200000
  flush_interval = "3s"
  quiet = false
  debug = false

[[outputs.file]]
  files = ["/tmp/metrics.out"]
  data_format = "influx"
  rotation_max_size = "1GB"

[[inputs.execd]]
  command = ["/bin/bash","-c","while true ; do echo 'test,bu=metrics value=1' ; done"]
  data_format = "influx"

Without this PR:

2022-08-04T16:13:31Z I! Starting Telegraf 1.24.0-c55ab71a
2022-08-04T16:13:31Z I! Loaded inputs: execd
2022-08-04T16:13:31Z I! Loaded aggregators: 
2022-08-04T16:13:31Z I! Loaded processors: 
2022-08-04T16:13:31Z I! Loaded outputs: file
2022-08-04T16:13:31Z I! Tags enabled: host=ryzen
2022-08-04T16:13:31Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ryzen", Flush Interval:3s
2022-08-04T16:13:31Z I! [inputs.execd] Starting process: /bin/bash [-c while true ; do echo 'test,bu=metrics value=1' ; done]
2022-08-04T16:13:34Z W! [agent] ["outputs.file"] did not complete within its flush interval
2022-08-04T16:13:37Z W! [agent] ["outputs.file"] did not complete within its flush interval
2022-08-04T16:13:40Z W! [agent] ["outputs.file"] did not complete within its flush interval
2022-08-04T16:13:43Z W! [agent] ["outputs.file"] did not complete within its flush interval

With this PR:

2022-08-04T16:14:16Z I! Starting Telegraf 1.24.0-6cc8c5d5
2022-08-04T16:14:16Z I! Loaded inputs: execd
2022-08-04T16:14:16Z I! Loaded aggregators: 
2022-08-04T16:14:16Z I! Loaded processors: 
2022-08-04T16:14:16Z I! Loaded outputs: file
2022-08-04T16:14:16Z I! Tags enabled: host=ryzen
2022-08-04T16:14:16Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ryzen", Flush Interval:3s
2022-08-04T16:14:16Z I! [inputs.execd] Starting process: /bin/bash [-c while true ; do echo 'test,bu=metrics value=1' ; done]
<no further output for 40+ mins>

@powersj powersj marked this pull request as ready for review August 9, 2022 15:06
@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Aug 11, 2022
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@telegraf-tiger
Copy link
Contributor

Download PR build artifacts for linux_amd64.tar.gz, darwin_amd64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

🥳 This pull request decreases the Telegraf binary size by -1.17 % for linux amd64 (new size: 148.4 MB, nightly size 150.2 MB)

📦 Click here to get additional PR build artifacts

Artifact URLs

DEB RPM TAR GZ ZIP
amd64.deb aarch64.rpm darwin_amd64.tar.gz windows_amd64.zip
arm64.deb armel.rpm darwin_arm64.tar.gz windows_i386.zip
armel.deb armv6hl.rpm freebsd_amd64.tar.gz
armhf.deb i386.rpm freebsd_armv7.tar.gz
i386.deb ppc64le.rpm freebsd_i386.tar.gz
mips.deb riscv64.rpm linux_amd64.tar.gz
mipsel.deb s390x.rpm linux_arm64.tar.gz
ppc64el.deb x86_64.rpm linux_armel.tar.gz
riscv64.deb linux_armhf.tar.gz
s390x.deb linux_i386.tar.gz
linux_mips.tar.gz
linux_mipsel.tar.gz
linux_ppc64le.tar.gz
linux_riscv64.tar.gz
linux_s390x.tar.gz
static_linux_amd64.tar.gz

@powersj powersj merged commit 0481a78 into influxdata:master Aug 17, 2022
@powersj powersj deleted the fix/8941 branch August 17, 2022 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix pr to fix corresponding bug ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

did not complete within its flush interval
3 participants