Fix metrics reported as written but not actually written #9526

Merged
reimda merged 2 commits into master from bugfix-influxdb-listener on Jul 28, 2021

Conversation

MyaLongmire
Contributor

Required for all PRs:

  • Updated associated README.md.
  • Wrote appropriate unit tests.

resolves #9514

Reworked the code to handle more than just database-not-found errors, so it no longer reports metrics as pushed when the database goes down. Also updated the test.

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Jul 21, 2021
Member

@srebhan srebhan left a comment

Hey @MyaLongmire, thanks for the PR! There are some debug messages left in the code and I'd like to ask you to add a comment on the "strange" error handling. I guess it has to be like this for telegraf to trigger a retry, however it might be nicer to retry directly after successfully creating the database.
Assume your metric buffer is full and that's why a flush/write was triggered in the first place. In this first try you create the database and return an error (as you currently do). Now your metric buffer is still full. If new metrics arrive between this and the next retry, you will lose metrics because the buffer is full...

plugins/outputs/influxdb/http_test.go (outdated, resolved)
plugins/outputs/influxdb/influxdb.go (resolved)
plugins/outputs/influxdb/influxdb.go (outdated, resolved)
plugins/outputs/influxdb/influxdb.go (resolved)
@srebhan srebhan self-assigned this Jul 22, 2021
@srebhan srebhan added area/influxdb plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins fix pr to fix corresponding bug and removed feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin labels Jul 22, 2021
@srebhan
Member

srebhan commented Jul 23, 2021

@MyaLongmire can you comment on the potential buffer fullness issue I mentioned. It's ok for me if we do not tackle it in this PR, but if this is not just hypothetical, we need to take care of it.

@MyaLongmire
Contributor Author

> @MyaLongmire can you comment on the potential buffer fullness issue I mentioned. It's ok for me if we do not tackle it in this PR, but if this is not just hypothetical, we need to take care of it.

@srebhan in my understanding, if the buffer is full when new metrics come in, the oldest are dropped; there's not much to be done about that.
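
The drop-oldest behavior described here can be illustrated with a minimal sketch. The `Buffer` type and its methods below are hypothetical and written only for this illustration; they are not Telegraf's actual buffer implementation.

```go
package main

import "fmt"

// Buffer is a hypothetical fixed-capacity metric buffer (an assumption for
// illustration, not Telegraf's real type). When full, adding a metric
// discards the oldest one.
type Buffer struct {
	items    []string
	capacity int
	dropped  int
}

// NewBuffer returns an empty buffer holding at most capacity metrics.
func NewBuffer(capacity int) *Buffer {
	return &Buffer{capacity: capacity}
}

// Add appends a metric, dropping the oldest entry if the buffer is full.
func (b *Buffer) Add(m string) {
	if b.capacity <= 0 {
		// Nothing can be stored, so the incoming metric is dropped.
		b.dropped++
		return
	}
	if len(b.items) == b.capacity {
		b.items = b.items[1:]
		b.dropped++
	}
	b.items = append(b.items, m)
}

func main() {
	b := NewBuffer(3)
	for _, m := range []string{"m1", "m2", "m3", "m4", "m5"} {
		b.Add(m)
	}
	fmt.Println(b.items, b.dropped) // prints "[m3 m4 m5] 2"
}
```

With a capacity of three, the two oldest metrics are gone once five have arrived, which is the loss the discussion is about when a write fails and the buffer stays full.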

@srebhan
Member

srebhan commented Jul 27, 2021

@MyaLongmire imagine the case where the database does not yet exist on the InfluxDB side. Further assume the buffer runs full (e.g. due to a delayed start of the database service or whatever). In the ideal case, the output plugin would create the database and flush the metrics to this newly created database. That's what I would expect as a user.
However, from my understanding, what the plugin does is create the database in a first flush cycle without flushing the data to it, and then wait for a second flush cycle to write the data. The "data" in the second cycle comprises the "old" data from the first cycle and the new data that arrived in the meantime. If the buffer is already full (or almost full) in the first cycle, the data arriving between the first and second cycle might lead to data being dropped.

My suggestion is to change the current state

plugin.Write() // --> will create the database and return error without flushing the buffer
... gathering of data...
plugin.Write() // --> will flush the buffer containing data of the first "Write()" call and new data

to

plugin.Write() // --> will create the database and try flushing the buffer
... gathering of data...
plugin.Write() // --> will flush the buffer containing ONLY new data

@MyaLongmire MyaLongmire changed the title Bug Fix: influxdb_listener - fix metrics saying they got written but did not Bug Fix: influxdb output - fix metrics saying they got written but did not Jul 27, 2021
@ssoroka
Contributor

ssoroka commented Jul 27, 2021

I get what you're saying @srebhan , but I don't think it's very likely that the buffer is full and the database has not yet been created but will be created successfully on this call. If that is the case, it's likely the buffer was already full and dropping messages.
I'm a bit hesitant to repeat the write immediately, as the code is not very straightforward. The choice to return an error to trigger a retry was made to keep the code simple in an area where we've already had a lot of issues with code complexity.

@srebhan
Member

srebhan commented Jul 28, 2021

@ssoroka understood. It's fine with me.

Member

@srebhan srebhan left a comment

Looks good to me.

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Jul 28, 2021
@reimda reimda changed the title Bug Fix: influxdb output - fix metrics saying they got written but did not Fix metrics reported as written but not actually written Jul 28, 2021
@reimda reimda merged commit 8d2b1e8 into master Jul 28, 2021
@reimda reimda deleted the bugfix-influxdb-listener branch July 28, 2021 20:55
reimda pushed a commit that referenced this pull request Jul 28, 2021
phemmer added a commit to phemmer/telegraf that referenced this pull request Aug 13, 2021
* origin/master: (183 commits)
  fix: CrateDB replace dots in tag keys with underscores (influxdata#9566)
  feat: Pull metrics from multiple AWS CloudWatch namespaces (influxdata#9386)
  fix: improve Clickhouse corner cases for empty recordset in aggregation queries, fix dictionaries behavior (influxdata#9401)
  fix(opcua): clean client on disconnect so that connect works cleanly (influxdata#9583)
  fix: Refactor ec2 init for config-api (influxdata#9576)
  fix: sort logs by timestamp before writing to Loki (influxdata#9571)
  fix: muting tests for udp_listener (influxdata#9578)
  fix: Do not return on disconnect to avoid breaking reconnect (influxdata#9524)
  fix: Fixing k8s nodes and pods parsing error (influxdata#9581)
  feat: OpenTelemetry output plugin (influxdata#9228)
  feat: Support AWS Web Identity Provider (influxdata#9411)
  fix: upgraded sensu/go to v2.9.0 (influxdata#9577)
  fix: Normalize unix socket path (influxdata#9554)
  docs: fix aws ec2 readme inconsistency (influxdata#9567)
  feat: Modbus Rtu over tcp enhancement (influxdata#9570)
  docs: information on new conventional commit format (influxdata#9573)
  docs: Add logo (influxdata#9574)
  docs: Adding links to net_irtt and dht_sensor external plugins (influxdata#9569)
  Upgrade hashicorp/consul/api to 1.9.1 (influxdata#9565)
  Update vmware/govmomi to v0.26.0 (influxdata#9552)
  Do not skip good quality nodes after a bad quality node is encountered (influxdata#9550)
  fix test so it hits a fake service (influxdata#9564)
  Update changelog
  Fix procstat plugin README to match sample config (influxdata#9553)
  Fix metrics reported as written but not actually written  (influxdata#9526)
  Prevent segfault in persistent volume claims (influxdata#9549)
  Update procstat to support cgroup globs & include systemd unit children (Copy of influxdata#7890) (influxdata#9488)
  Fix attempt to connect to an empty list of servers. (influxdata#9503)
  Fix handling bool in sql input plugin (influxdata#9540)
  Suricata alerts (influxdata#9322)
  Linter fixes for plugins/inputs/[fg]* (influxdata#9387)
  For Prometheus Input add ability to query Consul Service catalog (influxdata#5464)
  Support Landing page on Prometheus landing page (influxdata#8641)
  [Docs] Clarify tagging behavior (influxdata#9461)
  Change the timeout from all queries to per query (influxdata#9471)
  Attach the pod labels to the `kubernetes_pod_volume` & `kubernetes_pod_network` metrics. (influxdata#9438)
  feat(http_listener_v2): allows multiple paths and add path_tag (influxdata#9529)
  Bug Fix Snmp empty metric name (influxdata#9519)
  Worktable workfile stats (influxdata#8587)
  Update Go to v1.16.6 (influxdata#9542)
  ...
Labels
area/influxdb fix pr to fix corresponding bug plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Development

Successfully merging this pull request may close these issues.

outputs.influxdb not buffering points on telegraf 1.19.1
4 participants