Fix metrics reported as written but not actually written #9526
Hey @MyaLongmire, thanks for the PR! There are some debug messages left in the code and I'd like to ask you to add a comment on the "strange" error handling. I guess it has to be like this for telegraf to trigger a retry, however it might be nicer to retry directly after successfully creating the database.
Assume your metric buffer is full and that's why a flush/write was triggered in the first place. In this first try you create the database and return an error (as you currently do). Now your metric buffer is still full. If new metrics arrive between this and the next retry, you will lose metrics because the buffer is full...
@MyaLongmire can you comment on the potential buffer fullness issue I mentioned? It's OK for me if we do not tackle it in this PR, but if this is not just hypothetical, we need to take care of it.
@srebhan in my understanding, if the buffer is full when new metrics come in, the oldest are dropped. There is not much to be done about that.
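To make the drop-oldest behaviour described above concrete, here is a minimal sketch. The `buffer` type, its fields, and `newBuffer` are hypothetical names for illustration only and are not Telegraf's actual buffer implementation.

```go
package main

import "fmt"

// buffer is a hypothetical fixed-capacity metric buffer (illustration only).
type buffer struct {
	items   []string
	limit   int
	dropped int
}

func newBuffer(limit int) *buffer {
	return &buffer{limit: limit}
}

// add appends a metric. If the buffer is already full, the oldest
// metric is discarded first and counted as dropped.
func (b *buffer) add(metric string) {
	if len(b.items) == b.limit {
		b.items = b.items[1:]
		b.dropped++
	}
	b.items = append(b.items, metric)
}

func main() {
	b := newBuffer(2)
	for _, m := range []string{"a", "b", "c"} {
		b.add(m)
	}
	fmt.Println(b.items, b.dropped) // prints "[b c] 1"
}
```

With a capacity of two, the third metric pushes out the first one, which is the loss the reviewer is worried about when a full buffer is not flushed.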
@MyaLongmire imagine the case where the database does not yet exist on the InfluxDB side. Further assume the buffer runs full (e.g. due to a delayed start of the database service or whatever). In the ideal case, the output plugin would create the database and flush the metrics to this newly created database. That's what I would expect as a user. My suggestion is to change the current state

    plugin.Write() // --> will create the database and return error without flushing the buffer
    ... gathering of data...
    plugin.Write() // --> will flush the buffer containing data of the first "Write()" call and new data

to

    plugin.Write() // --> will create the database and try flushing the buffer
    ... gathering of data...
    plugin.Write() // --> will flush the buffer containing ONLY new data
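The suggested flow (create the database, then flush in the same call) could look roughly like the sketch below. Everything here is an assumption made for illustration: `client`, `errDatabaseNotFound`, `createDatabase`, and `writeWithRetry` are hypothetical names, not the plugin's real types or functions.

```go
package main

import (
	"errors"
	"fmt"
)

// errDatabaseNotFound is a hypothetical sentinel error for this sketch.
var errDatabaseNotFound = errors.New("database not found")

// client is a hypothetical stand-in for an InfluxDB client.
type client struct {
	dbExists bool
	written  int
}

// write fails with errDatabaseNotFound until the database exists.
func (c *client) write(batch []string) error {
	if !c.dbExists {
		return errDatabaseNotFound
	}
	c.written += len(batch)
	return nil
}

// createDatabase always succeeds in this sketch.
func (c *client) createDatabase() error {
	c.dbExists = true
	return nil
}

// writeWithRetry writes the batch. If the database is missing, it creates
// the database and retries once within the same call, so the caller's
// buffer can be flushed instead of waiting for the next flush interval.
func writeWithRetry(c *client, batch []string) error {
	err := c.write(batch)
	if !errors.Is(err, errDatabaseNotFound) {
		return err // success (nil) or an unrelated error
	}
	if err := c.createDatabase(); err != nil {
		return fmt.Errorf("creating database failed: %w", err)
	}
	return c.write(batch)
}

func main() {
	c := &client{}
	err := writeWithRetry(c, []string{"cpu", "mem"})
	fmt.Println(err, c.written) // prints "<nil> 2"
}
```

The single retry keeps the error path simple: if the second write also fails, the error is returned and the caller keeps the batch buffered as before.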
I get what you're saying @srebhan, but I don't think it's very likely that the buffer is full and the database has not yet been created but will be created successfully on this call. If that is the case, it's likely the buffer was already full and dropping messages.
@ssoroka understood. It's fine with me.
Looks good to me.
(cherry picked from commit 8d2b1e8)
Required for all PRs:
resolves #9514
Reworked the code to handle errors other than "database not found", so it no longer reports metrics as written when the database goes down. Also updated the test.