Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(outputs): Add framework to retry on startup errors #14884

Merged
merged 13 commits into from
Mar 26, 2024

Conversation

srebhan
Copy link
Member

@srebhan srebhan commented Feb 22, 2024

Summary

A frequently requested feature is to retry startup for plugins in case of error e.g. if the connection to a remote machine fails.

Instead of implementing the logic for retrying in each plugin, this PR provides a framework for output-plugins to use if they wish to support retries. As an example, the Kafka output plugin is converted to use the new framework

More details can be found in the TSD-006 specification.

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #14803
resolves #9778

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Feb 22, 2024
@srebhan srebhan added area/kafka area/agent plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins labels Feb 22, 2024
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this.

In general, all outputs are connecting to something or creating a server for something to connect to with the exception file? And I don't think we would want to change anything for file.

What do you think are the next steps?

models/running_output_test.go Outdated Show resolved Hide resolved
@telegraf-tiger
Copy link
Contributor

@srebhan srebhan marked this pull request as ready for review March 11, 2024 17:59
@srebhan srebhan assigned powersj and DStrand1 and unassigned srebhan Mar 11, 2024
@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Mar 11, 2024
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this! Did you want to look at the wavefront graphite output in this or as a second PR?

@srebhan
Copy link
Member Author

srebhan commented Mar 22, 2024

@powersj here are my findings regarding the graphite output... The current behavior of graphite is to always succeed on startup even though no server can be connected. The plugin will then try to connect to the given servers in each Gather() cycle. If at least one server was connected successfully, the data is written. This is done unconditionally and corresponds to the retry strategy in the new framework with potentially partial startup. Additionally, servers are reconnected if they are found to be unconnected. This is also tried in each Gather() cycle.

Regarding the framework, the conversion of the graphite output has some drawbacks:

  1. To keep the current behavior, we would need to set startup_error_behavior to retry by default. This is difficult and can only be done with a special handling in the config code.
  2. The "reconnect disconnected servers" behavior currently implemented in the plugin is currently not covered by the framework code.

So given the challenges above and the little gain in terms of code-removal I suggest to not convert the graphite output for now, at least until we also handle the reconnect case in the framework...

@powersj powersj removed their assignment Mar 22, 2024
@powersj
Copy link
Contributor

powersj commented Mar 22, 2024

@DStrand1 all yours to do a final review.

@DStrand1 DStrand1 merged commit aa030b5 into influxdata:master Mar 26, 2024
27 checks passed
@github-actions github-actions bot added this to the v1.31.0 milestone Mar 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent area/kafka feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generic handling to ignore errors on startup Two more Kafka input failures that break all other plugins
3 participants