Skip to content

Add gossip interval flag for alertmanger #3667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

simonswine
Copy link
Contributor

Signed-off-by: Christian Simon simon@swine.de

What this PR does:

This allows to configure a different gossip interval for alertmanager. This allow to adapt the default of 200ms to different values, which can help to address delayed gossip notifications in high volume and latency deployments.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@simonswine simonswine force-pushed the 20210108_alertmanager_gossip_interval branch from dbc2fe0 to 924699a Compare January 8, 2021 10:25
@@ -117,6 +118,7 @@ func (cfg *MultitenantAlertmanagerConfig) RegisterFlags(f *flag.FlagSet) {

f.StringVar(&cfg.ClusterBindAddr, "cluster.listen-address", defaultClusterAddr, "Listen address for cluster.")
f.StringVar(&cfg.ClusterAdvertiseAddr, "cluster.advertise-address", "", "Explicit address to advertise in cluster.")
f.DurationVar(&cfg.ClusterGossipInterval, "cluster.gossip-interval", cluster.DefaultGossipInterval, "The interval between sending messages that need to be gossiped that haven't been able to piggyback on probing messages. If this is set to zero, non-piggyback gossip is disabled. By lowering this value (more frequent) gossip messages are propagated across cluster more quickly at the expense of increased bandwidth.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too sure why we are not using an -alertmanager. prefix for all of those config flags, as currently it might mislead users to think it is affecting other parts of the system to (e.g. memberlist)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is non-piggyback gossip?

If we allow configuring this value (which I think is already pretty low for gossip, and wouldn't recommend going to lower values), we should also enable configuration of pull/push interval for faster state convergence. (In addition to that, configuring GossipNodes would be useful too, but that doesn't seem configurable in AM).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too sure why we are not using an -alertmanager. prefix for all of those config flags, as currently it might mislead users to think it is affecting other parts of the system to (e.g. memberlist)

I think it's a bug and we should fix it (but need to address it in a separate PR). I've opened an issue in the meanwhile: #3670

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have filed #3677 to fix #3670. Would be great to align these two PRs to make sure the nomenclature and prefix make sense on both. Happy to merge this as is and I can quickly fix it the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is non-piggyback gossip?

That was a copy paste from the memberlist code comment. I think it is confusing more than helping so I reverted to the alertmanager description.

pull/push interval

Added that

Gossip Nodes

Yes that seems to be fixed at quorum or 3 whatever is bigger: https://github.com/prometheus/alertmanager/blob/ce108378d4c8d580804452dcc3a31222067793b2/cluster/cluster.go#L184-L187

Aliging to #3677

I have added some comments here (but meanwhile I used HA in the same way)

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@@ -117,6 +118,7 @@ func (cfg *MultitenantAlertmanagerConfig) RegisterFlags(f *flag.FlagSet) {

f.StringVar(&cfg.ClusterBindAddr, "cluster.listen-address", defaultClusterAddr, "Listen address for cluster.")
f.StringVar(&cfg.ClusterAdvertiseAddr, "cluster.advertise-address", "", "Explicit address to advertise in cluster.")
f.DurationVar(&cfg.ClusterGossipInterval, "cluster.gossip-interval", cluster.DefaultGossipInterval, "The interval between sending messages that need to be gossiped that haven't been able to piggyback on probing messages. If this is set to zero, non-piggyback gossip is disabled. By lowering this value (more frequent) gossip messages are propagated across cluster more quickly at the expense of increased bandwidth.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is non-piggyback gossip?

If we allow configuring this value (which I think is already pretty low for gossip, and wouldn't recommend going to lower values), we should also enable configuration of pull/push interval for faster state convergence. (In addition to that, configuring GossipNodes would be useful too, but that doesn't seem configurable in AM).

@simonswine simonswine force-pushed the 20210108_alertmanager_gossip_interval branch from 924699a to c238f33 Compare January 12, 2021 09:40
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I've a comment about config naming.

@pracucci
Copy link
Contributor

@simonswine We merged #3677. Could you rebase master and move your new config options under the new ClusterConfig struct, please?

The interval between sending messages that need to be gossiped that
haven't been able to piggyback on probing messages. If this is set to
zero, non-piggyback gossip is disabled. By lowering this value (more
frequent) gossip messages are propagated across cluster more quickly at
the expense of increased bandwidth.

Signed-off-by: Christian Simon <simon@swine.de>
@simonswine simonswine force-pushed the 20210108_alertmanager_gossip_interval branch from c238f33 to 714f10c Compare January 19, 2021 14:30
@pull-request-size pull-request-size bot added size/M and removed size/S labels Jan 19, 2021
@simonswine
Copy link
Contributor Author

@pracucci this is ready for another review now

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you did great. Thanks!

@pracucci pracucci merged commit 60ab544 into cortexproject:master Jan 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants