
Implement gRPC based initial state settling in alertmanager. #3925


Merged: 8 commits merged into cortexproject:master from the am-settling branch on Mar 25, 2021.

Conversation

@stevesg (Contributor) commented Mar 8, 2021:

What this PR does:
Implements setting of the alertmanager state by obtaining copies of the full state from replicas over gRPC. This code path is only exercised when the new "Sharding" mode of the alertmanager is enabled.

The work is broken up into commits, each tested independently with unit tests, so they can be split into separate PRs if preferred.

Depends on: #3958
Fixes: #3927
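
For orientation, here is a minimal sketch of the settling idea: on startup, ask the other replicas for a full copy of their state over gRPC and merge whatever comes back before marking the state ready. The Replicator interface, the merge callback, and the timeout value below are illustrative stand-ins, not the exact types or values used in this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Replicator is an illustrative stand-in for the component that reads the
// full state from the other alertmanager replicas over gRPC.
type Replicator interface {
	// ReadFullStateForUser returns one state blob per replica that answered.
	ReadFullStateForUser(ctx context.Context, userID string) ([][]byte, error)
}

// settle asks the other replicas for a full copy of their state and merges
// every blob it receives. It returns once the attempt has completed, whether
// or not any state was obtained, so the caller can mark the state as ready.
func settle(ctx context.Context, userID string, repl Replicator, merge func(part []byte) error) {
	ctx, cancel := context.WithTimeout(ctx, 15*time.Second)
	defer cancel()

	states, err := repl.ReadFullStateForUser(ctx, userID)
	if err != nil {
		// No replica could be reached (e.g. single replica, or peers still
		// starting up). Continue with an empty state.
		fmt.Printf("settling skipped for %s: %v\n", userID, err)
		return
	}
	for _, part := range states {
		if err := merge(part); err != nil {
			fmt.Printf("failed to merge remote state: %v\n", err)
		}
	}
}

// fakeReplicator returns a canned response, standing in for the gRPC calls.
type fakeReplicator struct{ parts [][]byte }

func (f fakeReplicator) ReadFullStateForUser(_ context.Context, _ string) ([][]byte, error) {
	return f.parts, nil
}

func main() {
	repl := fakeReplicator{parts: [][]byte{[]byte("silences"), []byte("notifications")}}
	settle(context.Background(), "tenant-1", repl, func(part []byte) error {
		fmt.Printf("merged %q\n", part)
		return nil
	})
}
```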

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@stevesg force-pushed the am-settling branch 3 times, most recently from 6361d12 to c2771c7, on March 15, 2021 at 10:50.
@stevesg (Contributor, Author) commented Mar 15, 2021:

Rebase + updates from refactoring

@stevesg changed the title from "Implement Settle() function in alert manager state_replication." to "Implement gRPC based initial state settling in alertmanager." on Mar 16, 2021.
@stevesg (Contributor, Author) commented Mar 16, 2021:

Rebase

@stevesg marked this pull request as ready for review on March 17, 2021 at 07:57.
@stevesg (Contributor, Author) commented Mar 17, 2021:

Complete - just looking into a unit test failure.

@pracucci (Contributor) left a comment:

Very good job! The logic flow is pretty clear. I left a few nits.

My only question is whether we would end up in a situation where we could use the alertmanager before settlement is done (see comment in Alertmanager.New()).

Also, if initial state settlement works correctly, you should be able to remove the t.Skip() from the integration test TestAlertmanagerSharding and mention this PR fixes #3927, correct?

@@ -203,6 +199,13 @@ func New(cfg *Config, reg *prometheus.Registry) (*Alertmanager, error) {
c = am.state.AddState("sil:"+cfg.UserID, am.silences, am.registry)
am.silences.SetBroadcast(c.Broadcast)

// State replication needs to be started after the state keys are defined.
if service, ok := am.state.(services.Service); ok {
if err := service.StartAsync(context.Background()); err != nil {
Contributor:

We don't wait until started here (and it's correct). However, this means that we may start using this Alertmanager instance before settlement is completed (and I believe this is not correct). Am I missing anything?

@stevesg (Contributor, Author) commented Mar 18, 2021:

I have been assuming we want to settle in the background, because an Alertmanager is spun up for every tenant; if they all have to hit the timeout, that might take too long if done serially. That being said, it's harder to reason about correctness in this case.

Perhaps safer to change it to wait for now, and explore doing it in the background as a separate piece of work?

Contributor:

It's worth noting that "start using" does not mean quite what one would think: yes, we'll accept alerts, silences, etc., but we'll wait for the state to be replicated before we send a notification.

Contributor (Author):

Will leave as-is. My (current) understanding is as Josh said - there is no requirement to block (except for notifications, which are blocked via the call into WaitReady).
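
To illustrate the behaviour being agreed on here, below is a small self-contained sketch (not the PR's actual code): the replication state settles in the background so the per-tenant constructor is not blocked, alerts and silences are accepted immediately, and only notification dispatch waits on readiness via a WaitReady-style call. The startAsync helper and the timing values are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// replicatedState is an illustrative stand-in for the state_replication
// component discussed above: it settles in the background and exposes a
// readiness gate that notification dispatch waits on.
type replicatedState struct {
	ready chan struct{}
}

func newReplicatedState() *replicatedState {
	return &replicatedState{ready: make(chan struct{})}
}

// startAsync kicks off settling in the background so the caller (e.g. the
// per-tenant Alertmanager constructor) is not blocked.
func (s *replicatedState) startAsync(settle func()) {
	go func() {
		settle()
		close(s.ready) // mark the state as settled
	}()
}

// Ready reports whether settling has finished.
func (s *replicatedState) Ready() bool {
	select {
	case <-s.ready:
		return true
	default:
		return false
	}
}

// WaitReady blocks until settling has finished or the context expires.
// Notification dispatch calls this before sending anything.
func (s *replicatedState) WaitReady(ctx context.Context) error {
	select {
	case <-s.ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	s := newReplicatedState()
	s.startAsync(func() { time.Sleep(50 * time.Millisecond) }) // pretend to settle

	// Alerts and silences can be accepted immediately...
	fmt.Println("accepting alerts, ready =", s.Ready())

	// ...but notifications wait for the state to be settled first.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := s.WaitReady(ctx); err == nil {
		fmt.Println("state settled, notifications may now be sent")
	}
}
```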

@stevesg (Contributor, Author) left a comment:

Thanks for the review!

Yes, I'm just experimenting with that test today to see if it's reliable enough to enable now.


@@ -41,6 +47,14 @@ type state struct {
// newReplicatedStates creates a new state struct, which manages state to be replicated between alertmanagers.
func newReplicatedStates(userID string, rf int, re Replicator, l log.Logger, r prometheus.Registerer) *state {

defaultSettleConfig := util.BackoffConfig{
Contributor (Author):

I don't really have a good answer for this, other than that was the plan. Maybe @gotjosh has an opinion?

I suppose we want to try and cover transient errors as best as possible? But given we're using a more reliable means of fetching the state, which itself has a timeout, and as you say, we have another fallback to S3, maybe we can take the retries away entirely.
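
For context on the trade-off being weighed here, below is a rough standalone sketch of a retry/backoff settle loop in the spirit of the util.BackoffConfig quoted in the diff above. The config fields mirror the shape of the Cortex helper, but the loop, the function names, and the values are simplified stand-ins, not the PR's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// backoffConfig mirrors the shape of the Cortex util.BackoffConfig helper
// referenced in the diff (the values used here are illustrative).
type backoffConfig struct {
	MinBackoff time.Duration
	MaxBackoff time.Duration
	MaxRetries int
}

// settleWithRetries keeps asking peers for their full state until one attempt
// succeeds or the retry budget is exhausted. readFullState is a stand-in for
// the gRPC read against the other replicas.
func settleWithRetries(ctx context.Context, cfg backoffConfig, readFullState func(context.Context) error) error {
	delay := cfg.MinBackoff
	for attempt := 0; attempt < cfg.MaxRetries; attempt++ {
		err := readFullState(ctx)
		if err == nil {
			return nil
		}
		fmt.Printf("settle attempt %d failed: %v\n", attempt+1, err)

		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		if delay *= 2; delay > cfg.MaxBackoff {
			delay = cfg.MaxBackoff
		}
	}
	return errors.New("gave up settling from peers; fall back to object storage")
}

func main() {
	cfg := backoffConfig{MinBackoff: 100 * time.Millisecond, MaxBackoff: time.Second, MaxRetries: 3}
	err := settleWithRetries(context.Background(), cfg, func(context.Context) error {
		return errors.New("peer not ready") // simulate peers still starting up
	})
	fmt.Println(err)
}
```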

@stevesg (Contributor, Author) commented Mar 18, 2021:

Rebase

@stevesg (Contributor, Author) commented Mar 18, 2021:

Addressed review comments.

stevesg added 3 commits on March 18, 2021 at 17:11, each signed off by Steve Simpson <steve.simpson@grafana.com>.
@stevesg (Contributor, Author) commented Mar 18, 2021:

Rebase

@stevesg (Contributor, Author) commented Mar 18, 2021:

I've done some testing locally on #3927: before this change it failed 3 out of 20 runs, and afterwards 0 out of 20. So it's looking good, but I want to check through the logs to confirm we're working as expected, then run a few more iterations before re-enabling the test.

@pracucci (Contributor) left a comment:

Very good job! LGTM (modulo last comments). Can you remove the skip from the integration test TestAlertmanagerSharding so we make sure it passes?

// We can check other alertmanager(s) and explicitly ask them to propagate their state to us if available.
backoff := util.NewBackoff(ctx, s.settleBackoff)
Contributor:

From a previous comment, Steve wrote:

I suppose we want to try and cover transient errors as best as possible? But given we're using a more reliable means of fetching the state, which itself has a timeout, and as you say, we have another fallback to S3, maybe we can take the retries away entirely.

I'm leaning towards removing the retries entirely, unless we have a good reason to keep them, for the reasons already mentioned.


Contributor:

The concept of retries in this scenario is more about making sure we've communicated with other replicas than it is about recovering from failures.

What we care about here is two things:

  • Make sure we've communicated with other replicas (if any)
  • If no state was available from other replicas then try object storage

On scale-ups/downs, retries might not matter all that much, but on total cluster failure or cluster startup I feel like we do care about retrying, because other replicas might also be starting up.

I guess the main thing to address as part of this logic, IMO, is that we should keep track of whether len(fullStates) == replication factor, and if it is, we should carry on.

Contributor (Author):

I suggest for now we leave the retries out, and decide if/how to add them back in once we have some more context to work from.
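
As a concrete illustration of the stopping condition proposed above (keep retrying until state has been received from every replica in the replication factor), here is a self-contained sketch. settleUntilAllReplied, readFullStates, and the attempt limit are hypothetical names for this discussion, not the code that was merged.

```go
package main

import (
	"context"
	"fmt"
)

// settleUntilAllReplied retries reads until every replica in the replication
// factor has answered or the attempts run out, keeping the best (largest)
// response seen so far. readFullStates stands in for the gRPC read.
func settleUntilAllReplied(ctx context.Context, replicationFactor, maxAttempts int,
	readFullStates func(context.Context) ([][]byte, error)) [][]byte {

	var best [][]byte
	for attempt := 0; attempt < maxAttempts; attempt++ {
		states, err := readFullStates(ctx)
		if err != nil {
			continue
		}
		if len(states) > len(best) {
			best = states
		}
		if len(states) == replicationFactor {
			break // heard from every replica; no point retrying further
		}
	}
	return best
}

func main() {
	replies := [][]byte{[]byte("replica-1"), []byte("replica-2"), []byte("replica-3")}
	got := settleUntilAllReplied(context.Background(), 3, 5,
		func(context.Context) ([][]byte, error) { return replies, nil })
	fmt.Printf("settled with state from %d replicas\n", len(got))
}
```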

stevesg added 2 commits on March 23, 2021 at 10:33, each signed off by Steve Simpson <steve.simpson@grafana.com>.
@gotjosh (Contributor) left a comment:

LGTM Steve 👏. I have one question on the readiness and re-broadcasting, and I think we need a decision on the semantics of "retries".

@@ -154,14 +224,21 @@ func (s *state) running(ctx context.Context) error {
}
}

func (s *state) broadcast(key string, b []byte) {
// We should ignore the Merges into the initial state during settling.
if s.Ready() {
Contributor:

Do we need this? IIRC, we decided to go with no re-broadcasting of state on calls to Merge; under that scheme this would never occur unless we received a request directly to this Alertmanager.

Contributor (Author):

Good spot - yes, we do need it for now, until the re-broadcasting is disabled. I wanted to do that before this commit, but it seems that disabling it is not as trivial as we thought and will involve upstream changes.
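
A minimal sketch of the guard in question, showing why the Ready() check matters while re-broadcasting on Merge is still enabled upstream: broadcasts triggered while the initial state is being merged are dropped, and only updates made after settling are propagated. The state struct and the send function here are illustrative, not the PR's exact code.

```go
package main

import "fmt"

// state sketches the Ready-gated broadcast discussed above: while settling,
// merges of the initial state are applied but not re-broadcast to peers.
// send is an illustrative stand-in for the partial-state broadcast to peers.
type state struct {
	settled bool
	send    func(key string, b []byte)
}

func (s *state) Ready() bool { return s.settled }

func (s *state) broadcast(key string, b []byte) {
	// Ignore broadcasts triggered while merging the initial state during
	// settling; only updates made after readiness should be propagated.
	if !s.Ready() {
		return
	}
	s.send(key, b)
}

func main() {
	s := &state{send: func(key string, b []byte) { fmt.Printf("broadcast %s: %q\n", key, b) }}

	s.broadcast("sil:user-1", []byte("dropped while settling"))
	s.settled = true
	s.broadcast("sil:user-1", []byte("sent after settling"))
}
```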

stevesg added a commit, signed off by Steve Simpson <steve.simpson@grafana.com>.
@gotjosh (Contributor) left a comment:

LGTM

@pracucci (Contributor) left a comment:

Thanks for addressing all comments. LGTM! 🚀

@pracucci merged commit c05eb36 into cortexproject:master on Mar 25, 2021.
56quarters added a commit to grafana/mimir that referenced this pull request Jun 30, 2022
In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
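
To make the described fix concrete, here is a hedged sketch of the intended restore order (readFromPeers and readFromStorage are illustrative stand-ins, not Mimir's actual functions): peers are tried first, object storage is the fallback, and a single-replica deployment no longer short-circuits the restore.

```go
package main

import (
	"context"
	"fmt"
)

// restoreState sketches the corrected restore order described in the commit
// message above: peers are tried first, object storage second, and a
// single-replica deployment no longer short-circuits the whole restore.
func restoreState(ctx context.Context,
	readFromPeers, readFromStorage func(context.Context) ([]byte, error)) ([]byte, error) {

	if state, err := readFromPeers(ctx); err == nil && len(state) > 0 {
		return state, nil
	}
	// Peer read failed or returned nothing (e.g. only one replica running):
	// fall back to the state previously persisted in object storage.
	return readFromStorage(ctx)
}

func main() {
	noPeers := func(context.Context) ([]byte, error) { return nil, fmt.Errorf("no peers") }
	fromStorage := func(context.Context) ([]byte, error) { return []byte("persisted state"), nil }

	state, err := restoreState(context.Background(), noPeers, fromStorage)
	fmt.Println(string(state), err)
}
```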
56quarters added a commit to grafana/mimir that referenced this pull request on Jul 5, 2022, with the same commit message as above.
pracucci pushed a commit to grafana/mimir that referenced this pull request on Jul 6, 2022 ("Restore alertmanager state from storage as fallback", same message as above, plus code review changes).
masonmei pushed a commit to udmire/mimir that referenced this pull request on Jul 11, 2022, with the same message.

Successfully merging this pull request may close this issue: TestAlertmanagerSharding is flaky due to a logic issue (#3927).