
Implement gRPC based initial state settling in alertmanager. #3925


Merged: 8 commits merged into cortexproject:master from the am-settling branch on Mar 25, 2021.

Conversation

@stevesg (Contributor) commented Mar 8, 2021:

What this PR does:
Implements setting of the alertmanager state by obtaining copies of the full state from replicas over gRPC. This code path is only exercised when the new "Sharding" mode of the alertmanager is enabled.

The work is broken up into commits, each tested independently with unit tests, so they can be split into separate PRs if preferred.

Depends on: #3958
Fixes: #3927
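
For orientation, here is a minimal sketch of the settling idea: on startup, ask the other replicas for a full copy of their state over gRPC and merge whatever comes back before marking the state ready. The Replicator interface, the merge callback, and the timeout value below are illustrative stand-ins, not the exact types or values used in this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Replicator is an illustrative stand-in for the component that reads the
// full state from the other alertmanager replicas over gRPC.
type Replicator interface {
	// ReadFullStateForUser returns one state blob per replica that answered.
	ReadFullStateForUser(ctx context.Context, userID string) ([][]byte, error)
}

// settle asks the other replicas for a full copy of their state and merges
// every blob it receives. It returns once the attempt has completed, whether
// or not any state was obtained, so the caller can mark the state as ready.
func settle(ctx context.Context, userID string, repl Replicator, merge func(part []byte) error) {
	ctx, cancel := context.WithTimeout(ctx, 15*time.Second)
	defer cancel()

	states, err := repl.ReadFullStateForUser(ctx, userID)
	if err != nil {
		// No replica could be reached (e.g. single replica, or peers still
		// starting up). Continue with an empty state.
		fmt.Printf("settling skipped for %s: %v\n", userID, err)
		return
	}
	for _, part := range states {
		if err := merge(part); err != nil {
			fmt.Printf("failed to merge remote state: %v\n", err)
		}
	}
}

// fakeReplicator returns a canned response, standing in for the gRPC calls.
type fakeReplicator struct{ parts [][]byte }

func (f fakeReplicator) ReadFullStateForUser(_ context.Context, _ string) ([][]byte, error) {
	return f.parts, nil
}

func main() {
	repl := fakeReplicator{parts: [][]byte{[]byte("silences"), []byte("notifications")}}
	settle(context.Background(), "tenant-1", repl, func(part []byte) error {
		fmt.Printf("merged %q\n", part)
		return nil
	})
}
```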

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@stevesg force-pushed the am-settling branch 3 times, most recently from 6361d12 to c2771c7, on March 15, 2021 at 10:50.
@stevesg (Contributor, Author) commented Mar 15, 2021:

Rebase + updates from refactoring

@stevesg changed the title from "Implement Settle() function in alert manager state_replication." to "Implement gRPC based initial state settling in alertmanager." on Mar 16, 2021.
@stevesg (Contributor, Author) commented Mar 16, 2021:

Rebase

@stevesg marked this pull request as ready for review on March 17, 2021 at 07:57.
@stevesg (Contributor, Author) commented Mar 17, 2021:

Complete - just looking into a unit test failure.

@pracucci (Contributor) left a comment:

Very good job! The logic flow is pretty clear. I left a few nits.

My only question is whether we would end up in a situation where we could use the alertmanager before settlement is done (see comment in Alertmanager.New()).

Also, if initial state settlement works correctly, you should be able to remove the t.Skip() from the integration test TestAlertmanagerSharding and mention this PR fixes #3927, correct?

@@ -203,6 +199,13 @@ func New(cfg *Config, reg *prometheus.Registry) (*Alertmanager, error) {
c = am.state.AddState("sil:"+cfg.UserID, am.silences, am.registry)
am.silences.SetBroadcast(c.Broadcast)

// State replication needs to be started after the state keys are defined.
if service, ok := am.state.(services.Service); ok {
if err := service.StartAsync(context.Background()); err != nil {
Contributor:

We don't wait until started here (and it's correct). However, this means that we may start using this Alertmanager instance before settlement is completed (and I believe this is not correct). Am I missing anything?

@stevesg (Contributor, Author) commented Mar 18, 2021:

I have been assuming we want to settle in the background, because an Alertmanager is spun up for every tenant; if they all have to hit the timeout, that might take too long if done serially. That being said, it's harder to reason about correctness in this case.

Perhaps safer to change it to wait for now, and explore doing it in the background as a separate piece of work?

Contributor:

It's worth noting that "start using" does not mean quite what one would think: yes, we'll accept alerts, silences, etc., but we'll wait for the state to be replicated before we send a notification.

Contributor (Author):

Will leave as-is. My (current) understanding is as Josh said - there is no requirement to block (except for notifications, which are blocked via the call into WaitReady).
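
To illustrate the behaviour being agreed on here, below is a small self-contained sketch (not the PR's actual code): the replication state settles in the background so the per-tenant constructor is not blocked, alerts and silences are accepted immediately, and only notification dispatch waits on readiness via a WaitReady-style call. The startAsync helper and the timing values are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// replicatedState is an illustrative stand-in for the state_replication
// component discussed above: it settles in the background and exposes a
// readiness gate that notification dispatch waits on.
type replicatedState struct {
	ready chan struct{}
}

func newReplicatedState() *replicatedState {
	return &replicatedState{ready: make(chan struct{})}
}

// startAsync kicks off settling in the background so the caller (e.g. the
// per-tenant Alertmanager constructor) is not blocked.
func (s *replicatedState) startAsync(settle func()) {
	go func() {
		settle()
		close(s.ready) // mark the state as settled
	}()
}

// Ready reports whether settling has finished.
func (s *replicatedState) Ready() bool {
	select {
	case <-s.ready:
		return true
	default:
		return false
	}
}

// WaitReady blocks until settling has finished or the context expires.
// Notification dispatch calls this before sending anything.
func (s *replicatedState) WaitReady(ctx context.Context) error {
	select {
	case <-s.ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	s := newReplicatedState()
	s.startAsync(func() { time.Sleep(50 * time.Millisecond) }) // pretend to settle

	// Alerts and silences can be accepted immediately...
	fmt.Println("accepting alerts, ready =", s.Ready())

	// ...but notifications wait for the state to be settled first.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := s.WaitReady(ctx); err == nil {
		fmt.Println("state settled, notifications may now be sent")
	}
}
```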

@stevesg (Contributor, Author) left a comment:

Thanks for the review!

Yes, I'm just experimenting with that test today to see if it's reliable enough to enable now.


@@ -41,6 +47,14 @@ type state struct {
// newReplicatedStates creates a new state struct, which manages state to be replicated between alertmanagers.
func newReplicatedStates(userID string, rf int, re Replicator, l log.Logger, r prometheus.Registerer) *state {

defaultSettleConfig := util.BackoffConfig{
Contributor (Author):

I don't really have a good answer for this, other than that was the plan. Maybe @gotjosh has an opinion?

I suppose we want to try and cover transient errors as best as possible? But given we're using a more reliable means of fetching the state, which itself has a timeout, and as you say, we have another fallback to S3, maybe we can take the retries away entirely.
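
For context on the trade-off being weighed here, below is a rough standalone sketch of a retry/backoff settle loop in the spirit of the util.BackoffConfig quoted in the diff above. The config fields mirror the shape of the Cortex helper, but the loop, the function names, and the values are simplified stand-ins, not the PR's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// backoffConfig mirrors the shape of the Cortex util.BackoffConfig helper
// referenced in the diff (the values used here are illustrative).
type backoffConfig struct {
	MinBackoff time.Duration
	MaxBackoff time.Duration
	MaxRetries int
}

// settleWithRetries keeps asking peers for their full state until one attempt
// succeeds or the retry budget is exhausted. readFullState is a stand-in for
// the gRPC read against the other replicas.
func settleWithRetries(ctx context.Context, cfg backoffConfig, readFullState func(context.Context) error) error {
	delay := cfg.MinBackoff
	for attempt := 0; attempt < cfg.MaxRetries; attempt++ {
		err := readFullState(ctx)
		if err == nil {
			return nil
		}
		fmt.Printf("settle attempt %d failed: %v\n", attempt+1, err)

		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		if delay *= 2; delay > cfg.MaxBackoff {
			delay = cfg.MaxBackoff
		}
	}
	return errors.New("gave up settling from peers; fall back to object storage")
}

func main() {
	cfg := backoffConfig{MinBackoff: 100 * time.Millisecond, MaxBackoff: time.Second, MaxRetries: 3}
	err := settleWithRetries(context.Background(), cfg, func(context.Context) error {
		return errors.New("peer not ready") // simulate peers still starting up
	})
	fmt.Println(err)
}
```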

@stevesg (Contributor, Author) commented Mar 18, 2021:

Rebase

@stevesg (Contributor, Author) commented Mar 18, 2021:

Addressed review comments.

stevesg added 3 commits on March 18, 2021 at 17:11, each signed off by Steve Simpson <steve.simpson@grafana.com>.
@stevesg (Contributor, Author) commented Mar 18, 2021:

Rebase

@stevesg (Contributor, Author) commented Mar 18, 2021:

I've done some testing locally on #3927: before this change it failed 3 out of 20 runs, and afterwards 0 out of 20. So it's looking good, but I want to check through the logs to confirm we're working as expected, then run a few more iterations before re-enabling the test.

@pracucci (Contributor) left a comment:

Very good job! LGTM (modulo last comments). Can you remove the skip from the integration test TestAlertmanagerSharding so we make sure it passes?

// We can check other alertmanager(s) and explicitly ask them to propagate their state to us if available.
backoff := util.NewBackoff(ctx, s.settleBackoff)
Contributor:

From a previous comment, Steve wrote:

I suppose we want to try and cover transient errors as best as possible? But given we're using a more reliable means of fetching the state, which itself has a timeout, and as you say, we have another fallback to S3, maybe we can take the retries away entirely.

I'm leaning towards removing the retries entirely, unless we have a good reason to keep them, for the reasons already mentioned.


Contributor:

The concept of retries in this scenario is more about making sure we've communicated with other replicas than it is about recovering from failures.

What we care about here is two things:

  • Make sure we've communicated with other replicas (if any)
  • If no state was available from other replicas then try object storage

On scale-ups/downs, retries might not matter all that much, but on total cluster failure or cluster startup I feel like we do care about retrying, because other replicas might also be starting up.

I guess the main thing to address as part of this logic, IMO, is that we should keep track of whether len(fullStates) == replication factor, and if it is, we should carry on.

Contributor (Author):

I suggest for now we leave the retries out, and decide if/how to add them back in once we have some more context to work from.
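
As a concrete illustration of the stopping condition proposed above (keep retrying until state has been received from every replica in the replication factor), here is a self-contained sketch. settleUntilAllReplied, readFullStates, and the attempt limit are hypothetical names for this discussion, not the code that was merged.

```go
package main

import (
	"context"
	"fmt"
)

// settleUntilAllReplied retries reads until every replica in the replication
// factor has answered or the attempts run out, keeping the best (largest)
// response seen so far. readFullStates stands in for the gRPC read.
func settleUntilAllReplied(ctx context.Context, replicationFactor, maxAttempts int,
	readFullStates func(context.Context) ([][]byte, error)) [][]byte {

	var best [][]byte
	for attempt := 0; attempt < maxAttempts; attempt++ {
		states, err := readFullStates(ctx)
		if err != nil {
			continue
		}
		if len(states) > len(best) {
			best = states
		}
		if len(states) == replicationFactor {
			break // heard from every replica; no point retrying further
		}
	}
	return best
}

func main() {
	replies := [][]byte{[]byte("replica-1"), []byte("replica-2"), []byte("replica-3")}
	got := settleUntilAllReplied(context.Background(), 3, 5,
		func(context.Context) ([][]byte, error) { return replies, nil })
	fmt.Printf("settled with state from %d replicas\n", len(got))
}
```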

stevesg added 2 commits on March 23, 2021 at 10:33, each signed off by Steve Simpson <steve.simpson@grafana.com>.
@gotjosh (Contributor) left a comment:

LGTM Steve 👏. I have one question on the readiness and re-broadcasting, and I think we need a decision on the semantics of "retries".

@@ -154,14 +224,21 @@ func (s *state) running(ctx context.Context) error {
}
}

func (s *state) broadcast(key string, b []byte) {
// We should ignore the Merges into the initial state during settling.
if s.Ready() {
Contributor:

Do we need this? IIRC, we decided to go with no re-broadcasting of state on calls to Merge; under that scheme this would never occur unless we received a request directly to this Alertmanager.

Contributor (Author):

Good spot - yes, we do need it for now, until the re-broadcasting is disabled. I wanted to do that before this commit, but it seems that disabling it is not as trivial as we thought and will involve upstream changes.
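
A minimal sketch of the guard in question, showing why the Ready() check matters while re-broadcasting on Merge is still enabled upstream: broadcasts triggered while the initial state is being merged are dropped, and only updates made after settling are propagated. The state struct and the send function here are illustrative, not the PR's exact code.

```go
package main

import "fmt"

// state sketches the Ready-gated broadcast discussed above: while settling,
// merges of the initial state are applied but not re-broadcast to peers.
// send is an illustrative stand-in for the partial-state broadcast to peers.
type state struct {
	settled bool
	send    func(key string, b []byte)
}

func (s *state) Ready() bool { return s.settled }

func (s *state) broadcast(key string, b []byte) {
	// Ignore broadcasts triggered while merging the initial state during
	// settling; only updates made after readiness should be propagated.
	if !s.Ready() {
		return
	}
	s.send(key, b)
}

func main() {
	s := &state{send: func(key string, b []byte) { fmt.Printf("broadcast %s: %q\n", key, b) }}

	s.broadcast("sil:user-1", []byte("dropped while settling"))
	s.settled = true
	s.broadcast("sil:user-1", []byte("sent after settling"))
}
```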

stevesg added a commit, signed off by Steve Simpson <steve.simpson@grafana.com>.
@gotjosh (Contributor) left a comment:

LGTM

@pracucci (Contributor) left a comment:

Thanks for addressing all comments. LGTM! 🚀

@pracucci merged commit c05eb36 into cortexproject:master on Mar 25, 2021.
56quarters added a commit to grafana/mimir that referenced this pull request Jun 30, 2022
In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
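
To make the described fix concrete, here is a hedged sketch of the intended restore order (readFromPeers and readFromStorage are illustrative stand-ins, not Mimir's actual functions): peers are tried first, object storage is the fallback, and a single-replica deployment no longer short-circuits the restore.

```go
package main

import (
	"context"
	"fmt"
)

// restoreState sketches the corrected restore order described in the commit
// message above: peers are tried first, object storage second, and a
// single-replica deployment no longer short-circuits the whole restore.
func restoreState(ctx context.Context,
	readFromPeers, readFromStorage func(context.Context) ([]byte, error)) ([]byte, error) {

	if state, err := readFromPeers(ctx); err == nil && len(state) > 0 {
		return state, nil
	}
	// Peer read failed or returned nothing (e.g. only one replica running):
	// fall back to the state previously persisted in object storage.
	return readFromStorage(ctx)
}

func main() {
	noPeers := func(context.Context) ([]byte, error) { return nil, fmt.Errorf("no peers") }
	fromStorage := func(context.Context) ([]byte, error) { return []byte("persisted state"), nil }

	state, err := restoreState(context.Background(), noPeers, fromStorage)
	fmt.Println(string(state), err)
}
```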
56quarters added a commit to grafana/mimir that referenced this pull request on Jul 5, 2022, with the same commit message as above.
pracucci pushed a commit to grafana/mimir that referenced this pull request on Jul 6, 2022 ("Restore alertmanager state from storage as fallback", same message as above, plus code review changes).
masonmei pushed a commit to udmire/mimir that referenced this pull request on Jul 11, 2022, with the same message.

Successfully merging this pull request may close this issue: TestAlertmanagerSharding is flaky due to a logic issue (#3927).