Implement gRPC based initial state settling in alertmanager. #3925
Conversation
6361d12 to c2771c7
Rebase + updates from refactoring.
Rebase.
Complete - just looking into unit test failure.
Very good job! The logic flow is pretty clear. I left a few nits.
My only question is whether we would end up in a situation where we could use the alertmanager before settlement is done (see comment in Alertmanager.New()).
Also, if initial state settlement works correctly, you should be able to remove the t.Skip() from the integration test TestAlertmanagerSharding and mention this PR fixes #3927, correct?
@@ -203,6 +199,13 @@ func New(cfg *Config, reg *prometheus.Registry) (*Alertmanager, error) {
	c = am.state.AddState("sil:"+cfg.UserID, am.silences, am.registry)
	am.silences.SetBroadcast(c.Broadcast)

	// State replication needs to be started after the state keys are defined.
	if service, ok := am.state.(services.Service); ok {
		if err := service.StartAsync(context.Background()); err != nil {
We don't wait until started here (and it's correct). However, this means that we may start using this Alertmanager instance before settlement is completed (and I believe this is not correct). Am I missing anything?
I have been assuming we want to settle in the background, because an Alertmanager is spun up for every tenant. If they all have to hit the timeout, that might take too long if done serially. That being said, it's harder to reason about correctness in this case.
Perhaps safer to change it to wait for now, and explore doing it in the background as a separate piece of work?
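For comparison, here is a minimal sketch of the blocking alternative being discussed. It assumes a StartAndAwaitRunning helper exists in the Cortex services package and that errors.Wrap comes from github.com/pkg/errors; treat it as an illustration of the idea, not the PR's actual code.

```go
// Hypothetical variant of the hunk above: block until the replicated-state
// service reports Running (i.e. settlement has finished) before New() returns,
// instead of starting it asynchronously with StartAsync.
if service, ok := am.state.(services.Service); ok {
	if err := services.StartAndAwaitRunning(context.Background(), service); err != nil {
		return nil, errors.Wrap(err, "failed to start state replication service")
	}
}
```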
It's worth noting that "start using" does not mean quite what one might think. Yes, we'll accept alerts, silences, etc. However, we'll wait for the state to be replicated before we send a notification.
Will leave as-is. My (current) understanding is as Josh said - there is no requirement to block (except for notifications, which are blocked via the call into WaitReady).
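As a rough illustration of that readiness gate, here is a minimal, self-contained sketch of how a Ready()/WaitReady() pair can be built around a channel closed once settlement completes. The names mirror the discussion but are not necessarily the PR's exact implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// settlingState is a stand-in for the replicated state type discussed above.
type settlingState struct {
	readyCh chan struct{} // closed exactly once, when initial settlement is done
}

func newSettlingState() *settlingState {
	return &settlingState{readyCh: make(chan struct{})}
}

// settled marks settlement as complete, releasing anyone blocked in WaitReady.
func (s *settlingState) settled() {
	close(s.readyCh)
}

// Ready reports, without blocking, whether settlement has completed.
func (s *settlingState) Ready() bool {
	select {
	case <-s.readyCh:
		return true
	default:
		return false
	}
}

// WaitReady blocks until settlement completes or the context is cancelled.
// Notification dispatch would call this before sending anything.
func (s *settlingState) WaitReady(ctx context.Context) error {
	select {
	case <-s.readyCh:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	s := newSettlingState()
	go func() {
		time.Sleep(100 * time.Millisecond) // pretend settlement takes a moment
		s.settled()
	}()

	fmt.Println("ready before settling:", s.Ready())
	_ = s.WaitReady(context.Background())
	fmt.Println("ready after settling:", s.Ready())
}
```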
Thanks for the review!
Yes, I'm just experimenting with that test today to see if it's reliable enough to enable now.
@@ -41,6 +47,14 @@ type state struct {
// newReplicatedStates creates a new state struct, which manages state to be replicated between alertmanagers.
func newReplicatedStates(userID string, rf int, re Replicator, l log.Logger, r prometheus.Registerer) *state {

	defaultSettleConfig := util.BackoffConfig{
I don't really have a good answer for this, other than that was the plan. Maybe @gotjosh has an opinion?
I suppose we want to try and cover transient errors as best as possible? But given we're using a more reliable means of fetching the state, which itself has a timeout, and as you say, we have another fallback to S3, maybe we can take the retries away entirely.
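For reference, here is a minimal sketch of what a retrying settle loop looks like with the Cortex backoff helper. The import path and BackoffConfig fields are assumed from the Cortex util package of that era, the config values are illustrative, and fetchFullStateFromReplicas is a hypothetical stand-in for the gRPC read.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/cortexproject/cortex/pkg/util"
)

// fetchFullStateFromReplicas is a hypothetical stand-in for the gRPC call that
// asks peer alertmanagers for their full state.
func fetchFullStateFromReplicas(ctx context.Context) error {
	return errors.New("no replicas answered") // simulate a transient failure
}

func main() {
	// Roughly the shape of defaultSettleConfig discussed above; the values
	// here are illustrative, not the ones used in the PR.
	settleCfg := util.BackoffConfig{
		MinBackoff: 100 * time.Millisecond,
		MaxBackoff: 2 * time.Second,
		MaxRetries: 3,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	backoff := util.NewBackoff(ctx, settleCfg)
	for backoff.Ongoing() {
		if err := fetchFullStateFromReplicas(ctx); err == nil {
			fmt.Println("settled from replicas")
			return
		}
		backoff.Wait()
	}

	// Every attempt failed: this is where the fallback to object storage would kick in.
	fmt.Println("falling back to object storage:", backoff.Err())
}
```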
Rebase.
Addressed review comments.
Rebase.
I've done some testing locally on #3927 and before this change it failed 3/20, and afterwards 0/20. So it's looking good, but I want to check through the logs to verify we're working as expected, then run a few more iterations before re-enabling the test.
Very good job! LGTM (modulo last comments). Can you remove the skip from the integration test TestAlertmanagerSharding so we make sure it passes?
// We can check other alertmanager(s) and explicitly ask them to propagate their state to us if available.
backoff := util.NewBackoff(ctx, s.settleBackoff)
From a previous comment, Steve wrote:
I suppose we want to try and cover transient errors as best as possible? But given we're using a more reliable means of fetching the state, which itself has a timeout, and as you say, we have another fallback to S3, maybe we can take the retries away entirely.
I'm leaning towards removing retries entirely, unless we have a good reason to keep them, for the reasons already mentioned.
The concept of retries in this scenario is more about making sure we've communicated with other replicas than it is about recovering from failures.
What we care about here is two things:
- Make sure we've communicated with other replicas (if any)
- If no state was available from other replicas, then try object storage
On scale ups/downs retries might not matter all that much, but on total cluster failure or cluster startup I feel like we do care about retrying, because other replicas might also be starting up.
I guess the main thing to address as part of this logic, IMO, is that we should keep track of whether len(fullStates) == replication factor, and if it is, we should carry on.
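A tiny sketch of that early-exit condition; the names fullStatesReceived and replicationFactor are illustrative, not taken from the PR.

```go
// settledEnough reports whether we have heard a full state back from enough
// replicas to stop retrying, following the condition described above.
// Depending on whether the local replica counts itself, the threshold might
// instead be replicationFactor-1.
func settledEnough(fullStatesReceived, replicationFactor int) bool {
	return fullStatesReceived >= replicationFactor
}
```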
I suggest for now we leave the retries out, and decide if/how to add them back in once we have some more context to work from.
LGTM Steve 👏. I have one question on the readiness and re-broadcasting, and I think we need a decision on the semantics of "retries".
@@ -154,14 +224,21 @@ func (s *state) running(ctx context.Context) error {
	}
}

func (s *state) broadcast(key string, b []byte) {
	// We should ignore the Merges into the initial state during settling.
	if s.Ready() {
Do we need this? IIRC, we decided to go with no rebroadcasting of state on calls to Merge; under that scheme, this would never occur unless we received a request directly to this Alertmanager.
Good spot - yes, we do need it for now, until the re-broadcasting is disabled. I wanted to do it before this commit, but it seems that disabling it is not as trivial as we thought and will involve upstream changes.
LGTM
Thanks for addressing all comments. LGTM! 🚀
* Restore alertmanager state from storage as fallback
In cortexproject/cortex#3925 the ability to restore alertmanager state from peer alertmanagers was added, short-circuiting if there is only a single replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read state from storage was added in case reading from peers failed. However, the short-circuiting if there is only a single peer was not removed. This has the effect of never restoring state in an alertmanager if only running a single replica.
Fixes #2245
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
* Code review changes
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
What this PR does:
Implements setting the alertmanager state by obtaining copies of the full state from replicas over gRPC. This is only used when the new "Sharding" mode of the alertmanager is enabled; it is otherwise unused.
The work is broken up into commits, which can be broken out into separate PRs if preferred, as each change is tested independently with unit tests.
Depends on: #3958
Fixes: #3927
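To make the flow concrete, here is a hedged, self-contained sketch of the settling step: ask the peer replicas for their full state over gRPC and merge whatever comes back before notifications are allowed. The Replicator interface, the ReadFullStateForUser method name, and the FullState shape are assumptions for illustration, not necessarily the PR's exact types.

```go
package main

import (
	"context"
	"fmt"
)

// FullState stands in for the protobuf message carrying one replica's state,
// keyed by state key (e.g. "nfl:<user>" for the notification log, "sil:<user>" for silences).
type FullState struct {
	Parts map[string][]byte
}

// Replicator abstracts the gRPC fan-out to the other alertmanager replicas.
type Replicator interface {
	ReadFullStateForUser(ctx context.Context, userID string) ([]*FullState, error)
}

// mergeFn merges one remote part into the local state (notification log or silences).
type mergeFn func(key string, data []byte) error

// settle pulls full state from peers and merges it locally. Callers could fall
// back to object storage when it returns an error, as discussed in the review.
func settle(ctx context.Context, userID string, r Replicator, merge mergeFn) error {
	states, err := r.ReadFullStateForUser(ctx, userID)
	if err != nil {
		return err
	}
	for _, fs := range states {
		for key, data := range fs.Parts {
			if err := merge(key, data); err != nil {
				return err
			}
		}
	}
	return nil
}

// fakeReplicator returns a single canned state, just to make the sketch runnable.
type fakeReplicator struct{}

func (fakeReplicator) ReadFullStateForUser(ctx context.Context, userID string) ([]*FullState, error) {
	return []*FullState{{Parts: map[string][]byte{"sil:" + userID: []byte("silences")}}}, nil
}

func main() {
	err := settle(context.Background(), "user-1", fakeReplicator{}, func(key string, data []byte) error {
		fmt.Printf("merging %s (%d bytes)\n", key, len(data))
		return nil
	})
	fmt.Println("settle err:", err)
}
```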
Checklist
- Documentation added
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]