Alertmanager: Allow sharding of alertmanager tenants #3664
Conversation
Force-pushed from 10a4c8b to 7eaedaf.
Force-pushed from da60e34 to 734cf39.
LGTM. This follows the same design as ruler, except we only shard by userID, so shuffle-sharding is not necessary.
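The shard-by-userID approach can be sketched roughly as follows. This is an illustrative stand-in only, assuming an fnv-1a hash and a simple modulo ownership rule; the names `shardByUser` and `owns` and the token scheme are assumptions, not the actual Cortex ring code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardByUser hashes a tenant ID to a 32-bit ring token. Each Alertmanager
// instance then checks whether it owns that token in the ring.
// Illustrative sketch only: the real ownership check lives in the ring package.
func shardByUser(userID string) uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(userID))
	return h.Sum32()
}

// owns is a toy stand-in for the ring lookup: an instance "owns" a tenant
// if the token falls into the instance's share of the token space.
func owns(instanceIndex, totalInstances int, userID string) bool {
	return shardByUser(userID)%uint32(totalInstances) == uint32(instanceIndex)
}

func main() {
	// With two instances, each tenant deterministically maps to exactly one.
	for _, user := range []string{"tenant-a", "tenant-b", "tenant-c"} {
		fmt.Printf("%s -> token %d, owned by instance %d of 2\n",
			user, shardByUser(user), shardByUser(user)%2)
	}
}
```

Because the hash is deterministic, every instance independently arrives at the same owner for a given tenant, which is why no shuffle-sharding machinery is needed here.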
Can we also add an /alertmanager/ring admin endpoint for this, like the other rings, please?
@pstibrany comments addressed. Also renamed a few flags/config options to match the most recent similar ones. @codesome done!
I plan to re-check the PR for what happens with API calls when the AM doesn't own the user.
Also, please rebase to use the changed ring operation code -- the ring operation should now be defined directly in the alertmanager package.
I just noticed when trying this PR that we are not setting the port in the ring config before getting the lifecycler config, hence setting port 0 in the ring for the AM's entry.
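A minimal sketch of the ordering issue, with illustrative stand-in types (not the actual Cortex config structs): the lifecycler config snapshots the port when it is derived, so deriving it before setting the port advertises port 0 in the ring.

```go
package main

import "fmt"

// Illustrative stand-ins, not the actual Cortex config structs.
type RingConfig struct{ ListenPort int }
type LifecyclerConfig struct{ Port int }

// ToLifecyclerConfig snapshots the port at the time it is called, which is
// why the listen port must be set before deriving the lifecycler config.
func (r RingConfig) ToLifecyclerConfig() LifecyclerConfig {
	return LifecyclerConfig{Port: r.ListenPort}
}

func main() {
	var ring RingConfig

	// Buggy order: derive first, set the port later.
	// The ring entry ends up advertising port 0.
	buggy := ring.ToLifecyclerConfig()
	ring.ListenPort = 9095

	// Fixed order: set the port first, then derive.
	fixed := ring.ToLifecyclerConfig()

	fmt.Println("buggy port:", buggy.Port, "fixed port:", fixed.Port)
	// prints: buggy port: 0 fixed port: 9095
}
```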
Force-pushed from ff5eb1e to 416adca.
Do I need a changelog entry now?

I would say yes.

Do we need to mark the flag as experimental?

Not the flag (we don't do that anymore because it's a pain once the feature moves to stable), but you should mention that the feature is experimental in docs/configuration/v1-guarantees.md.
pkg/cortex/modules.go (outdated)

```go
@@ -685,6 +685,8 @@ func (t *Cortex) initConfig() (serv services.Service, err error) {
}

func (t *Cortex) initAlertManager() (serv services.Service, err error) {
	t.Cfg.Alertmanager.Ring.ListenPort = t.Cfg.Server.HTTPListenPort
```
This is the port advertised in the ring. Given the ring is just used internally, shouldn't it be the gRPC port like all the other services? Why do we need to advertise the HTTP port?
I think we do. I'm not entirely sure what using a different port entails, but this ring is also used by the distributors. Could you point me in the right direction to understand how port exposure makes a difference here?
cc: @codesome
Currently the AM serves only HTTP requests. So, to start with, AM distributors would contact the AM via HTTP (to keep the changes simpler). So should it not be the HTTP port for now? And later, when/if we switch to using gRPC, use the gRPC port in the ring?
I did another pass and left more comments (mostly minor nits). I haven't reviewed the tests yet: I will do that in the next pass, but I would prefer to see these comments addressed first.
pkg/alertmanager/multitenant.go (outdated)

```go
	level.Debug(am.logger).Log("msg", "configuration owned", "user", userID)
	ownedConfigs[userID] = cfg
} else {
	level.Debug(am.logger).Log("msg", "configuration not owned, ignoring", "user", userID)
```
Are these debug logs really useful? I'm a bit dubious.
The ruler has these exact same logs, and in the past they've proven useful in understanding who owns what. Configuration mishaps at runtime might result in a particular tenant's instance not running.
I'd advocate for keeping them in.
Force-pushed from 87c60ab to b36a45b.
Good job! LGTM, modulo a few last nits and the memberlist setup fix in modules.go. I've also merged PR #3677. Could you rebase, please?
```go
		return metrics.GetSumOfCounters("cortex_alertmanager_sync_configs_total")
	})
} else {
	time.Sleep(250 * time.Millisecond)
```
Why this sleep? We try to avoid sleeps in tests because they are a source of flakiness. Can this be converted into a test.Poll()?
I tried that originally, but couldn't. The problem here is that we have no way of saying "wait for at least X polls before checking". 250ms is 2.5x the time needed for the sync to trigger, so hopefully this doesn't become flaky for a while.
The assertion below checks that the value did not change because only the initial sync happened, hence a value of 1.
Force-pushed from 6da94b8 to cfeff14.
The first part of the work proposed as part of cortexproject#3574; introduces sharding via the ring for the Alertmanager component. Signed-off-by: gotjosh <josue@grafana.com>
Force-pushed from cfeff14 to ef19bda.
Signed-off-by: gotjosh <josue@grafana.com>
Signed-off-by: gotjosh <josue@grafana.com>
👍
LGTM with a nit
```go
	RF, LiveIngesters, DeadIngesters int
	ExpectedMaxFailure               int
	ExpectedError                    string
	replifcationFactor, liveIngesters, deadIngesters int
```
`replifcationFactor` → `replicationFactor`
Signed-off-by: gotjosh <josue@grafana.com>
What this PR does:
The first part of the work proposed as part of #3574; introduces sharding via the ring for the Alertmanager component.
I have a few things to test and a couple of doubts; I'm going to keep a checklist here for better visibility:
Checklist
- [ ] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`