
Alertmanager: Allow sharding of alertmanager tenants #3664

Merged (4 commits) Jan 19, 2021

Conversation

@gotjosh (Contributor) commented Jan 8, 2021

What this PR does:

This is the first part of the work proposed in #3574: it introduces sharding via the ring for the Alertmanager component.

I have a few things to test and a couple of doubts; I'll keep a checklist here for visibility:

  • Do we need to mark the flag as experimental? No.
  • Do I need a changelog entry now? Yes.
  • How does this behave with the current clustering logic? Perhaps disabling clustering while sharding is enabled is the way to go until we're done.
  • Add an integration test for both the sharding and sharding + clustering.
  • Do I need any additional documentation now or can that be added later on?
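For context, sharding tenants via a hash ring can be sketched as follows. This is a self-contained toy illustration with made-up tokens and instance names; the real implementation uses Cortex's ring package and lifecycler:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ringDesc is a toy hash ring: sorted tokens, each owned by an instance.
type ringDesc struct {
	tokens []uint32
	owner  map[uint32]string
}

// get returns the instance owning the first token at or after key,
// wrapping around the ring when key is past the last token.
func (r ringDesc) get(key uint32) string {
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= key })
	if i == len(r.tokens) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.tokens[i]]
}

// hashUser maps a tenant ID onto the ring's token space.
func hashUser(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()
}

func main() {
	r := ringDesc{
		tokens: []uint32{1 << 30, 2 << 30, 3 << 30},
		owner: map[uint32]string{
			1 << 30: "alertmanager-1",
			2 << 30: "alertmanager-2",
			3 << 30: "alertmanager-3",
		},
	}
	// Every alertmanager evaluates the same lookup, so exactly one of
	// them considers itself the owner of each tenant.
	owner := r.get(hashUser("tenant-1"))
	fmt.Println(owner != "") // true: some instance always owns the tenant
}
```

Because each instance only has to agree on the ring contents and the hash function, no coordination beyond the ring's KV store is needed to decide ownership.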

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@gotjosh mentioned this pull request Jan 8, 2021
@gotjosh gotjosh force-pushed the alertmanager-ring branch from da60e34 to 734cf39 Compare January 9, 2021 20:41
@pstibrany (Contributor) previously approved these changes on Jan 11, 2021, leaving a comment:


LGTM. This follows the same design as the ruler, except we only shard by userID, so shuffle-sharding is not necessary.

Resolved review threads: pkg/alertmanager/multitenant.go (3), pkg/ring/model_test.go (1)
@codesome (Contributor) left a comment:


Can we also add an /alertmanager/ring admin endpoint for this, like the other rings, please?

@gotjosh (Author) commented Jan 11, 2021

@pstibrany comments addressed. Also renamed a few flags/config to match the most recent similar ones.

@codesome done!

@gotjosh gotjosh marked this pull request as ready for review January 12, 2021 01:26
@pstibrany pstibrany self-requested a review January 12, 2021 16:52
@pstibrany (Contributor) commented Jan 13, 2021

I don't see a way to retract my approval since it's still a draft (found it), but I plan to recheck this PR in light of recent issues with default configuration (somewhat related to #3679). That is: what happens when an alertmanager no longer owns a tenant but still receives HTTP calls for it? It should definitely not upload a blank config file to the store. I want to double-check that.

@pstibrany pstibrany dismissed their stale review January 13, 2021 07:48

I plan to recheck PR for what happens with API calls when AM doesn't own the user.

@pstibrany (Contributor) added:
Also, please rebase to use the changed ring operation code: the ring operation should now be defined directly in the alertmanager package.

@codesome (Contributor) left a comment:


I just noticed when trying this PR that we are not setting the port in the ring config before getting the lifecycler config, so the AM's ring entry is registered with port 0.
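The ordering bug described above can be illustrated with simplified, hypothetical stand-ins for the ring and lifecycler configs (the real types live in Cortex's ring package): deriving the lifecycler config copies the port by value, so the port must be set on the ring config first.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real config types.
type RingConfig struct{ ListenPort int }
type LifecyclerConfig struct{ Port int }

// ToLifecyclerConfig copies values at call time; later changes to the
// ring config do not propagate to the derived lifecycler config.
func (c RingConfig) ToLifecyclerConfig() LifecyclerConfig {
	return LifecyclerConfig{Port: c.ListenPort}
}

func main() {
	var ring RingConfig

	// Wrong order: derive the lifecycler config first...
	lc := ring.ToLifecyclerConfig()
	ring.ListenPort = 9095 // ...then set the port. Too late.
	fmt.Println(lc.Port)   // 0: the ring entry would advertise port 0

	// Right order: set the port, then derive the config.
	ring.ListenPort = 9095
	lc = ring.ToLifecyclerConfig()
	fmt.Println(lc.Port) // 9095
}
```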

@pracucci (Contributor) left a comment:


Do I need a changelog entry now?

I would say yes.

Do we need to mark the flag as experimental?

Not the flag (we don't do that anymore because it's a pain once the feature moves to stable), but you should mention that the feature is experimental in docs/configuration/v1-guarantees.md.

Resolved review threads: pkg/alertmanager/multitenant.go (6), integration/configs.go (1), pkg/alertmanager/replication_strategy.go (1)
@@ -685,6 +685,8 @@ func (t *Cortex) initConfig() (serv services.Service, err error) {
}

func (t *Cortex) initAlertManager() (serv services.Service, err error) {
t.Cfg.Alertmanager.Ring.ListenPort = t.Cfg.Server.HTTPListenPort
A contributor commented:

This is the port advertised in the ring. Given the ring is just used internally, shouldn't be the GRPC port like all other services? Why do we need to advertise the HTTP port?

The author replied:

I think we do. I'm not entirely sure what using a different port entails, but this ring is also used by the distributors. Could you point me in the right direction to understand how the exposed port makes a difference here?

The author added:

cc: @codesome

A contributor replied:

Currently the AM serves only HTTP requests, so to start with, AM distributors would contact AMs via HTTP (to keep the changes simpler). Shouldn't it be the HTTP port for now? Later, when/if we switch to gRPC, we can use the gRPC port in the ring.

Resolved review thread: pkg/alertmanager/multitenant.go
@pracucci (Contributor) left a comment:


I did another pass and left more comments (mostly minor nits). I haven't reviewed the tests yet; I'll do that in the next pass, but I would prefer to see these comments addressed first.

Resolved review threads: pkg/alertmanager/alertmanager_ring.go (1), pkg/alertmanager/multitenant.go (8)
Comment on lines 513 to 516
level.Debug(am.logger).Log("msg", "configuration owned", "user", userID)
ownedConfigs[userID] = cfg
} else {
level.Debug(am.logger).Log("msg", "configuration not owned, ignoring", "user", userID)
A contributor commented:

Are these debug logs really useful? I'm a bit dubious.

The author replied:

The ruler has these exact same logs, and in the past they've proven useful for understanding who owns what. Configuration mishaps at runtime might result in a particular tenant's instance not running.

I'd advocate for keeping them in.

@gotjosh gotjosh force-pushed the alertmanager-ring branch 4 times, most recently from 87c60ab to b36a45b Compare January 18, 2021 19:04
@gotjosh gotjosh requested a review from pracucci January 18, 2021 19:41
@pracucci (Contributor) left a comment:


Good job! LGTM, modulo a few last nits and the memberlist setup fix in modules.go. I've also merged PR #3677. Could you rebase, please?

Resolved review threads: docs/configuration/v1-guarantees.md (1), pkg/cortex/modules.go (1), pkg/ring/replication_strategy_test.go (2), pkg/alertmanager/multitenant.go (1), pkg/alertmanager/multitenant_test.go (4)
return metrics.GetSumOfCounters("cortex_alertmanager_sync_configs_total")
})
} else {
time.Sleep(250 * time.Millisecond)
A contributor commented:

Why this sleep? We try to avoid sleeps in tests because they are a source of flakiness. Can it be converted into a test.Poll()?

The author replied:

I tried originally, but couldn't. The problem is that we have no way of saying "wait for at least X polls before checking". 250ms is 2.5x the time needed for the sync to trigger, so hopefully this doesn't become a flake for a while.

The assertion below is comparing that the value did not change because only the initial sync happened. Hence a value of 1.

@gotjosh gotjosh force-pushed the alertmanager-ring branch 2 times, most recently from 6da94b8 to cfeff14 Compare January 19, 2021 15:54
This is the first part of the work proposed in cortexproject#3574: it introduces sharding via the ring for the Alertmanager component.

Signed-off-by: gotjosh <josue@grafana.com>
@pstibrany (Contributor) left a comment:


👍

@pracucci (Contributor) left a comment:


LGTM with a nit

RF, LiveIngesters, DeadIngesters int
ExpectedMaxFailure int
ExpectedError string
replifcationFactor, liveIngesters, deadIngesters int
A contributor commented:

replifcationFactor → replicationFactor

Signed-off-by: gotjosh <josue@grafana.com>
@pracucci pracucci merged commit 3aba107 into cortexproject:master Jan 19, 2021
@gotjosh gotjosh deleted the alertmanager-ring branch February 24, 2021 13:46

4 participants