
Alertmanager: Allow sharding of alertmanager tenants #3664

Merged (4 commits) Jan 19, 2021

Conversation

@gotjosh (Contributor) commented Jan 8, 2021

What this PR does:

This is the first part of the work proposed in #3574: it introduces sharding via the ring for the Alertmanager component.

I have a few things to test and a couple of doubts; I'll keep a checklist here for visibility:

  • Do we need to mark the flag as experimental? No.
  • Do I need a changelog entry now? Yes.
  • How does this behave with the current clustering logic? Perhaps disabling clustering while sharding is enabled is the way to go until we're done.
  • Add an integration test for both the sharding and sharding + clustering.
  • Do I need any additional documentation now or can that be added later on?
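For context, sharding tenants via a hash ring can be sketched as follows. This is a self-contained toy illustration with made-up tokens and instance names; the real implementation uses Cortex's ring package and lifecycler:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ringDesc is a toy hash ring: sorted tokens, each owned by an instance.
type ringDesc struct {
	tokens []uint32
	owner  map[uint32]string
}

// get returns the instance owning the first token at or after key,
// wrapping around the ring when key is past the last token.
func (r ringDesc) get(key uint32) string {
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= key })
	if i == len(r.tokens) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.tokens[i]]
}

// hashUser maps a tenant ID onto the ring's token space.
func hashUser(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()
}

func main() {
	r := ringDesc{
		tokens: []uint32{1 << 30, 2 << 30, 3 << 30},
		owner: map[uint32]string{
			1 << 30: "alertmanager-1",
			2 << 30: "alertmanager-2",
			3 << 30: "alertmanager-3",
		},
	}
	// Every alertmanager evaluates the same lookup, so exactly one of
	// them considers itself the owner of each tenant.
	owner := r.get(hashUser("tenant-1"))
	fmt.Println(owner != "") // true: some instance always owns the tenant
}
```

Because each instance only has to agree on the ring contents and the hash function, no coordination beyond the ring's KV store is needed to decide ownership.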

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@gotjosh mentioned this pull request Jan 8, 2021
@gotjosh gotjosh force-pushed the alertmanager-ring branch from da60e34 to 734cf39 Compare January 9, 2021 20:41
@pstibrany (Contributor) previously approved these changes on Jan 11, 2021, leaving a comment:


LGTM. This follows the same design as the ruler, except we only shard by userID, so shuffle-sharding is not necessary.

Resolved review threads: pkg/alertmanager/multitenant.go (3), pkg/ring/model_test.go (1)
@codesome (Contributor) left a comment:


Can we also add an /alertmanager/ring admin endpoint for this, like the other rings, please?

@gotjosh (Author) commented Jan 11, 2021

@pstibrany comments addressed. Also renamed a few flags/config to match the most recent similar ones.

@codesome done!

@gotjosh gotjosh marked this pull request as ready for review January 12, 2021 01:26
@pstibrany pstibrany self-requested a review January 12, 2021 16:52
@pstibrany (Contributor) commented Jan 13, 2021

I don't see a way to retract my approval since it's still a draft (found it), but I plan to recheck this PR in light of recent issues with default configuration (somewhat related to #3679). That is: what happens when an alertmanager no longer owns a tenant but still receives HTTP calls for it? It should definitely not upload a blank config file to the store. I want to double-check that.

@pstibrany pstibrany dismissed their stale review January 13, 2021 07:48

I plan to recheck PR for what happens with API calls when AM doesn't own the user.

@pstibrany (Contributor) added:
Also, please rebase to use the changed ring operation code: the ring operation should now be defined directly in the alertmanager package.

@codesome (Contributor) left a comment:


I just noticed when trying this PR that we are not setting the port in the ring config before getting the lifecycler config, so the AM's ring entry is registered with port 0.
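The ordering bug described above can be illustrated with simplified, hypothetical stand-ins for the ring and lifecycler configs (the real types live in Cortex's ring package): deriving the lifecycler config copies the port by value, so the port must be set on the ring config first.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real config types.
type RingConfig struct{ ListenPort int }
type LifecyclerConfig struct{ Port int }

// ToLifecyclerConfig copies values at call time; later changes to the
// ring config do not propagate to the derived lifecycler config.
func (c RingConfig) ToLifecyclerConfig() LifecyclerConfig {
	return LifecyclerConfig{Port: c.ListenPort}
}

func main() {
	var ring RingConfig

	// Wrong order: derive the lifecycler config first...
	lc := ring.ToLifecyclerConfig()
	ring.ListenPort = 9095 // ...then set the port. Too late.
	fmt.Println(lc.Port)   // 0: the ring entry would advertise port 0

	// Right order: set the port, then derive the config.
	ring.ListenPort = 9095
	lc = ring.ToLifecyclerConfig()
	fmt.Println(lc.Port) // 9095
}
```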

@pracucci (Contributor) left a comment:


Do I need a changelog entry now?

I would say yes.

Do we need to mark the flag as experimental?

Not the flag (we don't do that anymore because it's a pain once the feature moves to stable), but you should mention that the feature is experimental in docs/configuration/v1-guarantees.md.

Resolved review threads: pkg/alertmanager/multitenant.go (6), integration/configs.go (1), pkg/alertmanager/replication_strategy.go (1)
@@ -685,6 +685,8 @@ func (t *Cortex) initConfig() (serv services.Service, err error) {
}

func (t *Cortex) initAlertManager() (serv services.Service, err error) {
t.Cfg.Alertmanager.Ring.ListenPort = t.Cfg.Server.HTTPListenPort
A contributor commented:

This is the port advertised in the ring. Given the ring is just used internally, shouldn't be the GRPC port like all other services? Why do we need to advertise the HTTP port?

The author replied:

I think we do. I'm not entirely sure what using a different port entails, but this ring is also used by the distributors. Could you point me in the right direction to understand how the exposed port makes a difference here?

The author added:

cc: @codesome

A contributor replied:

Currently the AM serves only HTTP requests, so to start with, AM distributors would contact AMs via HTTP (to keep the changes simpler). Shouldn't it be the HTTP port for now? Later, when/if we switch to gRPC, we can use the gRPC port in the ring.

Resolved review thread: pkg/alertmanager/multitenant.go
@pracucci (Contributor) left a comment:


I did another pass and left more comments (mostly minor nits). I haven't reviewed the tests yet; I'll do that in the next pass, but I would prefer to see these comments addressed first.

Resolved review threads: pkg/alertmanager/alertmanager_ring.go (1), pkg/alertmanager/multitenant.go (8)
Comment on lines 513 to 516
level.Debug(am.logger).Log("msg", "configuration owned", "user", userID)
ownedConfigs[userID] = cfg
} else {
level.Debug(am.logger).Log("msg", "configuration not owned, ignoring", "user", userID)
A contributor commented:

Are these debug logs really useful? I'm a bit dubious.

The author replied:

The ruler has these exact same logs, and in the past they've proven useful for understanding who owns what. Configuration mishaps at runtime might result in a particular tenant's instance not running.

I'd advocate for keeping them in.

@gotjosh gotjosh force-pushed the alertmanager-ring branch 4 times, most recently from 87c60ab to b36a45b Compare January 18, 2021 19:04
@gotjosh gotjosh requested a review from pracucci January 18, 2021 19:41
@pracucci (Contributor) left a comment:


Good job! LGTM, modulo a few last nits and the memberlist setup fix in modules.go. I've also merged PR #3677. Could you rebase, please?

Resolved review threads: docs/configuration/v1-guarantees.md (1), pkg/cortex/modules.go (1), pkg/ring/replication_strategy_test.go (2), pkg/alertmanager/multitenant.go (1), pkg/alertmanager/multitenant_test.go (4)
return metrics.GetSumOfCounters("cortex_alertmanager_sync_configs_total")
})
} else {
time.Sleep(250 * time.Millisecond)
A contributor commented:

Why this sleep? We try to avoid sleeps in tests because they are a source of flakiness. Can it be converted into a test.Poll()?

The author replied:

I tried originally, but couldn't. The problem is that we have no way of saying "wait for at least X polls before checking". 250ms is 2.5x the time needed for the sync to trigger, so hopefully this doesn't become a flake for a while.

The assertion below is comparing that the value did not change because only the initial sync happened. Hence a value of 1.

@gotjosh gotjosh force-pushed the alertmanager-ring branch 2 times, most recently from 6da94b8 to cfeff14 Compare January 19, 2021 15:54
This is the first part of the work proposed in cortexproject#3574: it introduces sharding via the ring for the Alertmanager component.

Signed-off-by: gotjosh <josue@grafana.com>
@pstibrany (Contributor) left a comment:


👍

@pracucci (Contributor) left a comment:


LGTM with a nit

RF, LiveIngesters, DeadIngesters int
ExpectedMaxFailure int
ExpectedError string
replifcationFactor, liveIngesters, deadIngesters int
A contributor commented:

replifcationFactor → replicationFactor

Signed-off-by: gotjosh <josue@grafana.com>
@pracucci pracucci merged commit 3aba107 into cortexproject:master Jan 19, 2021
@gotjosh gotjosh deleted the alertmanager-ring branch February 24, 2021 13:46

4 participants