add configurable final-sleep time for basic lifecycler #5517
Conversation
Force-pushed from a5bb4d0 to 9f603aa
pkg/storegateway/gateway_ring.go (outdated)
@@ -109,6 +111,8 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
 	f.DurationVar(&cfg.WaitStabilityMinDuration, ringFlagsPrefix+"wait-stability-min-duration", time.Minute, "Minimum time to wait for ring stability at startup. 0 to disable.")
 	f.DurationVar(&cfg.WaitStabilityMaxDuration, ringFlagsPrefix+"wait-stability-max-duration", 5*time.Minute, "Maximum time to wait for ring stability at startup. If the store-gateway ring keeps changing after this period of time, the store-gateway will start anyway.")
+
+	f.DurationVar(&cfg.FinalSleep, ringFlagsPrefix+"final-sleep", 0*time.Second, "The sleep seconds when store-gateway is shutting down.")
Why does this have a default value of 0s while other components have 30s?
Because the store-gateway has replication, it is okay even if we don't sleep and other pods still think this store-gateway is healthy: we will only get one error.
We cannot make that assumption. The store-gateway can still run with a replication factor of 1 if configured.
Okay, sure. I set the default to zero for the store-gateway, ruler, and alertmanager, so it is up to the user to make the right decision about the final delay.
If they don't run the ruler/alertmanager in HA and decide to use the DynamoDB KV store, the delay needs to be larger than the DynamoDB puller-sync-time.
If they do run the ruler/alertmanager in HA, this final sleep is not needed; they can afford one error.
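To make that rule concrete, here is a minimal sketch of a validation helper. This is not part of the PR; `validateFinalSleep`, `kvSyncInterval`, and `haEnabled` are illustrative names, and the 30s value simply mirrors the DynamoDB sync time discussed in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// validateFinalSleep encodes the rule above: without HA and with a polled KV
// store such as DynamoDB, the final sleep must be at least the interval other
// instances use to refresh their view of the ring, or they may keep routing
// requests to an instance that has already exited.
func validateFinalSleep(finalSleep, kvSyncInterval time.Duration, haEnabled bool) error {
	if haEnabled {
		// With HA, a single failed request is tolerable; no sleep required.
		return nil
	}
	if finalSleep < kvSyncInterval {
		return fmt.Errorf("final sleep %s is shorter than the KV sync interval %s; "+
			"ring changes may not propagate before shutdown", finalSleep, kvSyncInterval)
	}
	return nil
}

func main() {
	// Example: a DynamoDB-backed ring polled every 30s, no HA.
	if err := validateFinalSleep(10*time.Second, 30*time.Second, false); err != nil {
		fmt.Println(err)
	}
}
```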
Please document this properly if we add this flag
Where should I put the explanation? It's kind of long. I tried to add some comments to the flag description, but I don't think that is a good place for a long explanation.
My recommendation:
"The sleep time to wait before pod shutdown, allowing ring changes to propagate."
@wenxu1024 we need to add this flag to the changelog; there you can explain it better. Maybe that is what @yeya24 was referring to.
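For illustration, the registration with that suggested wording might read as follows. This is a sketch only: `ExampleRingConfig` and the `ruler.ring.` prefix are placeholders, not the PR's actual code.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// ExampleRingConfig is a placeholder mirroring the RingConfig structs in the diffs above.
type ExampleRingConfig struct {
	FinalSleep time.Duration
}

func (cfg *ExampleRingConfig) RegisterFlags(f *flag.FlagSet, prefix string) {
	// Flag help text using the wording suggested in the review.
	f.DurationVar(&cfg.FinalSleep, prefix+"final-sleep", 0,
		"The sleep time to wait before pod shutdown, allowing ring changes to propagate.")
}

func main() {
	var cfg ExampleRingConfig
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	cfg.RegisterFlags(fs, "ruler.ring.")
	_ = fs.Parse([]string{"-ruler.ring.final-sleep=30s"})
	fmt.Println(cfg.FinalSleep) // 30s
}
```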
You can document it at https://cortexmetrics.io/docs/configuration/v1guarantees/
@@ -79,6 +81,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
 	cfg.KVStore.RegisterFlagsWithPrefix(rfprefix, "alertmanagers/", f)
 	f.DurationVar(&cfg.HeartbeatPeriod, rfprefix+"heartbeat-period", 15*time.Second, "Period at which to heartbeat to the ring. 0 = disabled.")
 	f.DurationVar(&cfg.HeartbeatTimeout, rfprefix+"heartbeat-timeout", time.Minute, "The heartbeat timeout after which alertmanagers are considered unhealthy within the ring. 0 = never (timeout disabled).")
+	f.DurationVar(&cfg.FinalSleep, rfprefix+"final-sleep", 30*time.Second, "The sleep seconds when alertmanager is shutting down.")
Is this flag useful mainly for migrating from memberlist to DynamoDB, and not in other cases?
If it is only useful for a specific scenario, I would prefer a default value of 0s, which doesn't change existing behavior.
For memberlist the effect is small, because the information reaches other nodes within about 2 seconds. But with DynamoDB it takes about 30s, so we need to sleep to give other pods time to learn this pod is going away.
@danielblando what do you think? A default of zero?
We do have other ring implementations too, not only DynamoDB and memberlist. The value should take all of them into consideration.
To avoid breaking existing setups, we should default to 0s and document the recommended value for each ring store.
Same as before: just set the default to 0, so the default behavior stays the same.
Force-pushed from a0d6547 to 0045d7a
Force-pushed from 0045d7a to 94bcd70
Merge on green. Thanks @wenxu1024
What this PR does:
Add a configurable final sleep to the ruler/alertmanager/store-gateway lifecycler. This overcomes the information delay in the ring when the DynamoDB KV store is used as the backend.
For example: Ruler-1 is stopping and sets itself to LEAVING. That information takes about 30s to reach the other rulers. Within those 30s, if another ruler handles a GetRules call, it iterates over all the rulers it believes are healthy and wrongly counts the LEAVING ruler among them. In reality, the LEAVING ruler has already shut down (because there is no delay), so GetRules fails (we don't have HA, so we cannot tolerate a single ruler failure).
The solution is to add a delay once the ruler has set itself to LEAVING. During this window, if another ruler sends a request, the LEAVING ruler can still handle it. Once the other rulers realize this ruler is LEAVING, they stop sending further requests to it, and we are fine from then on.
This is the same idea as the ingester lifecycler: while the ingester is shutting down, it can still perform its shutdown duties and then sleep for a final 30s so its metrics can be scraped.
Version 2:
It turned out that if we sleep right after the instance is set to LEAVING, then during that sleep other instances see this instance as LEAVING while it is still in the ring, and we hit another problem: https://github.com/cortexproject/cortex/blob/master/pkg/ring/ring.go#L556
So the error just changed from
"unable to retrieve rules from ruler"
to
"too many unhealthy instances in the ring"
This is not what we wanted.
So we propose to sleep after the instance is removed from the ring. The next heartbeat of the other instances then sees that this instance is gone, and we no longer get the "too many unhealthy instances in the ring" error.
This is a workaround for components that are not yet HA (e.g. the ruler).
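A minimal sketch of that "Version 2" shutdown ordering, under the simplifying assumption of a lifecycler with `changeState`/`unregisterInstance` hooks (illustrative names, not the exact cortex code):

```go
package main

import (
	"fmt"
	"time"
)

type lifecycler struct {
	finalSleep time.Duration
}

func (l *lifecycler) changeState(state string) { fmt.Println("state ->", state) }
func (l *lifecycler) unregisterInstance()      { fmt.Println("removed from ring") }

// stopping shows the "Version 2" ordering: remove the instance from the ring
// FIRST, then sleep, so other instances' next ring refresh sees it gone
// instead of counting it as an unhealthy LEAVING member.
func (l *lifecycler) stopping() {
	l.changeState("LEAVING") // stop taking on new work
	l.unregisterInstance()   // remove this instance from the ring KV store
	time.Sleep(l.finalSleep) // keep serving stragglers while the removal propagates
}

func main() {
	(&lifecycler{finalSleep: 2 * time.Second}).stopping()
}
```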
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]