
add configurable final-sleep time for basic lifecycler #5517


Merged — 5 commits merged into cortexproject:master from add-delay-in-basic-life-cycler on Aug 29, 2023

Conversation

wenxu1024
Contributor

@wenxu1024 wenxu1024 commented Aug 21, 2023

What this PR does:

Add a configurable final sleep to the ruler/alertmanager/store-gateway lifecycler. This is to overcome the information delay in the ring when DDB KV is used as the backend.
For example, Ruler-1 is stopping and sets itself to LEAVING. That information takes about 30s to reach the other rulers. Within those 30s, if another ruler receives a GetRules call, it iterates over all the rulers it believes are healthy and wrongly treats the LEAVING ruler as one of them. In reality, the LEAVING ruler has already shut itself down (because there is no delay), so the GetRules call fails (we don't have HA, so we cannot tolerate a single ruler failure).

The solution is to add a delay after the ruler sets itself to LEAVING. During this LEAVING period, if another ruler sends a request, the LEAVING ruler can still handle it. Once the other rulers realize this ruler is LEAVING, they stop sending it further requests, and we are okay from then on.

This is a similar idea to the ingester lifecycler: when the ingester is shutting itself down, it can still perform its processShutdown duties and do a final sleep of 30s so its metrics can be scraped.


Version 2:
We found that if we sleep right after the instance is set to LEAVING, then during this sleep the other instances see this instance as LEAVING (but it is still in the ring), and we hit another problem: https://github.com/cortexproject/cortex/blob/master/pkg/ring/ring.go#L556

So the error just changes from "unable to retrieve rules from ruler" to "too many unhealthy instances in the ring".

This is not what we wanted.

So we propose to sleep after the instance is removed from the ring. The next heartbeat of the other instances will then see that this instance is gone, so we won't hit the "too many unhealthy instances in the ring" error.

This is a workaround for components that are not HA yet (e.g. the ruler).
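The shutdown ordering described above can be sketched as follows. This is a minimal illustration, not the actual lifecycler code: `setState` and `removeFromRing` are hypothetical stand-ins for the basic lifecycler's KV-store operations, and the key point is that the final sleep happens after the ring removal, not right after the state change.

```go
package main

import (
	"fmt"
	"time"
)

// stop sketches the shutdown order this PR settles on: mark the
// instance LEAVING, remove it from the ring, and only then sleep
// for finalSleep so the other instances' next ring poll observes
// the removal before this process actually exits.
func stop(setState func(state string), removeFromRing func(), finalSleep time.Duration) {
	setState("LEAVING")
	removeFromRing()
	// Sleeping after removal (not right after LEAVING) avoids the
	// "too many unhealthy instances in the ring" error, because a
	// LEAVING instance still counts as a ring member.
	time.Sleep(finalSleep)
}

func main() {
	stop(
		func(s string) { fmt.Println("state ->", s) },
		func() { fmt.Println("removed from ring") },
		0, // final sleep; 0 here keeps the demo instant
	)
}
```

During the `finalSleep` window the process keeps serving requests from peers that have not yet observed the removal, which is exactly the gap the DDB KV propagation delay creates.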

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@wenxu1024 wenxu1024 force-pushed the add-delay-in-basic-life-cycler branch from a5bb4d0 to 9f603aa Compare August 21, 2023 21:27
@@ -109,6 +111,8 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
f.DurationVar(&cfg.WaitStabilityMinDuration, ringFlagsPrefix+"wait-stability-min-duration", time.Minute, "Minimum time to wait for ring stability at startup. 0 to disable.")
f.DurationVar(&cfg.WaitStabilityMaxDuration, ringFlagsPrefix+"wait-stability-max-duration", 5*time.Minute, "Maximum time to wait for ring stability at startup. If the store-gateway ring keeps changing after this period of time, the store-gateway will start anyway.")

f.DurationVar(&cfg.FinalSleep, ringFlagsPrefix+"final-sleep", 0*time.Second, "The sleep seconds when store-gateway is shutting down.")
Contributor

Why does this have a default value of 0s while other components have 30s?

Contributor Author

Because the store-gateway has replication, it is okay even if we don't sleep and other pods still think this SG is healthy; we will only get one error.

Contributor

@yeya24 yeya24 Aug 22, 2023

We cannot make such assumptions. Store gateway can still run with replication factor set to 1 if configured.

Contributor Author

Okay, sure. I set the default to zero for the SG, ruler, and alertmanager, so it is up to the user to make the right decision on the final delay.

If they don't have HA for the ruler/alertmanager and decided to use DDB KV, then the delay needs to be larger than the DDB puller-sync-time.

If they do have HA for the ruler/alertmanager, then this finalSleep is not needed; they can afford one error.
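As a concrete illustration of that guidance, a non-HA ruler on a DDB-backed ring might set the final sleep above the puller sync interval. The flag name below follows the `<prefix>final-sleep` pattern from this PR's diff, but the exact ruler prefix and the 60s value are assumptions for illustration, not values taken from the PR:

```shell
# Non-HA ruler using the DynamoDB KV store for its ring.
# final-sleep is set above the (assumed) DDB puller sync interval
# so peers observe the removal before the process exits.
cortex -target=ruler \
  -ruler.ring.final-sleep=60s
```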

Contributor

@yeya24 yeya24 Aug 22, 2023

Please document this properly if we add this flag.

Contributor Author

Where should I put the explanation? It's kind of long. I tried adding some comments to the flag description, but I don't think that's a good place for a long explanation.

Contributor

My recommendation:
"How long to sleep before pod shutdown, allowing a delay for ring changes to propagate."

@wenxu1024 we need to add this flag in the changelog. There you can explain better. Maybe that is what @yeya24 was referring to.

Contributor

@@ -79,6 +81,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
cfg.KVStore.RegisterFlagsWithPrefix(rfprefix, "alertmanagers/", f)
f.DurationVar(&cfg.HeartbeatPeriod, rfprefix+"heartbeat-period", 15*time.Second, "Period at which to heartbeat to the ring. 0 = disabled.")
f.DurationVar(&cfg.HeartbeatTimeout, rfprefix+"heartbeat-timeout", time.Minute, "The heartbeat timeout after which alertmanagers are considered unhealthy within the ring. 0 = never (timeout disabled).")
f.DurationVar(&cfg.FinalSleep, rfprefix+"final-sleep", 30*time.Second, "The sleep seconds when alertmanager is shutting down.")
Contributor

Is this flag useful mainly for migrating from memberlist to DDB, and not in other cases?
If it is only useful for a specific scenario, I would prefer to set the default value to 0s, which doesn't change existing behavior.

Contributor Author

For memberlist the effect is small, because the information reaches the other nodes within about 2 seconds. But for DDB it takes 30s, so we need to sleep to give the other pods time to learn that this pod is going away.
@danielblando what do you think? A default of zero?

Contributor

@yeya24 yeya24 Aug 22, 2023

We do have other ring implementations too, not only DDB and memberlist, and the value should take all of them into consideration.
To avoid breaking existing setups, we should default to 0s and document the recommended value for each ring store.

Contributor Author

Same as before: I set the default to 0, so the default behavior is unchanged.

Signed-off-by: Wen Xu <wenxuamz@amazon.com>
Signed-off-by: Wen Xu <wenxuamz@amazon.com>
…tore

Signed-off-by: Wen Xu <wenxuamz@amazon.com>
Signed-off-by: Wen Xu <wenxuamz@amazon.com>
@wenxu1024 wenxu1024 force-pushed the add-delay-in-basic-life-cycler branch from a0d6547 to 0045d7a Compare August 29, 2023 17:46
Signed-off-by: Wen Xu <wenxuamz@amazon.com>
@wenxu1024 wenxu1024 force-pushed the add-delay-in-basic-life-cycler branch from 0045d7a to 94bcd70 Compare August 29, 2023 17:47
Contributor

@yeya24 yeya24 left a comment

Merge on green. Thanks @wenxu1024

@yeya24 yeya24 merged commit 7a76e51 into cortexproject:master Aug 29, 2023
3 participants