-
Notifications
You must be signed in to change notification settings - Fork 1.1k
choose switchover candidate based on lag and role #1700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Sounds good, lets make it the one with least lag, not with 0 lag. |
| return spec.NamespacedName{Namespace: master.Namespace, Name: candidates[0].Name}, nil | ||
| } | ||
|
|
||
| return spec.NamespacedName{}, fmt.Errorf("no switchover candidate found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does it return the empty spec.NamespacedName{} and not nil in case of error ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go says: "cannot use nil (untyped nil value) as spec.NamespacedName value in return statement" 😃
| Replica PostgresRole = "replica" | ||
| Leader PostgresRole = "leader" | ||
| StandbyLeader PostgresRole = "standby_leader" | ||
| SyncStandby PostgresRole = "sync_standby" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel there is a comment missing here:
- how does operator treat differently
Master,StandbyLeaderon the one hand andLeaderon the other ? BothMasterandStandbyLeaderare leaders. SyncStanbdyis still aReplica
While it is still possible to figure it out from the code, a short comment about why these are not completely independent states is worthy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
master and replica are Spilo roles. The other ones role names returned by Patroni's cluster endpoint. Added comments
pkg/util/patroni/patroni.go
Outdated
| Role string `json:"role"` | ||
| State string `json:"state"` | ||
| Timeline int `json:"timeline"` | ||
| LagInMB int `json:"lag"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Patroni always return lag in MB and not simply in bytes ?
for example, the patronictl converts bytes to MB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to just Lag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm maybe lag is reserved for something else:
cannot unmarshal string into Go struct field ClusterMember.members.lag of type int
renamed it back, since I think it's always in MB.
|
👍 |
1 similar comment
|
👍 |
Change 1: When attempting a switchover, try it only on a replica that has no replication lag. In case synchronous replication is enabled there will be a replica with
sync_standbyrole which should be chosen instead for switchover.Change 2: In case the switchover is not executed or failing, return an error instead of emitting a
Warning. Before we were always continuing by replacing the master pod anyway - producing downtime. Now, the master keeps running on an outdated pod. But, the Pod will have the rolling-update flag and the operator will notice it on the next sync and try another failover. It can be, that the failover just took longer, e.g. because taking a CHECKPOINT took longer than Patroni API waits to respond to the user. In this ideal case only a replica would be recreated. In worst case, a switchover might be retried on each sync. But this should be fixed by the admin anyway.fixes #1686
fixes #109
fixes #600