ETCD-328: automatic replacement of an unhealthy member machine (#947)
Conversation
/hold
ac83f08 force-pushed to e3bc749
Tested on a 4.12 cluster with this PR on top.
Scenario: logs from the etcd operator
	}
	if currentVotingMemberIPListSet.Len() <= desiredControlPlaneReplicasCount {
		klog.V(2).Infof("@Mustafa: currentVotingMemberIPListSet.Len() is %v", currentVotingMemberIPListSet.Len())
	if currentVotingMemberIPListSet.Len() < desiredControlPlaneReplicasCount {
Suggested change:
-if currentVotingMemberIPListSet.Len() < desiredControlPlaneReplicasCount {
+if liveVotingMembers, err := c.getAllVotingMembers(ctx); err == nil && len(liveVotingMembers) < desiredControlPlaneReplicasCount {
@hasbro17 I propose this change, since the etcd-endpoints configmap takes some time to remove a member's IP after it has been removed from the membership.
@tjungblu wdyt?
cc @JoelSpeed
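A minimal sketch of why sourcing from the live member list matters (names, IPs, and shapes here are illustrative, not the operator's actual code): the etcd-endpoints configmap can lag behind the real membership, so gating scale-down on it risks acting on stale data.

```go
package main

import "fmt"

// canScaleDown is a hypothetical gate (not the operator's real code):
// only attempt removing a member while we still have at least the
// desired number of voting members.
func canScaleDown(votingMembers []string, desiredReplicas int) bool {
	return len(votingMembers) >= desiredReplicas
}

func main() {
	desired := 3

	// Stale etcd-endpoints configmap view: a just-removed member's IP
	// may still be listed for up to a minute.
	cachedIPs := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	// Live member list from the etcd cluster: the removal is visible.
	liveIPs := []string{"10.0.0.1", "10.0.0.2"}

	fmt.Println(canScaleDown(cachedIPs, desired)) // true: stale data would allow another removal
	fmt.Println(canScaleDown(liveIPs, desired))   // false: live data skips scale-down
}
```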
Just so I understand the change: is it to make sure we get the current voting members from the etcd cluster itself rather than from the configmap? If so, that seems like a sensible move to make sure we are not going to break quorum, so +1 from me.
@JoelSpeed yes, exactly, that is the reason. Also, all tests passed :D 🎉
I am testing now and will update with the result before cleaning up the logs and adding tests.
/retest-required
The test cases pass locally!! :/
/retest-required
	liveVotingMembers, err := c.getAllVotingMembers(ctx)
	if err != nil {
		return err
e7dcf51 force-pushed to 5d64b1c
/hold cancel
/label tide/merge-method-squash
/assign @hasbro17
/retest-required
1 similar comment
/retest-required
I've left some comments on the testing docs, but I think there are some assumptions in these test cases about how the CPMS works that aren't true. CPMS works on a concept of indexes: each index represents one desired control plane instance, and these indexes are typically 0, 1 and 2 in a cluster you've just created. So, for example, the installer creates a machine in index 0 called master-0. When the CPMS sees an additional machine in the cluster called, say, master-3, it thinks this is a 4th index. Because the CPMS by default is set to ...
If the CPMS were to see a machine called master-foo-0, then, because this has a 0 at the end, it would assume it is part of index 0. This is how we track replacing machines, btw. So it would say: ok, I now have 2 machines in index 0, one of them must be replacing the other. If you delete one of these, eventually the cluster should settle again and CPMS will be happy. We actually intend to delete those extra machines in the index, but that is currently blocked because the etcd e2e can't handle that behaviour yet.
I think in the testing here you've created extra indexes; then, when you delete the machines, the CPMS sees 3 machines in 3 indexes and goes back to an available state, which is why no more machines are created; it's happy once they've been deleted.
Note that horizontal scaling is not a feature of CPMS yet; we need to make sure etcd can horizontally scale before we introduce that, so you are stuck with 3 replicas (you can force it to 5 if you want, but this isn't technically supported; no customer is doing this).
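CPMS's actual index-matching lives in the control-plane-machine-set operator; purely as an illustration of the naming behaviour described above (indexOf is a hypothetical helper, not the real implementation), the trailing number of a machine name determines its index, so master-0 and master-foo-0 both land in index 0 while master-3 opens a new index:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// indexOf is an illustrative helper (not CPMS's real code): it treats
// the trailing number of a machine name as its index.
func indexOf(machineName string) (int, bool) {
	// Find the last rune that is not a digit; everything after it is
	// the numeric suffix.
	i := strings.LastIndexFunc(machineName, func(r rune) bool { return !unicode.IsDigit(r) })
	idx, err := strconv.Atoi(machineName[i+1:])
	if err != nil {
		return 0, false // no trailing digits, no index
	}
	return idx, true
}

func main() {
	for _, name := range []string{"master-0", "master-foo-0", "master-3", "master-a"} {
		idx, ok := indexOf(name)
		fmt.Printf("%s -> index %d (valid: %v)\n", name, idx, ok)
	}
}
```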
@JoelSpeed there's a customer that runs this, which is why we have the five-control-plane test in our PRs, so it's certainly supported and used.
tjungblu left a comment:
Left a couple of comments, lgtm otherwise. Great tests and love the logs.
	liveVotingMembers, err := c.getAllVotingMembers(ctx)
	if err != nil {
		// TODO: update status condition
		// TODO: should go degraded?
Nope, that's a transient issue and will be retried next recon cycle.
Hmmm, good point. So out of curiosity, when should we use Status.Conditions?
I learnt about them recently in k8s operator tutorials :D
When you want to tell the admin to fix something specifically. I'd personally still prefer logs, as they are a bit easier to understand over time.
Go Degraded only if we should block the upgrades, but my philosophy is also different from what's used across the codebase.
Thanks a lot, now I got it :)
		return fmt.Errorf("could not find master machines with deletion hook: %w", err)
	}

	var votingMembersMachines []*machinev1beta1.Machine
Just to make the logic below consistent, should currentVotingMemberIPListSet be sourced from liveVotingMembers too? Otherwise you have one set generated from the configmap and the other from the etcd member list.
I actually wanted to do this, but did not want to change the logic completely.
I had to use liveVotingMembers for deciding whether to attempt scaling down or skip. During testing, I noticed that the configmap takes time to reflect the cluster membership, and the CEO keeps spinning, waiting for a new CM to be created.
+1 to replacing currentVotingMemberIPListSet with liveVotingMembers, wdyt @hasbro17?
I had the opposite thought, actually. For the default case, when we're checking whether scale-down is allowed, i.e. votingMembers >= desiredControlPlaneReplicasCount, we don't want to make live calls to the etcd cluster.
The reason we use currentVotingMemberIPListSet (sourced from the etcd-endpoints configmap) is so we're not inducing unnecessary load in the default case, by using a cache instead of live calls to the cluster.
During testing, I noticed that the configmap takes time to reflect the cluster membership, and the CEO keeps spinning, waiting for a new CM to be created.
Not sure what CM here is (assuming machine), but other than the delay, @Elbehery, is there a reason why you can't use the currentVotingMemberIPListSet in L128-135 above? How much is the delay in practice? I would expect the etcd-endpoints controller to update within the minute once there is an actual membership change.
Thanks @hasbro17 for your review.
votingMembers >= desiredControlPlaneReplicasCount we don't want to make live calls to the etcd cluster.
The etcd-endpoints CM contains all the voting members (i.e. non-learners), whether healthy or unhealthy. In #937 (comment) we agreed to constrain the scaling-down invariant around healthyVotingMembers, iiuc.
we're not inducing unnecessary load in the default case by using a cache instead of live calls to cluster.
I think listing the cluster members is not much load; moreover, the recent changes to the etcdcli package reuse the connections :)
I would expect the etcd-endpoints controller to update within the minute once there is an actual membership change.
Yes, this is correct, it takes around a minute to update the CM. I can change it back, but may I ask why listing the healthy members directly from the etcd cluster is not correct?
The reason why we use currentVotingMemberIPListSet (sourced from the etcd-endpoints configmap) is so we're not inducing unnecessary load in the default case by using a cache instead of live calls to cluster.
That makes sense. I think we make lots of MemberList calls across the operator already (health probe, etc). The connection to etcd is already open, as Mustafa said, so getting the current list of members is not an expensive call.
I'm all for consolidating this listing into a single controller that queries it in the background every 5-10s, with all other controllers feeding from that.
How much is the delay in practice?
I don't have a direct answer for that, but let's go with the case I've added to the quorum guard here:
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L145
So when you're down to two nodes, we'll never update the CM until you add a new node. I realize now that this is stupid, but we have no other way to selectively tell the static pod controller not to roll out a revision based on that.
In that worst case, it takes as long as it takes the cluster admin to add a new node.
I'm all up for consolidating this listing into a single controller that queries this in the background every 5-10s and all other controllers just feed from that.
+1
we'll never update the CM until you add a new node.
In this case I think listing the members from the etcd client makes more sense, no?
Yeah, I forgot that we'll have client connections open from the client cache, so I guess having this controller make periodic live list calls might not add significant load. So I guess we can keep the live call here.
So when you're down to two nodes, we'll never update the CM until you add a new node.
Yeah, that's not ideal, especially since the apiserver is using the endpoints configmap and will end up talking to a non-existent endpoint if we have 2 members and the CM is outdated with 3. But it's a necessary trade-off that we've made, given that we don't want quorum loss mid-rollout.
Plus, with CPMS we should soon have a new machine whose member would be added and promoted right after the unhealthy one is removed, so this gap shouldn't be too large.
Let's keep the live list for now, and leave the consolidation of list calls into a better internal cache than the etcd-endpoints CM as a follow-up.
If we agree on sticking with the member listing, then I would definitely like to see currentVotingMemberIPListSet come from the member listing too. Otherwise this is going to be wildly inconsistent.
	return false
}

func minimumTolerableQuorum(desiredControlPlaneReplicasCount int) int {
We have something similar already in cluster-etcd-operator/pkg/etcdcli/health.go, lines 241 to 268 in 2ec6e4d.
Can we reuse that instead?
Yes, I tried, but the conditions used there are too strict, and this was the problem with scaling down.
I use totalMembers < desiredControlPlaneCount as a first check to decide whether we can attempt to scale down or skip:
	if len(liveVotingMembers) < desiredControlPlaneReplicasCount {
		klog.V(2).Infof("Ignoring scale-down since the number of etcd members (%d) < desired number of control-plane replicas (%d)", len(liveVotingMembers), desiredControlPlaneReplicasCount)
		return nil
	}
I use len(healthyLiveVotingMembers) < minimumTolerableQuorum(desiredControlPlaneReplicasCount) to decide whether we have a chance to scale down without losing quorum.
The conditions used in IsQuorumFaultTolerant() are too strict for scaling down.
That's fair, but can we consolidate the formula for computing this into a shared func instead? So not everybody has to re-invent this math and test it again; what parameters you feed the formula is up to the caller.
I am fine with that, but both methods have different args 🤔
@Elbehery You don't have to reuse the IsQuorumFaultTolerant function as-is or change its args, but you can consolidate the quorum computation into a new function and reuse that in both IsQuorumFaultTolerant and minimumTolerableQuorum, so we're not maintaining the same thing in different packages.
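The helper being discussed boils down to simple majority math; a stdlib-only sketch (function names follow the review thread, the merged code may differ): for n voting members, quorum is a strict majority, and the cluster tolerates n minus quorum failures.

```go
package main

import "fmt"

// minimumTolerableQuorum returns the strict-majority quorum for a
// cluster of the given number of voting members.
func minimumTolerableQuorum(members int) int {
	return members/2 + 1
}

// faultTolerance returns how many members can fail before quorum is lost.
func faultTolerance(members int) int {
	return members - minimumTolerableQuorum(members)
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, minimumTolerableQuorum(n), faultTolerance(n))
	}
}
```

This is why 3-member clusters tolerate one failure and 5-member clusters tolerate two.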
a71acc6 force-pushed to b2c4a83
/retest-required
pkg/etcdcli/health.go (outdated)
	}
}

func MinimumTolerableQuorum(members int) int {
Can you please add a unit test for that?
pkg/etcdcli/health_test.go (outdated)
	t.Run(scenario.name, func(t *testing.T) {
		actual, err := MinimumTolerableQuorum(scenario.input)

		if scenario.expErr != nil {
You can simplify all your if conditions below with a simple:
require.Equal(t, scenario.expErr, err)
require.Equal(t, scenario.exp, actual)
Thanks, I never used this :D
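The suggestion above collapses the per-scenario if/else chains into two unconditional equality checks (testify's require.Equal compares expected and actual and fails the test on mismatch). A stdlib-only sketch of the same table-driven shape, with a hypothetical version of the function under test (the real one lives in pkg/etcdcli/health.go):

```go
package main

import (
	"errors"
	"fmt"
)

// minimumTolerableQuorum is a hypothetical version of the function
// under test: strict majority, with an error for non-positive input.
func minimumTolerableQuorum(members int) (int, error) {
	if members <= 0 {
		return 0, errors.New("number of members must be positive")
	}
	return members/2 + 1, nil
}

func main() {
	scenarios := []struct {
		name   string
		input  int
		exp    int
		expErr error
	}{
		{name: "three members", input: 3, exp: 2},
		{name: "five members", input: 5, exp: 3},
		{name: "zero members", input: 0, expErr: errors.New("number of members must be positive")},
	}
	for _, s := range scenarios {
		actual, err := minimumTolerableQuorum(s.input)
		// These two checks are what require.Equal replaces, one line
		// each; comparing error strings keeps the sketch dependency-free.
		if fmt.Sprint(err) != fmt.Sprint(s.expErr) {
			panic(s.name + ": unexpected error")
		}
		if actual != s.exp {
			panic(s.name + ": unexpected result")
		}
		fmt.Println(s.name, "ok")
	}
}
```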
bdd1393 force-pushed to 73258ae
-func VotingMemberIPListSet(configMapLister corev1listers.ConfigMapNamespaceLister) (sets.String, error) {
-	etcdEndpointsConfigMap, err := configMapLister.Get("etcd-endpoints")
+func VotingMemberIPListSet(ctx context.Context, cli etcdcli.EtcdClient) (sets.String, error) {
+	members, err := cli.VotingMemberList(ctx)
just replicate what the member controller does in:
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L132-L136
pkg/operator/ceohelpers/common.go (outdated)
-	for _, votingMemberIP := range etcdEndpointsConfigMap.Data {
-		currentVotingMemberIPListSet.Insert(votingMemberIP)
+	for _, member := range members {
+		currentVotingMemberIPListSet.Insert(member.PeerURLs[0])
@tjungblu this is the member's IP, right?
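For context on that question: each etcd member advertises full peer URLs such as "https://10.0.0.7:2380", so PeerURLs[0] is a URL, and the member's IP is its host portion. A small sketch (peerIP is a hypothetical helper, not necessarily what the operator uses):

```go
package main

import (
	"fmt"
	"net/url"
)

// peerIP extracts the host part (the member's IP) from a peer URL like
// "https://10.0.0.7:2380".
func peerIP(peerURL string) (string, error) {
	u, err := url.Parse(peerURL)
	if err != nil {
		return "", err
	}
	// Hostname strips the port (and brackets for IPv6 literals).
	return u.Hostname(), nil
}

func main() {
	ip, err := peerIP("https://10.0.0.7:2380")
	if err != nil {
		panic(err)
	}
	fmt.Println(ip) // 10.0.0.7
}
```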
73258ae force-pushed to 24bbc13
24bbc13 force-pushed to fc6a345
/retest-required
2 similar comments
/retest-required
/retest-required
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Elbehery, hasbro17, tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing /approve in a comment.
/retest-required
@Elbehery: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@Elbehery: cannot checkout
/cherry-pick release-4.12
@Elbehery: new pull request created: #960
This PR allows replacing an unhealthy etcd member machine in the cluster, adding the replacement as a learner. It relaxes the scale-down invariant to the minimum tolerable quorum threshold in order to maintain quorum.