Description
What would you like to be added (User Story)?
As an operator, I'd like to be able to reflect the workload cluster's Node status on the relevant Machine resources without triggering any remediation.
Detailed Description
We're looking for a way to gain more control over the rolling update process of MachineDeployments (MDs), and of KubeadmControlPlane (KCP) in the future. First of all, we want to reflect custom conditions set by NPD (Node Problem Detector) on the Machine's Ready condition, so that a rolling update pauses until all checks succeed. This will be possible out of the box with MachineHealthCheck once v1beta2 is released, but MHC is tightly coupled with the remediation feature, which we don't want to use.
In short, we want custom health checks from the workload cluster to influence its lifecycle management, which is controlled by CAPI.
Since the resource is named MachineHealthCheck, not MachineSelfHealing, I think it would be reasonable to allow opting out of remediation when desired. This could be implemented as a single optional boolean field, remediationDisabled, which would be backward compatible.
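To illustrate why an optional remediationDisabled field would be backward compatible, here is a minimal Go sketch under stated assumptions. MachineHealthCheckSpec below is a hypothetical, heavily simplified struct (not the real Cluster API type), and shouldRemediate and decodeSpec are hypothetical helpers. The only point shown is that an object omitting the field decodes to false and keeps remediating as today, while setting it to true would leave health checking and condition reporting in place but skip the remediation step.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MachineHealthCheckSpec is a hypothetical, heavily simplified stand-in for the
// real Cluster API type. Only the proposed field is of interest here.
type MachineHealthCheckSpec struct {
	ClusterName string `json:"clusterName"`
	// RemediationDisabled is the proposed (hypothetical) opt-out. Because it is
	// optional and its zero value is false, objects that omit it keep today's
	// behavior of remediating unhealthy Machines.
	RemediationDisabled bool `json:"remediationDisabled,omitempty"`
}

// shouldRemediate is a hypothetical helper. Health checking and condition
// reporting would still happen; only the remediation step is gated.
func shouldRemediate(spec MachineHealthCheckSpec, machineHealthy bool) bool {
	return !machineHealthy && !spec.RemediationDisabled
}

// decodeSpec parses a JSON-encoded spec, mimicking how an existing object
// that predates the new field would be read back.
func decodeSpec(raw string) MachineHealthCheckSpec {
	var spec MachineHealthCheckSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		panic(err)
	}
	return spec
}

func main() {
	// An existing object written before the field existed.
	legacy := decodeSpec(`{"clusterName":"demo"}`)
	// An object that opts out of remediation.
	optedOut := decodeSpec(`{"clusterName":"demo","remediationDisabled":true}`)

	// Both cases evaluate an unhealthy Machine.
	fmt.Println(shouldRemediate(legacy, false))   // true: unchanged behavior
	fmt.Println(shouldRemediate(optedOut, false)) // false: report only, no remediation
}
```

A pointer (*bool) would also work and would distinguish "unset" from "false"; a plain bool is used in this sketch because its default of false already preserves existing behavior.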
Anything else you would like to add?
Considered alternative solutions:
- Use the cluster.x-k8s.io/skip-remediation annotation on Machine resources: it's implicit and error-prone, since the annotation can easily be forgotten when creating a new MD.
- Set maxUnhealthy to 0: the field is deprecated and has no replacement yet (Deprecate MachineHealthCheck MaxUnhealthy and UnhealthyRange #10722).
- Implement a custom no-op remediation template: this looks more like a workaround than a solution.
- Implement a custom controller that does the same things as MHC, but without remediation: why not simply extend the existing feature instead of duplicating it?
Label(s) to be applied
/kind feature
/area machinehealthcheck