
OCPEDGE-1191: feat: initial arbiter cluster enhancement #1674

Merged

Conversation

eggfoobar
Contributor

@eggfoobar eggfoobar commented Sep 4, 2024

This enhancement describes a new control plane deployment type for OpenShift: a deployment with 2 control plane nodes and 1 arbiter node that runs only the critical components needed for HA.


For the component team leads, please check the box if you lgtm the changes in this enhancement.

Signed-off-by: ehila <ehila@redhat.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 4, 2024
@openshift-ci-robot

openshift-ci-robot commented Sep 4, 2024

@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2024
Contributor

openshift-ci bot commented Sep 4, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

@brandisher brandisher left a comment

Questions from my side:

  1. What is the concrete list of control plane components that will run on the arbiter node?
  2. What non-control plane or control-plane supporting components need to exist on the arbiter node? (e.g. cluster observability)
  3. How do we let cluster admins (or other RH operators) opt-in to adding components to the arbiter? I'm thinking in cases where a customer has a monitoring workload that they need to run on the control plane, or we have operators with agents that must run on the control plane like ACM or GitOps.
  4. Are there any considerations needed for components that utilize the infrastructureTopology field to determine their replicas? (I believe this applies to observability components like Prometheus et al.; a sketch of that pattern follows below.)
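
For context on question 4: the usual pattern is for such components to read `Infrastructure.status.infrastructureTopology` and drop to a single replica when it reports `SingleReplica`. A minimal Go sketch of that pattern, where the types are an illustrative mirror of the openshift/api definitions and the replica counts are made up for the example:

```go
package main

import "fmt"

// TopologyMode is an illustrative mirror of the openshift/api config/v1 type;
// the authoritative definition lives in github.com/openshift/api.
type TopologyMode string

const (
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
	SingleReplicaTopologyMode   TopologyMode = "SingleReplica"
)

// replicasFor shows the pattern question 4 refers to: an observability
// component picks its replica count from infrastructureTopology.
func replicasFor(infrastructureTopology TopologyMode) int32 {
	if infrastructureTopology == SingleReplicaTopologyMode {
		return 1
	}
	return 2 // example default for highly available worker topologies
}

func main() {
	fmt.Println(replicasFor(HighlyAvailableTopologyMode)) // 2
	fmt.Println(replicasFor(SingleReplicaTopologyMode))   // 1
}
```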

Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar eggfoobar force-pushed the arbiter-cluster-enhancement branch from bc94bef to 368a00d on October 1, 2024 06:29
Signed-off-by: ehila <ehila@redhat.com>
updated CEO changes to include idea for wrapping informer and lister to minimize CEO code changes
updated MCO changes to include bootstrap files for arbiter node
added installer changes needed to include new machine types and install config changes

Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar
Contributor Author

This is in good enough shape to move out of draft; I think we captured most of the concerns here.

@eggfoobar eggfoobar marked this pull request as ready for review October 16, 2024 06:20
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 16, 2024
@openshift-ci openshift-ci bot requested review from jaypoulz and rphillips October 16, 2024 06:20
added appropriate reviewers for the component changes
added more detail around the specific component changes

Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar eggfoobar force-pushed the arbiter-cluster-enhancement branch from 3e2bf61 to 643fe55 on October 16, 2024 06:27
Signed-off-by: ehila <ehila@redhat.com>
@rphillips
Contributor

rphillips commented Oct 18, 2024

Was Chad's comment addressed? What is running on the arbiter node?

Something has to be compromised on the arbiter - cpu, disk, memory? What is compromised on the arbiter? Nodes have classically had performance and runtime issues with slow disks or not enough cpu or memory. Is there a definition of what specs the arbiter has?

F1 F2 A1 (full-node-1 full-node-2 arbiter-node-1) ...

Additionally, based on the arbiter's specs: if F1, F2, and A1 are nodes and F2 goes down for a reboot, then F1 will likely be taking the brunt of the API+DB load of the cluster. Good load balancers route traffic to fast-responding API backends. Usually when a highly available node goes down, there is enough capacity on the other nodes to take that load. If one node is under-specced (A1) and F1 has to take the load, then it stands to reason that the specs for F1 need to be higher.

I'm a bit confused on how we would create a supported cluster in this situation. etcd needs fast disks, so we can't compromise on disk speed on A1. If load shifts from F2 to F1, then F1 should have the capacity of F1 + F2.

Optimizing and trimming down OpenShift components might be doable. Instead of creating an arbiter, could we still require 3 nodes but run fewer things?

@DanielFroehlich

Right, the arbiter is like a 3rd node, but only etcd components and a bit of other absolutely essential stuff (e.g. networking components) run on the arbiter. No components like the API server, Console, Monitoring, etc. Exact details need to be defined, though.

F1 and F2 currently, per OCP docs, have a min sys req of 4C/16G. That's just for the control plane. The requirements of the workload need to be added and are not considered here.
For A1, we currently discuss 2C/8G as the min sys req, but that is yet to be decided and tested.

Etcd and the control plane can work with an effective end-to-end latency (which includes disk) of 500ms using the slow profile, so no concerns here; see [here](https://docs.google.com/presentation/d/1eMz3ON5y1Rk2QI2zH-QUGh5oKoMyOHZrCSSZ8aHQQPQ/edit#slide=id.g26e42bb9faa_0_251). Yes, a slow arbiter node will slow down overall API performance.

During an outage of F2, not all of F2's load shifts to F1; actually, next to none of it does. Take the console as an example: it is replica 2 with a required podAntiAffinity on the hostname, meaning the pod running on F2 will not start on F1.
Scaled control plane components are scaled mostly for availability reasons, not scalability.

@eggfoobar and @jerpeter1 keep me honest here please.
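
To make the console example concrete, the constraint described above looks roughly like the following. This is an illustrative sketch only, not the console operator's actual manifest, and the `app: console` selector is assumed for the example:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Required anti-affinity on kubernetes.io/hostname: two replicas must land
	// on two distinct nodes. If one node goes down, the displaced replica stays
	// Pending instead of doubling up on the remaining node.
	antiAffinity := &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					TopologyKey: "kubernetes.io/hostname",
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "console"}, // illustrative selector
					},
				},
			},
		},
	}
	fmt.Printf("%+v\n", antiAffinity)
}
```

Because the rule is required rather than preferred, the second replica simply stays Pending during an F2 outage, which is why little of the console's load shifts to F1.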

@rphillips
Contributor

rphillips commented Oct 18, 2024

@DanielFroehlich All Kube API load from F2 will transition to F1, since F1 will be the only machine with a kube-apiserver running.

How many workers will we support? Kubelets and their API calls+watches are usually load balanced between 2-3 masters. This prevents thundering herd issues on the API servers. With only 1 API server, that will potentially limit the number of workers a customer can add to the cluster.

| [Installer](#installer-changes) | Update install config API to Support arbiter machine types |
| [MCO](#mco-changes) | Update validation hook to support arbiter role and add bootstrapping configurations needed for arbiter machines |
| [Kubernetes](#kubernetes-change) | Update allowed `well_known_openshift_labels.go` to include new node role as an upstream carry |
| [ETCD Operator](#etcd-operator-change) | Update operator to deploy operands on both `master` and `arbiter` node roles |
Contributor

If the arbiter must not be the leader, then that's an additional feature that needs to be listed and designed.

Contributor Author

After some discussion on what we want to happen here, we're okay with the arbiter node's etcd member becoming leader. We're under the impression that etcd will only be as fast as its slowest member, so this might be fine for now. I'll document this as an open question and plan to revisit it as its own potential EP, wdyt?

@tjungblu Any concerns with this assumption on my part?


There is no functionality that can prevent it from becoming a leader. I'm mostly concerned about the performance impact reported here: etcd-io/etcd#14501

There are some upstream k8s PRs in the works that enable gRPC "circuit breakers", so this will get resolved at some point in the future.

@jerpeter1
Member

@zaneb - do you have any more feedback, or is this lgtm?

Member

@zaneb zaneb left a comment

lgtm
/cc @rwsu

@openshift-ci openshift-ci bot requested a review from rwsu December 6, 2024 04:33
added expectations for assisted installer and agent based installer
fix wording and spelling

Signed-off-by: ehila <ehila@redhat.com>
@openshift-ci-robot

openshift-ci-robot commented Dec 6, 2024

@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.


Comment on lines +440 to +448
### OLM Filter Addition

With this change it would be prudent to add the ability to allow operators to
specify which control plane topology they support. This gives us more guards
against installing layered components on unsupported or unconsidered topologies
like the Master+Arbiter in this enhancement.

This is a nice to have before we go GA and we should be looking into adding it
to OLM.
Member

+1. The OLM team has some thoughts related to this along the lines of generic cluster-provided properties and layered-product provided constraints. This seems to be one use case for that feature. We'll have to think carefully about implementing this in a more general "properties vs. constraints" kind of design.

I don't know that I would make your GA contingent on this feature though since OCP already has multiple topology variants and OLM does not have this feature yet.

One thing we could consider in the meantime is a hint annotation in the CSV that the OCP console could be made aware of. OCP console already supports some of these annotations, and uses them to filter views presented to users.
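
To illustrate the hint-annotation idea, here is a small Go sketch. The annotation key and the topology strings below are hypothetical placeholders for this example, not an agreed-upon OLM or console API:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical CSV annotation key used only for this sketch; a real key would
// be defined by OLM/console, not by this comment thread.
const supportedTopologiesAnnotation = "operators.openshift.io/supported-control-plane-topologies"

// operatorSupportsTopology compares the topologies a CSV claims to support
// against the cluster's Infrastructure.status.controlPlaneTopology value.
func operatorSupportsTopology(csvAnnotations map[string]string, clusterTopology string) bool {
	raw, ok := csvAnnotations[supportedTopologiesAnnotation]
	if !ok {
		// No hint present: keep today's behavior and assume the operator is installable.
		return true
	}
	for _, t := range strings.Split(raw, ",") {
		if strings.TrimSpace(t) == clusterTopology {
			return true
		}
	}
	return false
}

func main() {
	csv := map[string]string{
		// Example values; "HighlyAvailableArbiter" is a placeholder for whatever
		// name the arbiter topology ends up with.
		supportedTopologiesAnnotation: "HighlyAvailable,HighlyAvailableArbiter",
	}
	fmt.Println(operatorSupportsTopology(csv, "HighlyAvailableArbiter")) // true
	fmt.Println(operatorSupportsTopology(csv, "SingleReplica"))          // false
}
```

Whether such a check would live in the console, in OLM, or in both is exactly the "properties vs. constraints" design question raised above.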

Contributor Author

Awesome, I have it as a nice-to-have for GA, so we won't let it be a critical part of our checklist. Thanks for your input, Joe!

@openshift-ci-robot

openshift-ci-robot commented Dec 9, 2024

@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue.


@jerpeter1
Member

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 9, 2024
Contributor

openshift-ci bot commented Dec 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerpeter1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 9, 2024

Contributor

openshift-ci bot commented Dec 9, 2024

@eggfoobar: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 3a53309 into openshift:master Dec 9, 2024
2 checks passed
validate some of the desires in this proposal. In its current version this
proposal is not exhaustive but will be filled out as we implement these goals.

We currently expect this feature to mainly be used by `baremetal` installs, or
Contributor

Do you plan to prevent it from being used on other platforms? I'm concerned that this breaks many assumptions that CPMS makes about the control plane, and that it will be strictly incompatible. IIRC baremetal doesn't use CPMS at the moment, so perhaps that issue can be avoided for now

using cloud installs, since we can simply alter the manifests in the pipeline
without needing to change the installer.

In the current POC, a cluster was created on AWS with 2 masters running
Contributor

Ok, so what was CPMS doing in this cluster? I can't see any discussion of CPMS but we need to think about how we will handle this breaking its assumptions of having three control plane nodes

Contributor

I can answer this, since I've tested this as part of 2NO. It spams the logs about the configuration being invalid, but doesn't take any action.

Contributor

You didn't happen to have anything recorded that captures those logs, do you?

Contributor

No, but I could probably get you something like that this afternoon :)

Contributor

If you happen to be testing, and could grab the CPMS logs, I'd like to see exactly what it's complaining about to understand if it's in a safe broken state

Contributor

Ahh yes, well that makes sense. You'd have to make sure the installer isn't trying to generate a CPMS with the incorrect number of nodes. Which for 2NO means no CPMS. For arbiter, not sure how that will work

Contributor

Egli showed me how to disable CPMS in the manifests, which I think will cover our use case for 2NO.
@eggfoobar will probably have some thoughts on how to handle this properly for arbiter.

Contributor

CPMS also has a generator that will try to create a CPMS manifest should the cluster not already have one (which serves upgrades). We probably want a check in there for the topology, and have it disable itself when the topology is not one that it supports 🤔 How is 2NO signalled in the ControlPlaneTopology?

Contributor

The plan is to add an enum for DualReplica, following the pattern of SingleReplica.
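
For reference, the existing `TopologyMode` values in openshift/api follow this shape, and the generator guard discussed above could key off the new value. The `DualReplica` constant below is a sketch of the stated plan rather than the final API, and the generator check is illustrative only:

```go
package main

import "fmt"

// TopologyMode is an illustrative mirror of the openshift/api config/v1 type;
// the authoritative definition lives in github.com/openshift/api.
type TopologyMode string

const (
	// Existing values, paraphrased from openshift/api config/v1.
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
	SingleReplicaTopologyMode   TopologyMode = "SingleReplica"
	ExternalTopologyMode        TopologyMode = "External"

	// Sketch of the planned addition for 2NO; the final name and string are
	// decided in openshift/api, not in this thread.
	DualReplicaTopologyMode TopologyMode = "DualReplica"
)

// shouldGenerateCPMS sketches the generator guard discussed above: only
// generate a ControlPlaneMachineSet for a conventional three-node HA control plane.
func shouldGenerateCPMS(controlPlaneTopology TopologyMode) bool {
	return controlPlaneTopology == HighlyAvailableTopologyMode
}

func main() {
	fmt.Println(shouldGenerateCPMS(HighlyAvailableTopologyMode)) // true
	fmt.Println(shouldGenerateCPMS(DualReplicaTopologyMode))     // false
}
```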

Member

There are no CPMSs for platform:baremetal afaik

to run or remove from this list. The list is small already so there might not be
a desire to pre-optimize just yet.

| Component | Component Outlook on Arbiter |
Contributor

How does this list compare to a regular control plane machine? What's not running on the arbiter, that would be, were it a normal control plane machine?

@@ -137,6 +137,10 @@ topology migrations.

### Installer Changes

> Note: When we say `MachinePool` in the below statements, we are referring to
> the installer config struct used for generating resources and NOT the Machine
> Config Operator CR.
Member

nit: it's the Machine API not the MCO that has MachinePool CRs.
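
To make the MachinePool note above concrete, here is a rough sketch of how an arbiter pool could sit next to the control plane pool in the install-config types. The structs are heavily trimmed mirrors of the installer's `pkg/types`, and the `Arbiter` field name is an assumption for illustration, not the final installer API:

```go
package main

import "fmt"

// MachinePool and InstallConfig are trimmed, illustrative mirrors of the
// installer's install-config structs in github.com/openshift/installer/pkg/types.
type MachinePool struct {
	Name     string
	Replicas *int64
}

type InstallConfig struct {
	ControlPlane *MachinePool
	// Arbiter is a hypothetical additional pool for the arbiter machine type
	// described in this enhancement; the real field name and validation rules
	// are defined by the installer change itself.
	Arbiter *MachinePool
	Compute []MachinePool
}

func ptr(i int64) *int64 { return &i }

func main() {
	cfg := InstallConfig{
		ControlPlane: &MachinePool{Name: "master", Replicas: ptr(2)},
		Arbiter:      &MachinePool{Name: "arbiter", Replicas: ptr(1)},
		Compute:      []MachinePool{{Name: "worker", Replicas: ptr(0)}},
	}
	fmt.Printf("masters=%d arbiters=%d\n", *cfg.ControlPlane.Replicas, *cfg.Arbiter.Replicas)
}
```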


service will contain different initial
[node-roles](https://github.com/openshift/machine-config-operator/blob/7c5ae75515fe373e3eecee711c18b07d01b5e759/templates/master/01-master-kubelet/_base/units/kubelet.service.yaml#L31)
and
[taints](https://github.com/openshift/machine-config-operator/blob/7c5ae75515fe373e3eecee711c18b07d01b5e759/templates/master/01-master-kubelet/_base/units/kubelet.service.yaml#L43).
Member

@eggfoobar did we consider reusing the existing control-plane role tag and only adding an extra taint (which only the etcd components would tolerate) via the MachineSet? It seems like that would vastly reduce the complexity we'd have to maintain.

Contributor Author

Hey @zaneb, sorry for the late reply, didn’t see the notification for this until today.

That's a great question. In short, we hadn't tried a different taint, but we did originally try master:NoSchedule; that worked for the most part, and we did consider using a different taint instead, but a couple of design considerations pulled us away from it.

  • We needed a quasi control-plane node that supports HA but should be smaller and treated differently in terms of what runs on it and how we configure it. This node will definitely be smaller with different characteristics and in the future this might even be an ARM node or in a different location.
  • This has the high potential to be disruptive to components and the platform so we need to be explicit about the state of the cluster and allow component owners the signaling to decide how their component handles this deployment type.

Due to the uniqueness of the control plane, being a master node with a different taint poses different challenges. A big one is that things become implicit and a bit ambiguous. We currently have a few operators that tolerate all taints on control plane nodes, etcd being one of them. Updating those components would be easy enough so we can exclude them from the arbiter node, but they would still need to check some authoritative flag that we are indeed in an arbiter deployment for replica modifications, so in that case I think we keep the ControlPlane API change either way. You can probably get away with having taints be your enforcement for what runs on the arbiter, but with the API change we're just taking a different road to the same place. The ambiguity comes from the customer's side: we run the risk of components installed later either running on the arbiter when they shouldn't, because they tolerate all taints on masters, or not running at all. The cluster will say it has 3 control plane nodes, so I can see where that might be confusing, and the customer might just remove the taint after googling for solutions. The upside here is that we make fewer installer changes, but we have to make sure the authoritative controlPlane API flag is set.

However, once we started to think about the configuration aspect, having a machine that inherits from master is something we don't support due to stability concerns, plus there is a high probability that we would want to alter the same fields as the master node configs. So the more we thought about it, the more we came back to the intent: we needed a quasi control plane node, and the correct way to represent that seemed to be an explicit node role that was clear in its purpose.

So the gamble is that we either pay a little complexity upfront, or we pay it later when customers already have workloads running and tough decisions would mean tough conversations. If I recall correctly, those were the main ideas running around my head a few months back when we were experimenting with different implementations.

Contributor

The other observation I would make is that since I've been part of OpenShift, there has been an assumption that control plane nodes are expected to be homogeneous. We softly enforce this for cloud-based IPI installs by allowing a single instance type to be provisioned for all control-plane nodes.

I know we would support breaking this assumption for UPI clusters, and it might also be possible to have mismatched control plane nodes in a baremetal IPI install, but I'm not aware of any testing that verifies that mismatched control plane sizes behave as expected.

Keeping this as a separate node role helps us make this boundary more explicit. This node is expected to be different, and will thus also be tested as such.
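
For comparison, the taint-based alternative discussed in this thread would look roughly like the following: the arbiter machine registers with a dedicated taint, and only arbiter-eligible workloads (etcd and a few essentials) carry a matching toleration. The taint key here is a hypothetical example; the enhancement instead opts for an explicit arbiter node role plus the ControlPlane topology API signal:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical taint an arbiter machine could register with under the
	// taint-based alternative (not the approach the enhancement settled on).
	arbiterTaint := corev1.Taint{
		Key:    "node-role.kubernetes.io/arbiter",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// Only pods that explicitly tolerate the taint (e.g. etcd) would be able
	// to schedule onto the arbiter; everything else would be repelled. The
	// ambiguity called out above comes from components that already tolerate
	// all control-plane taints and would land on the arbiter anyway.
	etcdToleration := corev1.Toleration{
		Key:      arbiterTaint.Key,
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	}

	fmt.Printf("taint=%s toleration=%+v\n", arbiterTaint.Key, etcdToleration)
}
```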

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.