
OCPEDGE-1191: feat: initial arbiter cluster enhancement #1674

Merged

Conversation

eggfoobar
Contributor

@eggfoobar eggfoobar commented Sep 4, 2024

This enhancement describes a new control plane deployment type for OpenShift: a deployment with 2 control plane nodes and 1 arbiter node that runs only the critical components needed for HA.


For the component team leads, please check the box if you lgtm the changes in this enhancement.

Signed-off-by: ehila <ehila@redhat.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 4, 2024
@openshift-ci-robot

openshift-ci-robot commented Sep 4, 2024

@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2024
Contributor

openshift-ci bot commented Sep 4, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

@brandisher brandisher left a comment

Questions from my side:

  1. What is the concrete list of control plane components that will run on the arbiter node?
  2. What non-control plane or control-plane supporting components need to exist on the arbiter node? (e.g. cluster observability)
  3. How do we let cluster admins (or other RH operators) opt-in to adding components to the arbiter? I'm thinking in cases where a customer has a monitoring workload that they need to run on the control plane, or we have operators with agents that must run on the control plane like ACM or GitOps.
  4. Are there any considerations needed for components that utilize the infrastructureTopology field to determine their replicas? (I believe this applies to observability components like Prometheus et al.; a sketch of that pattern follows below.)
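
For context on question 4: the usual pattern is for such components to read `Infrastructure.status.infrastructureTopology` and drop to a single replica when it reports `SingleReplica`. A minimal Go sketch of that pattern, where the types are an illustrative mirror of the openshift/api definitions and the replica counts are made up for the example:

```go
package main

import "fmt"

// TopologyMode is an illustrative mirror of the openshift/api config/v1 type;
// the authoritative definition lives in github.com/openshift/api.
type TopologyMode string

const (
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
	SingleReplicaTopologyMode   TopologyMode = "SingleReplica"
)

// replicasFor shows the pattern question 4 refers to: an observability
// component picks its replica count from infrastructureTopology.
func replicasFor(infrastructureTopology TopologyMode) int32 {
	if infrastructureTopology == SingleReplicaTopologyMode {
		return 1
	}
	return 2 // example default for highly available worker topologies
}

func main() {
	fmt.Println(replicasFor(HighlyAvailableTopologyMode)) // 2
	fmt.Println(replicasFor(SingleReplicaTopologyMode))   // 1
}
```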

Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar eggfoobar force-pushed the arbiter-cluster-enhancement branch from bc94bef to 368a00d on October 1, 2024 06:29
Signed-off-by: ehila <ehila@redhat.com>
updated CEO changes to include idea for wrapping informer and lister to minimize CEO code changes
updated MCO changes to include bootstrap files for arbiter node
added installer changes needed to include new machine types and install config changes

Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar
Contributor Author

This is in good enough shape to move out of draft; I think we captured most of the concerns here.

@eggfoobar eggfoobar marked this pull request as ready for review October 16, 2024 06:20
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 16, 2024
@openshift-ci openshift-ci bot requested review from jaypoulz and rphillips October 16, 2024 06:20
added appropriate reviewers for the component changes
added more detail around the specific component changes

Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar eggfoobar force-pushed the arbiter-cluster-enhancement branch from 3e2bf61 to 643fe55 on October 16, 2024 06:27
Signed-off-by: ehila <ehila@redhat.com>
@rphillips
Contributor

rphillips commented Oct 18, 2024

Was Chad's comment addressed? What is running on the arbiter node?

Something has to be compromised on the arbiter - cpu, disk, memory? What is compromised on the arbiter? Nodes have classically had performance and runtime issues with slow disks or not enough cpu or memory. Is there a definition of what specs the arbiter has?

F1 F2 A1 (full-node-1 full-node-2 arbiter-node-1) ...

Additionally, based on the arbiter's specs: if F1, F2, and A1 are nodes and F2 goes down for a reboot, then F1 will likely be taking the brunt of the API+DB load of the cluster. Good load balancers route traffic to fast-responding API backends. Usually when a highly available node goes down, there is enough capacity on the other nodes to take that load. If one node is under-specced (A1) and F1 has to take the load, then it stands to reason that the specs for F1 need to be higher.

I'm a bit confused on how we would create a supported cluster in this situation. etcd needs fast disks, so we can't compromise on disk speed on A1. If load shifts from F2 to F1, then F1 should have the capacity of F1 + F2.

Optimizing and trimming down OpenShift components might be doable. Instead of creating an arbiter, could we still require 3 nodes but run fewer things?

@DanielFroehlich

Right, the arbiter is like a 3rd node, but only etcd components and a bit of other absolutely essential stuff (e.g. networking components) run on the arbiter. No components like the API server, Console, Monitoring, etc. Exact details need to be defined, though.

F1 and F2 currently, per OCP docs, have a min sys req of 4C/16G. That's just for the control plane. The requirements of the workload need to be added and are not considered here.
For A1, we currently discuss 2C/8G as the min sys req, but that is yet to be decided and tested.

Etcd and the control plane can work with an effective end-to-end latency (which includes disk) of 500ms using the slow profile, so no concerns here; see [here](https://docs.google.com/presentation/d/1eMz3ON5y1Rk2QI2zH-QUGh5oKoMyOHZrCSSZ8aHQQPQ/edit#slide=id.g26e42bb9faa_0_251). Yes, a slow arbiter node will slow down overall API performance.

During an outage of F2, not all of F2's load shifts to F1; actually, next to none of it does. Take the console as an example: it is replica 2 with a required podAntiAffinity on the hostname, meaning the pod running on F2 will not start on F1.
Scaled control plane components are scaled mostly for availability reasons, not scalability.

@eggfoobar and @jerpeter1 keep me honest here please.
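
To make the console example concrete, the constraint described above looks roughly like the following. This is an illustrative sketch only, not the console operator's actual manifest, and the `app: console` selector is assumed for the example:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Required anti-affinity on kubernetes.io/hostname: two replicas must land
	// on two distinct nodes. If one node goes down, the displaced replica stays
	// Pending instead of doubling up on the remaining node.
	antiAffinity := &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					TopologyKey: "kubernetes.io/hostname",
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "console"}, // illustrative selector
					},
				},
			},
		},
	}
	fmt.Printf("%+v\n", antiAffinity)
}
```

Because the rule is required rather than preferred, the second replica simply stays Pending during an F2 outage, which is why little of the console's load shifts to F1.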

@rphillips
Contributor

rphillips commented Oct 18, 2024

@DanielFroehlich All Kube API load from F2 will transition to F1, since F1 will be the only machine with a kube-apiserver running.

How many workers will we support? Kubelets and their API calls+watches are usually load balanced between 2-3 masters. This prevents thundering herd issues on the API servers. With only 1 API server, that will potentially limit the number of workers a customer can add to the cluster.

| [Installer](#installer-changes) | Update install config API to Support arbiter machine types |
| [MCO](#mco-changes) | Update validation hook to support arbiter role and add bootstrapping configurations needed for arbiter machines |
| [Kubernetes](#kubernetes-change) | Update allowed `well_known_openshift_labels.go` to include new node role as an upstream carry |
| [ETCD Operator](#etcd-operator-change) | Update operator to deploy operands on both `master` and `arbiter` node roles |
Contributor

If the arbiter must not be the leader, then that's an additional feature that needs to be listed and designed.

Contributor Author

After some discussion on what we want to happen here, we're okay with the arbiter node's etcd member becoming leader. We're under the impression that etcd will only be as fast as its slowest member, so this might be fine for now. I'll document this as an open question and plan to revisit it as its own potential EP, wdyt?

@tjungblu Any concerns with this assumption on my part?


There is no functionality that can prevent it from becoming a leader. I'm mostly concerned about the performance impact reported here: etcd-io/etcd#14501

There are some upstream k8s PRs in the works that enable gRPC "circuit breakers", so this will get resolved at some point in the future.

@jerpeter1
Member

@zaneb - do you have any more feedback, or is this lgtm?

Member

@zaneb zaneb left a comment

lgtm
/cc @rwsu

@openshift-ci openshift-ci bot requested a review from rwsu December 6, 2024 04:33
added expectations for assisted installer and agent based installer
fix wording and spelling

Signed-off-by: ehila <ehila@redhat.com>
@openshift-ci-robot

openshift-ci-robot commented Dec 6, 2024

@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.


Comment on lines +440 to +448
### OLM Filter Addition

With this change it would be prudent to add the ability to allow operators to
specify which control plane topology they support. This gives us more guards
against installing layered components on unsupported or unconsidered topologies
like the Master+Arbiter in this enhancement.

This is a nice to have before we go GA and we should be looking into adding it
to OLM.
Member

+1. The OLM team has some thoughts related to this along the lines of generic cluster-provided properties and layered-product provided constraints. This seems to be one use case for that feature. We'll have to think carefully about implementing this in a more general "properties vs. constraints" kind of design.

I don't know that I would make your GA contingent on this feature though since OCP already has multiple topology variants and OLM does not have this feature yet.

One thing we could consider in the meantime is a hint annotation in the CSV that the OCP console could be made aware of. OCP console already supports some of these annotations, and uses them to filter views presented to users.
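
To illustrate the hint-annotation idea, here is a small Go sketch. The annotation key and the topology strings below are hypothetical placeholders for this example, not an agreed-upon OLM or console API:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical CSV annotation key used only for this sketch; a real key would
// be defined by OLM/console, not by this comment thread.
const supportedTopologiesAnnotation = "operators.openshift.io/supported-control-plane-topologies"

// operatorSupportsTopology compares the topologies a CSV claims to support
// against the cluster's Infrastructure.status.controlPlaneTopology value.
func operatorSupportsTopology(csvAnnotations map[string]string, clusterTopology string) bool {
	raw, ok := csvAnnotations[supportedTopologiesAnnotation]
	if !ok {
		// No hint present: keep today's behavior and assume the operator is installable.
		return true
	}
	for _, t := range strings.Split(raw, ",") {
		if strings.TrimSpace(t) == clusterTopology {
			return true
		}
	}
	return false
}

func main() {
	csv := map[string]string{
		// Example values; "HighlyAvailableArbiter" is a placeholder for whatever
		// name the arbiter topology ends up with.
		supportedTopologiesAnnotation: "HighlyAvailable,HighlyAvailableArbiter",
	}
	fmt.Println(operatorSupportsTopology(csv, "HighlyAvailableArbiter")) // true
	fmt.Println(operatorSupportsTopology(csv, "SingleReplica"))          // false
}
```

Whether such a check would live in the console, in OLM, or in both is exactly the "properties vs. constraints" design question raised above.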

Contributor Author

Awesome, I have it as a nice-to-have for GA, so we won't let it be a critical part of our checklist. Thanks for your input, Joe!

@openshift-ci-robot

openshift-ci-robot commented Dec 9, 2024

@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue.


@jerpeter1
Member

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 9, 2024
Contributor

openshift-ci bot commented Dec 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerpeter1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 9, 2024

Contributor

openshift-ci bot commented Dec 9, 2024

@eggfoobar: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 3a53309 into openshift:master Dec 9, 2024
2 checks passed
validate some of the desires in this proposal. In its current version this
proposal is not exhaustive but will be filled out as we implement these goals.

We currently expect this feature to mainly be used by `baremetal` installs, or
Contributor

Do you plan to prevent it from being used on other platforms? I'm concerned that this breaks many assumptions that CPMS makes about the control plane, and that it will be strictly incompatible. IIRC baremetal doesn't use CPMS at the moment, so perhaps that issue can be avoided for now

using cloud installs, since we can simply alter the manifests in the pipeline
without needing to change the installer.

In the current POC, a cluster was created on AWS with 2 masters running
Contributor

Ok, so what was CPMS doing in this cluster? I can't see any discussion of CPMS but we need to think about how we will handle this breaking its assumptions of having three control plane nodes

Contributor

I can answer this, since I've tested this as part of 2NO. It spams the logs about the configuration being invalid, but doesn't take any action.

Contributor

You didn't happen to have anything recorded that captures those logs, do you?

Contributor

No, but I could probably get you something like that this afternoon :)

Contributor

If you happen to be testing, and could grab the CPMS logs, I'd like to see exactly what it's complaining about to understand if it's in a safe broken state

Contributor

Ahh yes, well that makes sense. You'd have to make sure the installer isn't trying to generate a CPMS with the incorrect number of nodes. Which for 2NO means no CPMS. For arbiter, not sure how that will work

Contributor

Egli showed me how to disable CPMS in the manifests, which I think will cover our use case for 2NO.
@eggfoobar will probably have some thoughts on how to handle this properly for arbiter.

Contributor

CPMS also has a generator that will try to create a CPMS manifest should the cluster not already have one (which serves upgrades). We probably want a check in there for the topology, and have it disable itself when the topology is not one that it supports 🤔 How is 2NO signalled in the ControlPlaneTopology?

Contributor

The plan is to add an enum for DualReplica, following the pattern of SingleReplica.
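
For reference, the existing `TopologyMode` values in openshift/api follow this shape, and the generator guard discussed above could key off the new value. The `DualReplica` constant below is a sketch of the stated plan rather than the final API, and the generator check is illustrative only:

```go
package main

import "fmt"

// TopologyMode is an illustrative mirror of the openshift/api config/v1 type;
// the authoritative definition lives in github.com/openshift/api.
type TopologyMode string

const (
	// Existing values, paraphrased from openshift/api config/v1.
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
	SingleReplicaTopologyMode   TopologyMode = "SingleReplica"
	ExternalTopologyMode        TopologyMode = "External"

	// Sketch of the planned addition for 2NO; the final name and string are
	// decided in openshift/api, not in this thread.
	DualReplicaTopologyMode TopologyMode = "DualReplica"
)

// shouldGenerateCPMS sketches the generator guard discussed above: only
// generate a ControlPlaneMachineSet for a conventional three-node HA control plane.
func shouldGenerateCPMS(controlPlaneTopology TopologyMode) bool {
	return controlPlaneTopology == HighlyAvailableTopologyMode
}

func main() {
	fmt.Println(shouldGenerateCPMS(HighlyAvailableTopologyMode)) // true
	fmt.Println(shouldGenerateCPMS(DualReplicaTopologyMode))     // false
}
```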

Member

There are no CPMSs for platform:baremetal afaik

to run or remove from this list. The list is small already so there might not be
a desire to pre-optimize just yet.

| Component | Component Outlook on Arbiter |
Contributor

How does this list compare to a regular control plane machine? What's not running on the arbiter, that would be, were it a normal control plane machine?

@@ -137,6 +137,10 @@ topology migrations.

### Installer Changes

> Note: When we say `MachinePool` in the below statements, we are referring to
> the installer config struct used for generating resources and NOT the Machine
> Config Operator CR.
Member

nit: it's the Machine API not the MCO that has MachinePool CRs.
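
To make the MachinePool note above concrete, here is a rough sketch of how an arbiter pool could sit next to the control plane pool in the install-config types. The structs are heavily trimmed mirrors of the installer's `pkg/types`, and the `Arbiter` field name is an assumption for illustration, not the final installer API:

```go
package main

import "fmt"

// MachinePool and InstallConfig are trimmed, illustrative mirrors of the
// installer's install-config structs in github.com/openshift/installer/pkg/types.
type MachinePool struct {
	Name     string
	Replicas *int64
}

type InstallConfig struct {
	ControlPlane *MachinePool
	// Arbiter is a hypothetical additional pool for the arbiter machine type
	// described in this enhancement; the real field name and validation rules
	// are defined by the installer change itself.
	Arbiter *MachinePool
	Compute []MachinePool
}

func ptr(i int64) *int64 { return &i }

func main() {
	cfg := InstallConfig{
		ControlPlane: &MachinePool{Name: "master", Replicas: ptr(2)},
		Arbiter:      &MachinePool{Name: "arbiter", Replicas: ptr(1)},
		Compute:      []MachinePool{{Name: "worker", Replicas: ptr(0)}},
	}
	fmt.Printf("masters=%d arbiters=%d\n", *cfg.ControlPlane.Replicas, *cfg.Arbiter.Replicas)
}
```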


service will contain different initial
[node-roles](https://github.com/openshift/machine-config-operator/blob/7c5ae75515fe373e3eecee711c18b07d01b5e759/templates/master/01-master-kubelet/_base/units/kubelet.service.yaml#L31)
and
[taints](https://github.com/openshift/machine-config-operator/blob/7c5ae75515fe373e3eecee711c18b07d01b5e759/templates/master/01-master-kubelet/_base/units/kubelet.service.yaml#L43).
Member

@eggfoobar did we consider reusing the existing control-plane role tag and only adding an extra taint (which only the etcd components would tolerate) via the MachineSet? It seems like that would vastly reduce the complexity we'd have to maintain.

Contributor Author

Hey @zaneb, sorry for the late reply, didn’t see the notification for this until today.

That's a great question. In short, we hadn't tried a different taint, but we did originally try master:NoSchedule; that worked for the most part, and we did consider using a different taint instead, but a couple of design considerations pulled us away from it.

  • We needed a quasi control-plane node that supports HA but should be smaller and treated differently in terms of what runs on it and how we configure it. This node will definitely be smaller with different characteristics and in the future this might even be an ARM node or in a different location.
  • This has the high potential to be disruptive to components and the platform so we need to be explicit about the state of the cluster and allow component owners the signaling to decide how their component handles this deployment type.

Due to the uniqueness of the control plane, being a master node with a different taint poses different challenges. A big one is that things become implicit and a bit ambiguous. We currently have a few operators that tolerate all taints on control plane nodes, etcd being one of them. Updating those components would be easy enough so we can exclude them from the arbiter node, but they would still need to check some authoritative flag that we are indeed in an arbiter deployment for replica modifications, so in that case I think we keep the ControlPlane API change either way. You can probably get away with having taints be your enforcement for what runs on the arbiter, but with the API change we're just taking a different road to the same place. The ambiguity comes from the customer's side: we run the risk of components installed later either running on the arbiter when they shouldn't, because they tolerate all taints on masters, or not running at all. The cluster will say it has 3 control plane nodes, so I can see where that might be confusing, and the customer might just remove the taint after googling for solutions. The upside here is that we make fewer installer changes, but we have to make sure the authoritative controlPlane API flag is set.

However, once we started to think about the configuration aspect, having a machine that inherits from master is something we don't support due to stability concerns, plus there is a high probability that we would want to alter the same fields as the master node configs. So the more we thought about it, the more we came back to the intent: we needed a quasi control plane node, and the correct way to represent that seemed to be an explicit node role that was clear in its purpose.

So the gamble is that we either pay a little complexity upfront, or we pay it later when customers already have workloads running and tough decisions would mean tough conversations. If I recall correctly, those were the main ideas running around my head a few months back when we were experimenting with different implementations.

Contributor

The other observation I would make is that since I've been part of OpenShift, there has been an assumption that control plane nodes are expected to be homogeneous. We softly enforce this for cloud-based IPI installs by allowing a single instance type to be provisioned for all control-plane nodes.

I know we would support breaking this assumption for UPI clusters, and it might also be possible to have mismatched control plane nodes in a baremetal IPI install, but I'm not aware of any testing that verifies that mismatched control plane sizes behave as expected.

Keeping this as a separate node role helps us make this boundary more explicit. This node is expected to be different, and will thus also be tested as such.
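
For comparison, the taint-based alternative discussed in this thread would look roughly like the following: the arbiter machine registers with a dedicated taint, and only arbiter-eligible workloads (etcd and a few essentials) carry a matching toleration. The taint key here is a hypothetical example; the enhancement instead opts for an explicit arbiter node role plus the ControlPlane topology API signal:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical taint an arbiter machine could register with under the
	// taint-based alternative (not the approach the enhancement settled on).
	arbiterTaint := corev1.Taint{
		Key:    "node-role.kubernetes.io/arbiter",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// Only pods that explicitly tolerate the taint (e.g. etcd) would be able
	// to schedule onto the arbiter; everything else would be repelled. The
	// ambiguity called out above comes from components that already tolerate
	// all control-plane taints and would land on the arbiter anyway.
	etcdToleration := corev1.Toleration{
		Key:      arbiterTaint.Key,
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	}

	fmt.Printf("taint=%s toleration=%+v\n", arbiterTaint.Key, etcdToleration)
}
```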

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.