OCPEDGE-1191: feat: initial arbiter cluster enhancement #1674
Conversation
Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
Questions from my side:
- What is the concrete list of control plane components that will run on the arbiter node?
- What non-control plane or control-plane supporting components need to exist on the arbiter node? (e.g. cluster observability)
- How do we let cluster admins (or other RH operators) opt-in to adding components to the arbiter? I'm thinking in cases where a customer has a monitoring workload that they need to run on the control plane, or we have operators with agents that must run on the control plane like ACM or GitOps.
- Are there any considerations needed for components that utilize the `infrastructureTopology` field to determine their replicas? (I believe this applies to observability components like prometheus et al; see the sketch below.)
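For context, a minimal sketch (field values illustrative) of the cluster-scoped Infrastructure object whose topology fields these components consult:

```yaml
# config.openshift.io/v1 Infrastructure: controlPlaneTopology describes how the
# control plane is scheduled, infrastructureTopology is what infrastructure
# workloads such as Prometheus or the router use to pick replica counts.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: HighlyAvailable
  infrastructureTopology: HighlyAvailable
```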
Signed-off-by: ehila <ehila@redhat.com>
Force-pushed from bc94bef to 368a00d
Signed-off-by: ehila <ehila@redhat.com>
updated CEO changes to include idea for wrapping informer and lister to minimize CEO code changes
updated MCO changes to include bootstrap files for arbiter node
added installer changes needed to include new machine types and install config changes
Signed-off-by: ehila <ehila@redhat.com>
This is in good enough shape to move out of draft; I think we captured most of the concerns here.
added appropriate reviewers for the component changes
added more detail around the specific component changes
Signed-off-by: ehila <ehila@redhat.com>
Force-pushed from 3e2bf61 to 643fe55
Signed-off-by: ehila <ehila@redhat.com>
Was Chad's comment addressed? What is running on the arbiter node? Something has to be compromised on the arbiter: CPU, disk, memory? What is compromised on the arbiter? Nodes have classically had performance and runtime issues with slow disks or not enough CPU or memory. Is there a definition of what specs the arbiter has? F1 F2 A1 (full-node-1 full-node-2 arbiter-node-1) ...

Additionally, based on the arbiter's specs: if F1, F2, and A1 are nodes, and F2 goes down for a reboot, then F1 will likely be taking the brunt of the API+DB load of the cluster. Good load balancers route traffic to fast-responding API backends. Usually when a highly available node goes down, there is enough capacity on the other nodes to take that load. If one node is under-specced (A1) and F1 has to take the load, then it stands to reason that the specs for F1 need to be higher.

I'm a bit confused about how we would create a supported cluster in this situation. etcd needs fast disks, so we can't compromise on disk speed on A1. If load shifts from F2 to F1, then F1 should have the capacity of F1 + F2. Optimizing and trimming down OpenShift components might be doable. Instead of creating an arbiter, could we still require 3 nodes but run fewer things?
Right, the arbiter is like a third node, but only the etcd components and a bit of other absolutely essential stuff (e.g. networking components) run on the arbiter. No components like the API server, Console, Monitoring, etc. Exact details need to be defined, though.

F1 and F2 currently, per OCP docs, have a minimum system requirement of 4C/16G. That's just for the control plane. The requirements of the workload need to be added and are not considered here.

etcd and the control plane can work with an effective end-to-end latency (which includes disk) of 500ms using the slow profile, so no concerns here; see [here](https://docs.google.com/presentation/d/1eMz3ON5y1Rk2QI2zH-QUGh5oKoMyOHZrCSSZ8aHQQPQ/edit#slide=id.g26e42bb9faa_0_251). Yes, a slow arbiter node will slow down overall API performance.

During an outage of F2, not all load of F2 shifts to F1; actually next to none does. Take the console as an example: it is replica 2 with a required podAntiAffinity on the hostname, meaning the pod running on F2 will not start on F1. @eggfoobar and @jerpeter1 keep me honest here please.
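For illustration, here is a minimal sketch of the kind of required podAntiAffinity rule described above; the label selector is representative and not copied from the actual console Deployment:

```yaml
# Trimmed Deployment fragment: with 2 replicas and a required anti-affinity rule
# on the hostname topology key, the replica lost when F2 goes down cannot be
# rescheduled onto F1, because F1 already runs a pod matching the selector.
spec:
  replicas: 2
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: console   # illustrative label, not the exact console selector
            topologyKey: kubernetes.io/hostname
```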
@DanielFroehlich All Kube API load from F2 will transition to F1, since F1 will be the only machine with a kube-apiserver running. How many workers will we support? Kubelets and their API calls+watches are usually load balanced between 2-3 masters. This prevents thundering-herd issues on the API servers. With only 1 API server, that will potentially limit the number of workers a customer can add to the cluster.
enhancements/arbiter-clusters.md
Outdated
| [Installer](#installer-changes) | Update install config API to support arbiter machine types |
| [MCO](#mco-changes) | Update validation hook to support arbiter role and add bootstrapping configurations needed for arbiter machines |
| [Kubernetes](#kubernetes-change) | Update allowed `well_known_openshift_labels.go` to include new node role as an upstream carry |
| [ETCD Operator](#etcd-operator-change) | Update operator to deploy operands on both `master` and `arbiter` node roles |
If the arbiter must not be the leader, then that's an additional feature that needs to be listed and designed.
After some discussion on what we want to happen here, we're okay with the arbiter node's etcd member becoming leader. We're under the impression that etcd will only be as fast as its slowest member, so this might be fine for now. I'll document this as an open question and plan to revisit it as its own potential EP, wdyt?
@tjungblu Any concerns with this assumption on my part?
There is no functionality that can prevent it from becoming a leader. I'm mostly concerned about the performance impact reported here: etcd-io/etcd#14501
There are some upstream k8s PRs in the works that enable grpc "circuit breakers", so this will get resolved at some point in the future.
@zaneb - do you have any more feedback, or is this lgtm?
lgtm
/cc @rwsu
added expectations for assisted installer and agent based installer
fix wording and spelling
Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
### OLM Filter Addition

With this change it would be prudent to add the ability to allow operators to specify which control plane topology they support. This gives us more guards against installing layered components on unsupported or unconsidered topologies like the Master+Arbiter in this enhancement.

This is a nice to have before we go GA and we should be looking into adding it to OLM.
+1. The OLM team has some thoughts related to this along the lines of generic cluster-provided properties and layered-product provided constraints. This seems to be one use case for that feature. We'll have to think carefully about implementing this in a more general "properties vs. constraints" kind of design.
I don't know that I would make your GA contingent on this feature though since OCP already has multiple topology variants and OLM does not have this feature yet.
One thing we could consider in the meantime is a hint annotation in the CSV that the OCP console could be made aware of. OCP console already supports some of these annotations, and uses them to filter views presented to users.
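As a purely illustrative sketch of such a hint, the CSV annotation below is hypothetical; the key name and value format are invented here and would need to be defined by OLM and the console:

```yaml
# Hypothetical CSV fragment: a topology hint annotation the console could use to
# filter operators out of views on unsupported topologies. The annotation key and
# value format are invented for illustration, not an existing OLM/console convention.
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0
  annotations:
    operators.openshift.io/supported-control-plane-topologies: "HighlyAvailable,SingleReplica"
```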
Awesome, I have it as a nice-to-have for GA, so we won't let it be a critical part of our checklist. Thanks for your input, Joe!
@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jerpeter1
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@eggfoobar: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
validate some of the desires in this proposal. In its current version this proposal is not exhaustive but will be filled out as we implement these goals.

We currently expect this feature to mainly be used by `baremetal` installs, or
Do you plan to prevent it from being used on other platforms? I'm concerned that this breaks many assumptions that CPMS makes about the control plane, and that it will be strictly incompatible. IIRC baremetal doesn't use CPMS at the moment, so perhaps that issue can be avoided for now
using cloud installs, since we can simply alter the manifests in the pipeline without needing to change the installer.

In the current POC, a cluster was created on AWS with 2 masters running
Ok, so what was CPMS doing in this cluster? I can't see any discussion of CPMS but we need to think about how we will handle this breaking its assumptions of having three control plane nodes
I can answer this, since I've tested this as part of 2NO. It spams the logs about the configuration being invalid, but doesn't take any action.
You didn't happen to have anything recorded that captures those logs, do you?
No, but I could probably get you something like that this afternoon :)
If you happen to be testing, and could grab the CPMS logs, I'd like to see exactly what it's complaining about to understand if it's in a safe broken state
Ahh yes, well that makes sense. You'd have to make sure the installer isn't trying to generate a CPMS with the incorrect number of nodes. Which for 2NO means no CPMS. For arbiter, not sure how that will work
Egli showed me how to disable CPMS in the manifests, which I think will cover our use case for 2NO.
@eggfoobar will probably have some thoughts on how to handle this properly for arbiter.
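For reference, a minimal sketch of what "disabling CPMS in the manifests" could look like, assuming the mechanism is the documented `spec.state: Inactive` setting rather than dropping the manifest entirely (the machine template a real manifest requires is omitted):

```yaml
# ControlPlaneMachineSet held in the Inactive state: the operator observes but
# does not reconcile control plane machines. Template/provider spec omitted.
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  state: Inactive   # no action is taken while Inactive
  replicas: 3       # CPMS only accepts 3 or 5, which a 2-node + arbiter layout will not match
```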
CPMS also has a generator that will try to create a CPMS manifest should the cluster not already have one (which serves for upgrades). We probably want a check in there for the topology and have it disable itself when the topology is not one that it supports 🤔 How is 2NO signalled in the ControlPlaneTopology?
The plan is to add an enum for DualReplica, following the pattern of SingleReplica.
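To make that concrete, here is a sketch of how the proposed value could surface on the cluster Infrastructure object; `DualReplica` is the planned addition described above, not an existing API value:

```yaml
# Existing TopologyMode values include HighlyAvailable, SingleReplica and External;
# DualReplica below is the proposed enum for two-node deployments, which CPMS
# (and its generator) could then check and skip.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: DualReplica   # proposed value, not yet in the API
```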
There are no CPMSs for platform:baremetal afaik.
to run or remove from this list. The list is small already so there might not be a desire to pre-optimize just yet.

| Component | Component Outlook on Arbiter |
How does this list compare to a regular control plane machine? What's not running on the arbiter that would be running were it a normal control plane machine?
@@ -137,6 +137,10 @@ topology migrations.

### Installer Changes

> Note: When we say `MachinePool` in the below statements, we are referring to
> the installer config struct used for generating resources and NOT the Machine
> Config Operator CR.
nit: it's the Machine API not the MCO that has MachinePool CRs.
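For illustration, a sketch of how the installer-config MachinePool entries might look with an arbiter pool; the `arbiter` stanza and its placement are hypothetical, reflecting the direction of this enhancement rather than any released install-config schema:

```yaml
# install-config.yaml fragment. controlPlane and compute use the existing
# installer MachinePool struct; the arbiter pool is a hypothetical addition.
apiVersion: v1
baseDomain: example.com
controlPlane:
  name: master
  replicas: 2
arbiter:            # hypothetical arbiter MachinePool
  name: arbiter
  replicas: 1
compute:
- name: worker
  replicas: 0
```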
service will contain different initial [node-roles](https://github.com/openshift/machine-config-operator/blob/7c5ae75515fe373e3eecee711c18b07d01b5e759/templates/master/01-master-kubelet/_base/units/kubelet.service.yaml#L31) and [taints](https://github.com/openshift/machine-config-operator/blob/7c5ae75515fe373e3eecee711c18b07d01b5e759/templates/master/01-master-kubelet/_base/units/kubelet.service.yaml#L43).
@eggfoobar did we consider reusing the existing control-plane role tag and only adding an extra taint (which only the etcd components would tolerate) via the MachineSet? It seems like that would vastly reduce the complexity we'd have to maintain.
Hey @zaneb, sorry for the late reply, didn’t see the notification for this until today.
That's a great question. In short, we hadn't tried a different taint, but we did originally try `master:NoSchedule`; that worked for the most part, and we did consider using a different taint instead, but a couple of design considerations pulled us away from it.
- We needed a quasi control-plane node that supports HA but should be smaller and treated differently in terms of what runs on it and how we configure it. This node will definitely be smaller with different characteristics and in the future this might even be an ARM node or in a different location.
- This has a high potential to be disruptive to components and the platform, so we need to be explicit about the state of the cluster and allow component owners the signaling to decide how their component handles this deployment type.
Due to the uniqueness of the control plane, being a master node with a different taint poses different challenges. A big one is that things become implicit and a bit ambiguous. We currently have a few operators that tolerate all taints on control plane nodes, etcd being one of them. Updating those components would be easy enough so that we can exclude them from the arbiter node, but they would still need to check some authoritative flag that we are indeed in an arbiter deployment for replica modifications. In that case I think we would keep the ControlPlane API change anyway. So you can probably get away with having taints be your enforcement for what runs on the arbiter, but with the API change we're just taking a different road to the same path. The ambiguity comes from the customer's side: we run the risk of components installed later either running on the arbiter when they shouldn't, because they tolerate all taints on masters, or not running at all. The cluster will say 3 control-plane nodes, so I can see where that might be confusing to the customer, and they might just remove the taint after googling for solutions. The upside here is that we make fewer installer changes, but we have to make sure the authoritative controlPlane API flag is set.

However, once we started to think about the configuration aspect, having a machine that inherits from master is something we don't support due to stability concerns, plus there is a high probability that we would want to alter the same fields as the master node configs. So the more we thought about it, the more we came back to the intent: we needed a quasi control plane node, and the correct way to represent that seemed to be an explicit node role that was clear in its purpose.

So the gamble is we either pay a little complexity upfront, or we pay it later when customers already have workloads running and tough decisions would mean tough conversations. If I recall correctly, those were the main ideas running around my head a few months back when we were experimenting with different implementations.
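To illustrate the alternative being discussed, here is a sketch of the taint-plus-toleration approach; the taint key is invented for illustration and is not something this enhancement defines:

```yaml
# Hypothetical arbiter-only taint applied to the node (e.g. via the MachineSet):
#   key: node-role.kubernetes.io/arbiter-only, effect: NoSchedule (illustrative).
# Only the etcd pods would carry the matching toleration, keeping everything else
# off the arbiter, except for workloads that already tolerate all control plane
# taints, which is the ambiguity discussed above.
tolerations:
- key: node-role.kubernetes.io/arbiter-only   # illustrative taint key
  operator: Exists
  effect: NoSchedule
```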
The other observation I would make is that since I've been part of OpenShift, there has been an assumption that control-plane nodes are expected to be homogeneous. We softly enforce this for cloud-based IPI installs by allowing a single instance type to be provisioned for all control-plane nodes.
I know we would support breaking this assumption for UPI clusters, and it might also be possible to have mismatched control plane nodes in a baremetal IPI install, but I'm not aware of any testing that verifies that mismatched control plane sizes behave as expected.
Keeping this as a separate node role helps us make this boundary more explicit. This node is expected to be different, and will thus also be tested as such.
This enhancement describes adding a new control plane deployment type for OpenShift that would deploy OpenShift with 2 control plane nodes and 1 arbiter node that runs only critical components needed for HA.
For the component team leads, please check the box if you lgtm the changes in this enhancement.