-
Notifications
You must be signed in to change notification settings - Fork 38
Toggle Copied CSVs #92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,204 @@ | ||
--- | ||
title: toggle-copied-csvs | ||
authors: | ||
- "@njhale" | ||
reviewers: | ||
- "@TheRealJon" | ||
- "@spadgett" | ||
- "@bparees" | ||
- "@joelanford" | ||
- "@kevinrizza" | ||
approvers: | ||
- "@bparees" | ||
- "@joelanford" | ||
- "@kevinrizza" | ||
creation-date: 2021-09-27 | ||
last-updated: 2021-11-15 | ||
status: implementable | ||
see-also: | ||
- "/enhancements/olm-toggle-copied-csvs.md" | ||
--- | ||
|
||
# Toggle Copied CSVs | ||
|
||
## Release Signoff Checklist | ||
|
||
- [x] Enhancement is `implementable` | ||
- [x] Design details are appropriately documented from clear requirements | ||
- [ ] Test plan is defined | ||
- [ ] Operational readiness criteria is defined | ||
- [x] Graduation criteria for dev preview, tech preview, GA | ||
|
||
## Summary | ||
|
||
This proposal introduces a novel config [CRD](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/) along with a feature toggle that disables/enables the creation and management of [Copied CSVs](https://olm.operatorframework.io/docs/concepts/crds/operatorgroup/#copied-csvs), so that cluster admins can curb OLM's resource usage on constrained clusters. | ||
|
||
## Motivation | ||
|
||
When an Operator is installed by OLM, a stripped down copy -- e.g. a stub -- of its [CSV](https://olm.operatorframework.io/docs/concepts/crds/clusterserviceversion/) is created in every namespace the Operator is configured to watch (see [OperatorGroup](https://olm.operatorframework.io/docs/concepts/crds/operatorgroup/); collectively, these are known as _Copied CSVs_. This means that the number of Copied CSVs on a given cluster grows proportionally with both the number of namespaces and installed Operators. In the worst case, a cluster can have a Copied CSV for every Operator in every namespace, that is `|copied| = |operators|*|namespaces|`. For especially large clusters, with namespaces and installed operators tending in the hundreds or thousands, Copied CSVs consume an untenable amount of resources; e.g. OLM's memory usage, cluster Etcd limits, networking, etc. For these reasons, it's imperative that Copied CSVs be replaced with a more resource friendly successor. In the meantime, while no such successor exists, OLM should have a knob that helps curtail its high resource utilization. | ||
|
||
### Goals | ||
|
||
- Alleviate the resource utilization incurred by Copied CSVs | ||
- Don't block upgrades for customers of OpenShift | ||
|
||
### Non-Goals | ||
|
||
- Provide a replacement for Copied CSVs | ||
|
||
## Proposal | ||
|
||
Provide a CRD for OLM that allows users to enable and disable Copied CSVs for CSVs installed in `AllNamespace` mode, then for each non-Copied CSV, a single [Operator](https://github.com/operator-framework/enhancements/blob/master/enhancements/simplify-olm-apis.md#proposal) resource will be generated. This cluster-scoped resource will take the place of any related Copied CSVs. | ||
|
||
### User Stories | ||
|
||
#### Toggle Copied CSVs Off | ||
|
||
As a Cluster Admin, I want a way to disable Copied CSVs so that I can reduce OLM's resource consumption on my cluster. | ||
|
||
#### Toggle Copied CSVs On | ||
|
||
As a Cluster Admin, I want a way to enable Copied CSVs if I've reconfigured the cluster in ways that mitigate the problem; e.g. uninstalled operators from `AllNamespace` `OperatorGroups`. | ||
|
||
### Implementation Details/Notes/Constraints | ||
|
||
#### Operator Cardinality | ||
|
||
Today, Operator resources are generated on a per-Subscription basis. Since CSVs -- and thus Copied CSVs -- can exist without matching Subscriptions, OLM's controllers will be extended to generate additional Operator resources for these "free CSVs" such that: | ||
|
||
- there exists at least one `Operator` resource for every CSV | ||
- there exists at least one `Operator` resource for every `Subscription` | ||
|
||
That is, all CSVs installed by a given `Subscription` share an `Operator` resource __and__ any CSVs that weren't installed by a `Subscription` are related to their own `Operator` resource. | ||
|
||
#### Orphaned CSVs | ||
|
||
Orphaned CSVs are CSVs that were installed via a `Subscription` that no longer exists. Since the lifetimes -- i.e. the period of existence on a cluster -- of `Operator` resources aren't directly tied to the lifetimes of the resources that triggered their generation, the `Operator` resource(s) associated with Orphaned CSVs will continue to exist _after_ a `Subscription` is deleted. If that `Operator` resource is deleted explicitly, OLM will generate another as per the rules [outlined above](#operator-cardinality). | ||
|
||
#### Copied RBAC | ||
|
||
In addition to operator discoverability, Copied CSVs also serve as convienient vehicles for the garbage collection of Copied RBAC. These are resources that are copied along with the CSV to provide the operator access to a namespace. To support the loss of Copied CSVs, the RBAC copying logic in OLM will be extended to write additional owner references from copied resources to the related Operator resource. | ||
njhale marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
#### The Toggle | ||
|
||
A feature toggle will be added to control OLM's use of Copied CSVs. When enabled (toggled on), OLM will delete all existing Copied CSVs and will not generate any more. When disabled (toggled off), OLM will generate Copied CSVs (much like it does today). | ||
njhale marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
When toggled (on then off), OLM will recreate any missing Copied CSVs and reassociate the owner references of related copied resources with them (i.e. the inverse operation of the one described in the [Copied RBAC](#copied-rbac) section). | ||
|
||
The toggle will be disabled by default. | ||
|
||
The toggle can be set from a command-line option on the olm-operator binary. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add an example of how one would toggle this feature on with a flag? |
||
|
||
#### Config CRD w/ Toggle | ||
|
||
A novel cluster-scoped `OLMConfig` CRD will be added to OLM. The resource type it defines will allow users to apply a limited set of configurations to the OLM instance on their cluster by creating or modifying a singlton (of that type) on the cluster. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to confirm, updating this custom object when OLM is already running will cause OLM to reconfigure itself on the fly? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yup. |
||
|
||
The name of that singleton will be `cluster` and OLM will ignore all other `OLMConfig` resources on a cluster. | ||
|
||
Its spec will contain a Copied CSV feature toggle field. | ||
|
||
e.g. | ||
|
||
```yaml | ||
apiVersion: operators.coreos.com/v1 | ||
kind: OLMConfig | ||
metadata: | ||
name: cluster | ||
spec: | ||
features: | ||
'disable-copied-csvs': true | ||
``` | ||
|
||
When both the [command-line option](#the-toggle) _and_ `cluster` resource toggle field are present, the latter takes precedence. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not necessarily opposed, but what's the motivation for having two ways to set the toggle? Seems like it would lead to user confusion due to precedence ambiguity (i.e. does anyone actually read the docs?! 😛) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a fair point. My thinking at the time I wrote this:
Granted, that's all pretty watery. I'd be inclined to remove it if you feel it will negatively impact UX. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Having started implementing this feature with both the flag and the CRD, I support removing the flag. The existing implementation leads to a non-intuitive user experience. |
||
|
||
When the `cluster` resource isn't present, OLM uses its default settings (in this case, Copied CSVs would be **on** by default). | ||
|
||
### Risks and Mitigations | ||
|
||
#### Loss of Operator Discovery | ||
|
||
**Risks:** When Copied CSVs are disabled, users won't be able to discover what operators are configured to watch a particular namespace unless they have access to their install namespaces. | ||
|
||
**Mitigations:** | ||
|
||
- Reduce number of impacted users by limiting toggle to operators installed in `AllNamespace` mode only; that is, `Single` and `MultiNamespace` mode Copied CSVs will not be affected by the toggle. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this better, or will it be more confusing the user that they see some, but not all, of the installed operators? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it will be more confusing, but I've received feedback in other venues along the lines of "the confusion is worth the benefit of retaining discovery for users of Single/MultiNamespace operators" I thought this was reasonable, which is why I changed the EP. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok. eventually having the console provide some indication that this is the situation would definitely be useful. |
||
- OLM fires an alert in all affected namespaces on startup when Copied CSVs are disabled | ||
- Docs suggest admins notify their cluster tenants of installed operators directly; i.e. email, slack, etc | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another mitigation: could we generate a Kubernetes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good idea! I'll add it. |
||
|
||
#### OpenShift Console Integration | ||
|
||
**Risk:** The OpenShift Console uses Copied CSVs in the rendering of certain pages, so disabling Copied CSVs will negatively impact UX. | ||
|
||
For the **admin console**, Copied CSVs are used to "drill down" on the details of operators watching a particular namespace: | ||
|
||
"Installed Operators" page (i.e. a list of Copied CSVs) | ||
└── "Operator Details" page (e.g. Copied CSV of Strimzi) | ||
└── "Operand List" page (i.e. instance list of provided APIs through Copied CSV) | ||
└── "Operand Details" page (e.g. kafka instance details through Copied CSV) | ||
|
||
For the **dev console**, Copied CSVs drive the creation and tracking of CRs. | ||
|
||
**Mitigations:** | ||
|
||
- Impact to console UX when Copied CSVs are disabled is well documented. | ||
|
||
**Potential Further Mitigations:** | ||
|
||
> **Note:** The mitigations below are **suggestions** for mitigations that could be implemented in the Console if deemed valuable by its maintainers. They are not a prerequisite of any other solutions in this proposal. As such, they are **not** guaranteed implementation. | ||
|
||
- Console detects when Copied CSVs are diabled and displays a warning to admins and affected users | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not clear from this EP if the plan is to have the console do this as part of implementing this EP, or not? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are just suggestions from my side. I didn't have any expectations about when/if they would be implemented. |
||
- A toggle component is added to Console to make re-enabling Copied CSVs easier | ||
njhale marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- A link to revelant docs is displayed to admins and affected users | ||
|
||
|
||
## Drawbacks | ||
|
||
- [Loss of Operator Discovery](#loss-of-operator-discovery) | ||
|
||
## Alternatives | ||
|
||
### Command-line Option Only | ||
|
||
Update the olm-operator `Deployment` to toggle the Copied CSV feature. | ||
|
||
#### Cons | ||
|
||
On an OpenShift cluster, [a CVO override](https://docs.openshift.com/container-platform/4.8/logging/config/cluster-logging-maintenance-support.html#unmanaged-operators_cluster-logging-unsupported) would be required to prevent the updated command-line option in the olm-operator `Deployment` from reverting, but this also puts clusters into an unsupported state and prevents cluster upgrades. | ||
|
||
### OpenShift FeatureGates | ||
|
||
Define a new OpenShift [FeatureGate](https://github.com/openshift/api/blob/master/config/v1/types_feature.go) feature set that, when disabled, disables Copied CSVs. This would be preferred over any command-line option and would require no CVO override. | ||
|
||
#### Cons | ||
|
||
- Requires OLM to be OpenShift-aware | ||
- Could be addressed with a downstream-only controller that deploys OLM with the correct options | ||
- Requires changes OpenShift (a new feature set) | ||
|
||
### Config ConfigMap | ||
|
||
Instead of using a novel CRD to configure OLM, use a ConfigMap. | ||
|
||
#### Cons | ||
|
||
- ConfigMap fields are loosely typed; i.e. lacking validation at the API level | ||
|
||
### New Field (or Annotation) | ||
|
||
Add a new field or annotation to existing types such as `CSVs` or `OperatorGroups` to denote disabling/enabling Copied CSVs for related resources. | ||
|
||
#### Cons | ||
|
||
- CVO would revert changes to resources included with the payload; e.g. the `openshift-operators` `Namespace` or `OperatorGroup` | ||
- Changing existing schemas is questionably backportable (for OpenShift releases) | ||
- Giving the ability to toggle the feature for some portions of a cluster but not others opens the door to an inconsistent user experience; e.g. as a user, I must use two separate methods to be sure of operator availablity. | ||
|
||
### Virtual "InstalledOperator" API | ||
|
||
A "virtual resource" served by an aggregated API -- much like `PackageManifest` -- that provides a namespaced, CSV-like, view of the operators configured to interact with a given namespace. Such a resource would be a direct drop-in-place for a Copied CSV. | ||
|
||
#### Cons | ||
|
||
- Aggregated APIs generally lead to cluster-instability; e.g. breaking cluster API discovery | ||
- Aggregated APIs require serving cert management and rotation; i.e. increased installation and lifecycle burden | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we could enable users to opt-out of this feature in such a case, which would only put the burden on those users, which require that feature. Many platforms also come with on-board certificate management and rotation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair points; although, I was under the assumption that we've already discussed and dismissed this approach in other venues. Is that not the case? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it feels like a big hammer of a solution to use in the short term, given the longer term direction we are taking in solving this problem. |
||
- Virtual resources may impact garbage collection in unforeseen ways |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and presumably the Operator resource will be deleted when all the CSVs associated with it go away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No -- it was an explicit design decision for "componentless" Operator resources to stick around to:
"component" = resource associated with a given Operator resource