-
Notifications
You must be signed in to change notification settings - Fork 38
Toggle Copied CSVs #92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Toggle Copied CSVs #92
Conversation
5e90729
to
332bb30
Compare
## Drawbacks | ||
|
||
- The Operator API is not a "drop-in-place" for Copied CSVs | ||
- There is no equivalent to namespace scoped CSV `List` requests for `Operator` resources; i.e. users must perform individual `Get` requests for each `Operator` they want to inspect, and must comb through `Operator` status to determine if an operator can act on a target namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense for us to project the install mode the operator was installed in for the purpose of providing some watch namespace metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to leak old abstractions into our newer APIs if we can help it. I think the OperatorGroup is already considered a component of the Operator
resource and it's conditions are aggregated generically. Can we make use of that instead of adding a first-class field?
|
||
#### Copied RBAC | ||
|
||
In addition to operator discoverability, Copied CSVs also serve as convienient vehicles for the garbage collection of Copied RBAC. These are resources that are copied along with the CSV to provide the operator access to a namespace. To support the loss of Copied CSVs, the RBAC copying logic in OLM will be extended to write additional owner references from copied resources to the related Operator resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does that extended logic looks like, especially in the context of upgrading from an OLM release that doesn't contain this feature toggle? I'm likely missing some context on the copied RBAC process - are the copied CSVs placing owner references on the underlying RBAC for a particular namespace, or is the OG resource that owner?
332bb30
to
68df69c
Compare
b046b72
to
3d9ccce
Compare
Today, Operator resources are generated on a per-Subscription basis. Since CSVs -- and thus Copied CSVs -- can exist without matching Subscriptions, OLM's controllers will be extended to generate additional Operator resources for these "free CSVs" such that: | ||
|
||
- there exists at least one`Operator` resource for every CSV | ||
- there exists at least one `Operator` resource for every `Subscription` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least one or exactly one? when would there be multiple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least one -- we don't preclude users from adding Subscriptions
to Operator
resources manually, but other than that I don't think there are situations where OLM would generate more than one itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but when would there be more than one operator resource for a given CSV, or more than one operator resource for a given subscription object?
it sounds like you're saying there could be multiple subscriptions for a given operator resource, not multiple operator resources for a given subscription?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@njhale Thanks for the heads up. I suspect this is quite a large impact to console. We'll need to investigate. Adding @TheRealJon who knows the console OLM code best.
fbe20f2
to
673b9de
Compare
|
||
#### OpenShift Console Integration | ||
|
||
**Risk:** The OpenShift Console uses Copied CSVs in the rendering of certain pages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK other than the "Installed Operators" piece, the console (both admin and dev perspectives) relies heavily on Copied CSVs for rendering "Operands-related" pages.
The info arch in the admin console is:
"Installed Operators" page (i.e. a list of Copied CSVs)
└── "Operator Details" page (e.g. Copied CSV of Strimzi)
└── "Operand List" page (i.e. instance list of provided APIs through Copied CSV)
└── "Operand Details" page (e.g. kafka instance details through Copied CSV)
(For dev console: end-users directly create Operand instance from Dev Catalog page through Copied CSV)
So the risk of Copied CSVs being removed/toggled off is that the end-users (who only have access to their namespaces) would lose the ability to create/manage/delete Operand instances in the console.
e.g. if Copied CSV of Strimzi Operator is gone, as regular users who only have access to “dev” namespace, they can’t create kafka instance in the console through:
/k8s/ns/dev/clusterserviceversions/strimzi-cluster-operator.v0.26.0/kafka.strimzi.io~v1beta2~Kafka/~new
I think it's worth calling out the UI impact when toggled off Copied CSV in the doc:
-
if toggle off the Copied CSV, basically all UI workflows for end-users to manage Operand instances are crippled, so cluster admins are well-aware of this side-effect
-
(I assume this "Copied CSV toggle" is meant for a short-term way to reduce the cluster-loading) so it's important to let cluster admins know they can toggle Copied CSV back on to re-enable Operand instance managing workflow in the console for end-users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tlwu2013 Thanks for this!
I used your comment to flesh out the Risks/Mitigations. Now, between this section and the "Loss of Operator Discovery" section before it, I think I've captured your concerns. PTAL.
|
||
#### Cons | ||
|
||
- Aggregated APIs generally lead to cluster-instability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have evidence for that? AFAIK packagemanifest
is such a aggregated API service. Is it known to be instable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the chief complaint is that aggregated APIs can break k8s discovery, badly. I can try to dig up the k8s issues that reference this for posterity.
To answer your other question: No, to my knowledge packageserver
hasn't been a troublemaker 😄
#### Cons | ||
|
||
- Aggregated APIs generally lead to cluster-instability | ||
- Aggregated APIs require serving cert management and rotation; i.e. increased installation and lifecycle burden |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could enable users to opt-out of this feature in such a case, which would only put the burden on those users, which require that feature. Many platforms also come with on-board certificate management and rotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair points; although, I was under the assumption that we've already discussed and dismissed this approach in other venues. Is that not the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels like a big hammer of a solution to use in the short term, given the longer term direction we are taking in solving this problem.
673b9de
to
ed578b9
Compare
/approve |
|
||
#### Orphaned CSVs | ||
|
||
Orphaned CSVs are CSVs that were installed via a `Subscription` that no longer exists. Since the lifetimes -- i.e. the period of existence on a cluster -- of `Operator` resources aren't directly tied to the lifetimes of the resources that triggered their generation, the `Operator` resource(s) associated with Orphaned CSVs will continue to exist _after_ a `Subscription` is deleted. If that `Operator` resource is deleted explicitly, OLM will generate another as per the rules [outlined above](#operator-cardinality). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and presumably the Operator resource will be deleted when all the CSVs associated with it go away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No -- it was an explicit design decision for "componentless" Operator resources to stick around to:
- support situations where Operator resources are created manually
- prevent eventual consistency from causing controllers to flap Operators; i.e. if components exist but have yet to be noticed up by the Operator controller, the respective Operator shouldn't be deleted only to be created again once components are noticed
"component" = resource associated with a given Operator resource
|
||
**Potential Mitigations:** | ||
|
||
- Reduce number of impacted users by limiting toggle to operators installed in `AllNamespace` mode only; that is, `Single` and `MultiNamespace` mode Copied CSVs will not be affected by the toggle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this better, or will it be more confusing the user that they see some, but not all, of the installed operators?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will be more confusing, but I've received feedback in other venues along the lines of "the confusion is worth the benefit of retaining discovery for users of Single/MultiNamespace operators"
I thought this was reasonable, which is why I changed the EP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. eventually having the console provide some indication that this is the situation would definitely be useful.
|
||
**Potential Mitigations:** | ||
|
||
- Console detects when Copied CSVs are diabled and displays a warning to admins and affected users |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not clear from this EP if the plan is to have the console do this as part of implementing this EP, or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are just suggestions from my side. I didn't have any expectations about when/if they would be implemented.
#### Cons | ||
|
||
- Aggregated APIs generally lead to cluster-instability | ||
- Aggregated APIs require serving cert management and rotation; i.e. increased installation and lifecycle burden |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels like a big hammer of a solution to use in the short term, given the longer term direction we are taking in solving this problem.
Propose a feature toggle for Copied CSVs to alleviate associated resource consumption issues on large clusters. Signed-off-by: Nick Hale <njohnhale@gmail.com>
ed578b9
to
fe1f533
Compare
|
||
The toggle will be disabled by default. | ||
|
||
The toggle can be set from a command-line option on the olm-operator binary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an example of how one would toggle this feature on with a flag?
'disable-copied-csvs': true | ||
``` | ||
|
||
When both the [command-line option](#the-toggle) _and_ `cluster` resource toggle field are present, the latter takes precedence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily opposed, but what's the motivation for having two ways to set the toggle? Seems like it would lead to user confusion due to precedence ambiguity (i.e. does anyone actually read the docs?! 😛)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fair point.
My thinking at the time I wrote this:
- lets you set config atomically; i.e. if I don't want to impose ordering rules between OLM's deployment and the config api
- you could compose a config controller that updates an independent controller, rather than watching directly in OLM
- this also gives you the option of making all command-line options opaque in the config schema (for better or worse)
- feature toggles are typically implemented as command line options (as far as I'm aware)
Granted, that's all pretty watery. I'd be inclined to remove it if you feel it will negatively impact UX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having started implementing this feature with both the flag and the CRD, I support removing the flag. The existing implementation leads to a non-intuitive user experience.
|
||
- Reduce number of impacted users by limiting toggle to operators installed in `AllNamespace` mode only; that is, `Single` and `MultiNamespace` mode Copied CSVs will not be affected by the toggle. | ||
- OLM fires an alert in all affected namespaces on startup when Copied CSVs are disabled | ||
- Docs suggest admins notify their cluster tenants of installed operators directly; i.e. email, slack, etc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another mitigation: could we generate a Kubernetes Event
in each namespace that lists the operators that are watching that namespace whenever the set changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! I'll add it.
|
||
#### Config CRD w/ Toggle | ||
|
||
A novel cluster-scoped `OLMConfig` CRD will be added to OLM. The resource type it defines will allow users to apply a limited set of configurations to the OLM instance on their cluster by creating or modifying a singlton (of that type) on the cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, updating this custom object when OLM is already running will cause OLM to reconfigure itself on the fly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
@njhale can we at least get the follow-up Jira created on the console team board for what we want to see them do here? I just want something to give me confidence we're not going to drop the ball on next steps after we land this. |
this looks ok to me. |
Some small things still to address (mostly around the future of observability and not expressly with a toggle that just disables this behavior) but I am going to merge this now, and follow up conversations can happen during implementation |
Propose a feature toggle for Copied CSVs to alleviate associated resource consumption issues on large clusters.