Skip to content

Conversation

njhale
Copy link
Member

@njhale njhale commented Sep 29, 2021

Propose a feature toggle for Copied CSVs to alleviate associated resource consumption issues on large clusters.

## Drawbacks

- The Operator API is not a "drop-in-place" for Copied CSVs
- There is no equivalent to namespace scoped CSV `List` requests for `Operator` resources; i.e. users must perform individual `Get` requests for each `Operator` they want to inspect, and must comb through `Operator` status to determine if an operator can act on a target namespace.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for us to project the install mode the operator was installed in for the purpose of providing some watch namespace metadata?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to leak old abstractions into our newer APIs if we can help it. I think the OperatorGroup is already considered a component of the Operator resource and it's conditions are aggregated generically. Can we make use of that instead of adding a first-class field?


#### Copied RBAC

In addition to operator discoverability, Copied CSVs also serve as convienient vehicles for the garbage collection of Copied RBAC. These are resources that are copied along with the CSV to provide the operator access to a namespace. To support the loss of Copied CSVs, the RBAC copying logic in OLM will be extended to write additional owner references from copied resources to the related Operator resource.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does that extended logic looks like, especially in the context of upgrading from an OLM release that doesn't contain this feature toggle? I'm likely missing some context on the copied RBAC process - are the copied CSVs placing owner references on the underlying RBAC for a particular namespace, or is the OG resource that owner?

@njhale njhale force-pushed the deprecate/copied-csvs branch from 332bb30 to 68df69c Compare October 13, 2021 14:49
@njhale njhale force-pushed the deprecate/copied-csvs branch 2 times, most recently from b046b72 to 3d9ccce Compare October 25, 2021 15:23
Today, Operator resources are generated on a per-Subscription basis. Since CSVs -- and thus Copied CSVs -- can exist without matching Subscriptions, OLM's controllers will be extended to generate additional Operator resources for these "free CSVs" such that:

- there exists at least one`Operator` resource for every CSV
- there exists at least one `Operator` resource for every `Subscription`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least one or exactly one? when would there be multiple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least one -- we don't preclude users from adding Subscriptions to Operator resources manually, but other than that I don't think there are situations where OLM would generate more than one itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but when would there be more than one operator resource for a given CSV, or more than one operator resource for a given subscription object?

it sounds like you're saying there could be multiple subscriptions for a given operator resource, not multiple operator resources for a given subscription?

Copy link

@spadgett spadgett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@njhale Thanks for the heads up. I suspect this is quite a large impact to console. We'll need to investigate. Adding @TheRealJon who knows the console OLM code best.

cc @kdoberst @alimobrem @jwforres

@njhale njhale force-pushed the deprecate/copied-csvs branch 2 times, most recently from fbe20f2 to 673b9de Compare November 9, 2021 19:50
@njhale njhale changed the title doc(olm): deprecate copied csvs Toggle Copied CSVs Nov 10, 2021

#### OpenShift Console Integration

**Risk:** The OpenShift Console uses Copied CSVs in the rendering of certain pages.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK other than the "Installed Operators" piece, the console (both admin and dev perspectives) relies heavily on Copied CSVs for rendering "Operands-related" pages.

The info arch in the admin console is:

"Installed Operators" page (i.e. a list of Copied CSVs) 
└── "Operator Details" page (e.g. Copied CSV of Strimzi)
            └── "Operand List" page (i.e. instance list of provided APIs through Copied CSV)
                        └── "Operand Details" page (e.g. kafka instance details through Copied CSV)

(For dev console: end-users directly create Operand instance from Dev Catalog page through Copied CSV)

So the risk of Copied CSVs being removed/toggled off is that the end-users (who only have access to their namespaces) would lose the ability to create/manage/delete Operand instances in the console.

e.g. if Copied CSV of Strimzi Operator is gone, as regular users who only have access to “dev” namespace, they can’t create kafka instance in the console through:
/k8s/ns/dev/clusterserviceversions/strimzi-cluster-operator.v0.26.0/kafka.strimzi.io~v1beta2~Kafka/~new

I think it's worth calling out the UI impact when toggled off Copied CSV in the doc:

  • if toggle off the Copied CSV, basically all UI workflows for end-users to manage Operand instances are crippled, so cluster admins are well-aware of this side-effect

  • (I assume this "Copied CSV toggle" is meant for a short-term way to reduce the cluster-loading) so it's important to let cluster admins know they can toggle Copied CSV back on to re-enable Operand instance managing workflow in the console for end-users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlwu2013 Thanks for this!

I used your comment to flesh out the Risks/Mitigations. Now, between this section and the "Loss of Operator Discovery" section before it, I think I've captured your concerns. PTAL.


#### Cons

- Aggregated APIs generally lead to cluster-instability
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have evidence for that? AFAIK packagemanifest is such a aggregated API service. Is it known to be instable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the chief complaint is that aggregated APIs can break k8s discovery, badly. I can try to dig up the k8s issues that reference this for posterity.

To answer your other question: No, to my knowledge packageserver hasn't been a troublemaker 😄

#### Cons

- Aggregated APIs generally lead to cluster-instability
- Aggregated APIs require serving cert management and rotation; i.e. increased installation and lifecycle burden
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could enable users to opt-out of this feature in such a case, which would only put the burden on those users, which require that feature. Many platforms also come with on-board certificate management and rotation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair points; although, I was under the assumption that we've already discussed and dismissed this approach in other venues. Is that not the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels like a big hammer of a solution to use in the short term, given the longer term direction we are taking in solving this problem.

@njhale njhale force-pushed the deprecate/copied-csvs branch from 673b9de to ed578b9 Compare November 16, 2021 14:23
@kevinrizza
Copy link
Member

/approve


#### Orphaned CSVs

Orphaned CSVs are CSVs that were installed via a `Subscription` that no longer exists. Since the lifetimes -- i.e. the period of existence on a cluster -- of `Operator` resources aren't directly tied to the lifetimes of the resources that triggered their generation, the `Operator` resource(s) associated with Orphaned CSVs will continue to exist _after_ a `Subscription` is deleted. If that `Operator` resource is deleted explicitly, OLM will generate another as per the rules [outlined above](#operator-cardinality).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and presumably the Operator resource will be deleted when all the CSVs associated with it go away?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No -- it was an explicit design decision for "componentless" Operator resources to stick around to:

  • support situations where Operator resources are created manually
  • prevent eventual consistency from causing controllers to flap Operators; i.e. if components exist but have yet to be noticed up by the Operator controller, the respective Operator shouldn't be deleted only to be created again once components are noticed

"component" = resource associated with a given Operator resource


**Potential Mitigations:**

- Reduce number of impacted users by limiting toggle to operators installed in `AllNamespace` mode only; that is, `Single` and `MultiNamespace` mode Copied CSVs will not be affected by the toggle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this better, or will it be more confusing the user that they see some, but not all, of the installed operators?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be more confusing, but I've received feedback in other venues along the lines of "the confusion is worth the benefit of retaining discovery for users of Single/MultiNamespace operators"

I thought this was reasonable, which is why I changed the EP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. eventually having the console provide some indication that this is the situation would definitely be useful.


**Potential Mitigations:**

- Console detects when Copied CSVs are diabled and displays a warning to admins and affected users
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not clear from this EP if the plan is to have the console do this as part of implementing this EP, or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just suggestions from my side. I didn't have any expectations about when/if they would be implemented.

#### Cons

- Aggregated APIs generally lead to cluster-instability
- Aggregated APIs require serving cert management and rotation; i.e. increased installation and lifecycle burden
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels like a big hammer of a solution to use in the short term, given the longer term direction we are taking in solving this problem.

Propose a feature toggle for Copied CSVs to alleviate associated
resource consumption issues on large clusters.

Signed-off-by: Nick Hale <njohnhale@gmail.com>
@njhale njhale force-pushed the deprecate/copied-csvs branch from ed578b9 to fe1f533 Compare November 16, 2021 19:16

The toggle will be disabled by default.

The toggle can be set from a command-line option on the olm-operator binary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an example of how one would toggle this feature on with a flag?

'disable-copied-csvs': true
```

When both the [command-line option](#the-toggle) _and_ `cluster` resource toggle field are present, the latter takes precedence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily opposed, but what's the motivation for having two ways to set the toggle? Seems like it would lead to user confusion due to precedence ambiguity (i.e. does anyone actually read the docs?! 😛)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fair point.

My thinking at the time I wrote this:

  • lets you set config atomically; i.e. if I don't want to impose ordering rules between OLM's deployment and the config api
  • you could compose a config controller that updates an independent controller, rather than watching directly in OLM
    • this also gives you the option of making all command-line options opaque in the config schema (for better or worse)
  • feature toggles are typically implemented as command line options (as far as I'm aware)

Granted, that's all pretty watery. I'd be inclined to remove it if you feel it will negatively impact UX.

Copy link
Member

@awgreene awgreene Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having started implementing this feature with both the flag and the CRD, I support removing the flag. The existing implementation leads to a non-intuitive user experience.


- Reduce number of impacted users by limiting toggle to operators installed in `AllNamespace` mode only; that is, `Single` and `MultiNamespace` mode Copied CSVs will not be affected by the toggle.
- OLM fires an alert in all affected namespaces on startup when Copied CSVs are disabled
- Docs suggest admins notify their cluster tenants of installed operators directly; i.e. email, slack, etc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another mitigation: could we generate a Kubernetes Event in each namespace that lists the operators that are watching that namespace whenever the set changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I'll add it.


#### Config CRD w/ Toggle

A novel cluster-scoped `OLMConfig` CRD will be added to OLM. The resource type it defines will allow users to apply a limited set of configurations to the OLM instance on their cluster by creating or modifying a singlton (of that type) on the cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, updating this custom object when OLM is already running will cause OLM to reconfigure itself on the fly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

@bparees
Copy link
Contributor

bparees commented Nov 16, 2021

@njhale can we at least get the follow-up Jira created on the console team board for what we want to see them do here? I just want something to give me confidence we're not going to drop the ball on next steps after we land this.

@bparees
Copy link
Contributor

bparees commented Nov 16, 2021

this looks ok to me.

@kevinrizza
Copy link
Member

Some small things still to address (mostly around the future of observability and not expressly with a toggle that just disables this behavior) but I am going to merge this now, and follow up conversations can happen during implementation

@kevinrizza kevinrizza merged commit 6942363 into operator-framework:master Nov 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.