🐛 use operator cache provider for deprecation updates to limit calls to GRPC server #3490
Motivation for the change:
Whenever the catalog operator reconciles a subscription, it ensures the deprecation conditions are up to date. In doing so, it currently "snapshots" the catalog source for that subscription, causing one `ListBundles` call and a `GetPackage` call for each package in the catalog. `ListBundles` calls and hundreds of `GetPackage` calls are heavyweight, and they have a significant impact on the performance of the catalog pod, the catalog operator, and the network. Since this happens for each subscription on the cluster, every time each of those subscriptions is reconciled, we very quickly start sending a constant stream of essentially duplicate GRPC requests to the catalog sources for those subscriptions.
This bug affects every single cluster running OLM where even a single subscription exists.
It so happens that there is another place in the catalog operator where it is important to talk to the catalog source GRPC servers: the resolver. In that code, an operator provider cache is used. This abstraction caches snapshot results and has mechanisms for invalidating the snapshots when appropriate.
Description of the change:
This PR extracts the cache setup from the depths of the resolver and sets it up to be shared by both the resolver and the deprecation condition updater. This PR also adds a new counter metric and a log line to the snapshot method, which make it easy to observe how often catalog sources are snapshotted.
I've structured this PR as two commits. The first adds the metric and log line. The second implements the fix. Reviewers can see the improvement by checking out the metric/log commit, running OLM, installing a single operator, and checking the metric (or logs). In my reproduction, the system snapshotted the catalog 20 times as the subscription was reconciled and settled. With the HEAD of the PR running, that number is reduced to 1.
Architectural changes:
The resolver cache is now re-used with the subscription deprecation condition updater
Testing remarks:
No test changes necessary. We may want to consider constructing a Prometheus alerting rule that ensures that the new snapshot count metric only increments once per sync interval (I think the TTL for the snapshot cache is 5m).
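A rule along these lines could catch regressions. This is only a hedged sketch: the metric name `catalog_source_snapshots_total` and its labels are assumptions, not necessarily what the PR's commit actually registers, and the thresholds would need tuning against the real sync interval.

```yaml
# Hypothetical alerting rule; metric name and labels are assumed for
# illustration and should be replaced with the metric this PR adds.
groups:
  - name: olm-catalog-snapshots
    rules:
      - alert: ExcessiveCatalogSnapshots
        # With a ~5m snapshot TTL, more than one snapshot per catalog per
        # 5m window suggests the shared cache is being bypassed.
        expr: increase(catalog_source_snapshots_total[5m]) > 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Catalog source is being snapshotted more often than the cache TTL should allow"
```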