Conversation

joelanford
Member

@joelanford joelanford commented Jan 15, 2025

Motivation for the change:

Whenever the catalog operator reconciles a subscription, it ensures the deprecation conditions are up to date. In doing so, it currently "snapshots" the catalog source for that subscription, issuing one ListBundles call plus a GetPackage call for each package in the catalog. A ListBundles call and hundreds of GetPackage calls are heavyweight, and they have a significant impact on the performance of the catalog pod, the catalog operator, and the network.

Since this happens for each subscription on the cluster, every time each of those subscriptions is reconciled, we very quickly start sending a constant stream of essentially duplicate GRPC requests to the catalog sources for those subscriptions.

This bug affects every single cluster running OLM where even a single subscription exists.

It so happens that there is another place in the catalog operator where it is important to talk to the catalog source GRPC servers: the resolver. That code uses an operator cache provider, an abstraction that caches snapshot results and has mechanisms for invalidating the snapshots when appropriate.

Description of the change:

This PR extracts the cache setup from the depths of the resolver and sets it up to be shared by both the resolver and the deprecation condition updater. It also adds a new counter metric and a log line to the snapshot method, which are useful to:

  • highlight excessive (and potentially unintentional) snapshot calls
  • prove that the deprecation condition handling code no longer causes a snapshot call per subscription reconciliation

I've structured this PR as two commits. The first adds the metric and log line. The second implements the fix. Reviewers can see the improvement by checking out the metric/log commit, running OLM, installing a single operator, and checking the metric (or logs). In my reproduction, the system snapshotted the catalog 20 times as the subscription was reconciled and settled. With the HEAD of the PR running, that number is reduced to 1.
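The metric/log commit can be sketched roughly as follows. The counter name, the function, and the log wording are hypothetical stand-ins (the real code would use a Prometheus counter rather than a bare atomic), but they show how a single instrumented snapshot path makes excessive snapshotting visible:

```go
// Illustrative sketch of instrumenting the snapshot path with a counter
// and a log line. Names are hypothetical; the real PR registers a
// Prometheus counter instead of this stdlib atomic.
package main

import (
	"log"
	"sync/atomic"
)

var snapshotCount atomic.Int64 // stand-in for the PR's counter metric

func snapshot(catalog string) {
	n := snapshotCount.Add(1)
	// The log line exposes excessive snapshotting even without a metrics stack.
	log.Printf("snapshotting catalog source %q (total snapshots: %d)", catalog, n)
	// ... heavyweight ListBundles/GetPackage calls would go here ...
}

func main() {
	snapshot("example-catalog")
}
```

Watching this counter while a subscription settles is exactly the before/after comparison described above: ~20 increments without the fix, 1 with it.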

Architectural changes:

The resolver cache is now reused by the subscription deprecation condition updater.

Testing remarks:

No test changes necessary. We may want to consider adding a Prometheus alerting rule that ensures the new snapshot count metric increments only once per sync interval (I believe the TTL for the snapshot cache is 5m).
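Such a rule might look roughly like the sketch below. The metric name and labels are hypothetical placeholders (substitute whatever the new counter is actually called), and the thresholds assume the 5m cache TTL mentioned above:

```yaml
groups:
  - name: olm-snapshot-churn
    rules:
      - alert: ExcessiveCatalogSnapshots
        # Metric name is a placeholder; use the counter added by this PR.
        # With a 5m snapshot cache TTL, more than one snapshot per 5m
        # window per catalog suggests the cache is being bypassed.
        expr: increase(catalog_source_snapshots_total[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Catalog source snapshotted more than once per cache TTL"
```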

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@joelanford joelanford changed the title use operator cache provider for deprecation updates to limit calls to GRPC server 🐛 use operator cache provider for deprecation updates to limit calls to GRPC server Jan 15, 2025
Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
… GRPC server

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@joelanford joelanford force-pushed the fix-update-deprecations branch from f514bb6 to 0684f5b Compare January 16, 2025 01:26
@perdasilva
Collaborator

Oof - nicely done!

@perdasilva
Collaborator

Looks GREAT to me! Just trying to see if I can get a bead on the CPU measurements before and after (mostly for my own curiosity, and maybe to think about how we can regression-test OLM's resource usage and avoid these issues in the future =S)

@perdasilva
Collaborator

Yeah, checking it out with Grafana: commit 1 took it down after installing a couple of operators, and commit 2 keeps CPU steady. Very nice!

@perdasilva perdasilva added this pull request to the merge queue Jan 16, 2025
Merged via the queue into operator-framework:master with commit 1274d54 Jan 16, 2025
12 checks passed
@jianzhangbjz
Contributor

Yeah, checking it out with Grafana: commit 1 took it down after installing a couple of operators.

Hi @perdasilva, may I know how/where to check it with Grafana? Is it this dashboard: https://telemeter-lts-dashboards.datahub.redhat.com/d/DCSXmFTWk5/olm-installation-usage?orgId=1? Thanks!
