Conversation

joelanford
Member

@joelanford joelanford commented Jan 15, 2025

Motivation for the change:

Whenever the catalog operator reconciles a subscription, it ensures the deprecation conditions are up to date. In doing so, it currently "snapshots" the catalog source for that subscription, issuing one ListBundles call plus a GetPackage call for each package in the catalog. A ListBundles call and hundreds of GetPackage calls are heavyweight, and they have a significant impact on the performance of the catalog pod, the catalog operator, and the network.

Since this happens for each subscription on the cluster, every time each of those subscriptions is reconciled, we very quickly start sending a constant stream of essentially duplicate GRPC requests to the catalog sources for those subscriptions.

This bug affects every single cluster running OLM where even a single subscription exists.

It so happens that there is another place in the catalog operator where it is important to talk to the catalog source GRPC servers: the resolver. That code uses an operator cache provider, an abstraction that caches snapshot results and has mechanisms for invalidating the snapshots when appropriate.

Description of the change:

This PR extracts the cache setup from the depths of the resolver and sets it up to be shared by both the resolver and the deprecation condition updater. It also adds a new counter metric and a log line to the snapshot method, which are useful to:

  • highlight excessive (and potentially unintentional) snapshot calls
  • prove that the deprecation condition handling code no longer causes a snapshot call per subscription reconciliation

I've structured this PR as two commits. The first adds the metric and log line. The second implements the fix. Reviewers can see the improvement by checking out the metric/log commit, running OLM, installing a single operator, and checking the metric (or logs). In my reproduction, the system snapshotted the catalog 20 times as the subscription was reconciled and settled. With the HEAD of the PR running, that number is reduced to 1.
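The metric/log commit can be sketched roughly as follows. The counter name, the function, and the log wording are hypothetical stand-ins (the real code would use a Prometheus counter rather than a bare atomic), but they show how a single instrumented snapshot path makes excessive snapshotting visible:

```go
// Illustrative sketch of instrumenting the snapshot path with a counter
// and a log line. Names are hypothetical; the real PR registers a
// Prometheus counter instead of this stdlib atomic.
package main

import (
	"log"
	"sync/atomic"
)

var snapshotCount atomic.Int64 // stand-in for the PR's counter metric

func snapshot(catalog string) {
	n := snapshotCount.Add(1)
	// The log line exposes excessive snapshotting even without a metrics stack.
	log.Printf("snapshotting catalog source %q (total snapshots: %d)", catalog, n)
	// ... heavyweight ListBundles/GetPackage calls would go here ...
}

func main() {
	snapshot("example-catalog")
}
```

Watching this counter while a subscription settles is exactly the before/after comparison described above: ~20 increments without the fix, 1 with it.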

Architectural changes:

The resolver cache is now reused by the subscription deprecation condition updater.

Testing remarks:

No test changes necessary. We may want to consider adding a Prometheus alerting rule that ensures the new snapshot count metric increments only once per sync interval (I believe the TTL for the snapshot cache is 5m).
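Such a rule might look roughly like the sketch below. The metric name and labels are hypothetical placeholders (substitute whatever the new counter is actually called), and the thresholds assume the 5m cache TTL mentioned above:

```yaml
groups:
  - name: olm-snapshot-churn
    rules:
      - alert: ExcessiveCatalogSnapshots
        # Metric name is a placeholder; use the counter added by this PR.
        # With a 5m snapshot cache TTL, more than one snapshot per 5m
        # window per catalog suggests the cache is being bypassed.
        expr: increase(catalog_source_snapshots_total[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Catalog source snapshotted more than once per cache TTL"
```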

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@joelanford joelanford changed the title use operator cache provider for deprecation updates to limit calls to GRPC server 🐛 use operator cache provider for deprecation updates to limit calls to GRPC server Jan 15, 2025
Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
… GRPC server

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@joelanford joelanford force-pushed the fix-update-deprecations branch from f514bb6 to 0684f5b Compare January 16, 2025 01:26
@perdasilva
Collaborator

Oof - nicely done!

@perdasilva
Collaborator

Looks GREAT to me! Just trying to see if I can get a bead on the CPU measurements before and after (mostly for my own curiosity, and maybe to think about how we can regression-test OLM's resource usage and avoid these issues in the future =S)

@perdasilva
Collaborator

Yeah, checking it out with Grafana: commit 1 took it down after installing a couple of operators, and commit 2 keeps CPU steady. Very nice!

@perdasilva perdasilva added this pull request to the merge queue Jan 16, 2025
Merged via the queue into operator-framework:master with commit 1274d54 Jan 16, 2025
12 checks passed
@jianzhangbjz
Contributor

Yeah, checking it out with Grafana: commit 1 took it down after installing a couple of operators.

Hi @perdasilva, may I know how/where to check it with Grafana? Is it this dashboard: https://telemeter-lts-dashboards.datahub.redhat.com/d/DCSXmFTWk5/olm-installation-usage?orgId=1? Thanks!
