
Operator performance - It takes 5 - 20 min to reconcile a CR change #2036

Open
gc-jro opened this issue Sep 7, 2022 · 2 comments

gc-jro commented Sep 7, 2022

  • Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.0
  • Where do you run it - cloud or metal? Bare Metal K8s
  • Are you running Postgres Operator in production? yes
  • Type of issue? question

Hi,

We use the Postgres Operator (v1.8.0) to manage 50 to 200 Postgres clusters per Kubernetes cluster, and it’s working great. Thank you.

In our larger Kubernetes clusters, however, it can take quite some time (5–20 min) until a change to a PostgreSQL CR gets picked up and applied by the operator. This also applies to the creation of new databases.

I don’t know whether this kind of behaviour is expected for one operator handling so many Postgres clusters or whether it can be improved.
Here are some things we have already tried that didn’t have much effect on performance:

  • Adding more resources to the Postgres Operator containers
    (currently we use a limit of 2 CPUs and 500 MiB memory, and the Prometheus graphs don’t show either CPU or memory being fully utilized)
  • Doubling the number of workers in the OperatorConfiguration from 8 to 16 (see the sketch after this list).
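
For reference, this is roughly the relevant part of our OperatorConfiguration after that change. A minimal sketch only: field names are taken from the v1.8.0 CRD-based configuration as far as I understand it, and the manifest name and values are just examples from our setup.

```yaml
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-default-configuration
configuration:
  # number of parallel workers processing cluster events;
  # each Postgres cluster is handled by one worker
  workers: 16
  # how often every cluster is fully synced against its manifest
  resync_period: 30m
  # how often clusters are checked/repaired outside the regular sync
  repair_period: 5m
```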

We also see some log messages from the operator that might indicate a performance loss on the Kubernetes API side, but that we can’t interpret properly at the moment. They more or less all look similar to:

I0907 13:40:40.505552       1 request.go:665] Waited for 1.197988032s due to client-side throttling, not priority and fairness, request: GET:https://xxx:443/api/v1/namespaces/yyy/serviceaccounts/postgres-pod

But this could also be a red herring.

If you have any pointers that you can share, that would be really helpful.
If not, that’s totally fine too.
We really appreciate all the work you have put into this operator so far. Thanks.

FxKu (Member) commented Sep 16, 2022

Hi @gc-jro, this is indeed an issue we also face for operators managing hundreds of clusters. A bottleneck has been described in good detail in #879. We have to find a way to ensure that sync problems in one cluster do not block other clusters handled by the same worker.

A first step is to improve the logs so we can actually see how long the waiting times are. Then we should tackle the bottleneck itself. It’s on our ToDo list, but given the time of year we’re very busy with internal operational workload.

owenthereal (Contributor) commented Oct 1, 2023

I just ran into this issue: a CR change took 10 minutes to be picked up. There are 300 Postgres clusters in total. Are there any configuration changes you could share that help speed things up while waiting for #879? I have set repair_period: 1m and resync_period: 5m (see the sketch below) and am unsure whether that makes things better or worse.
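
For context, this is roughly what I set in my OperatorConfiguration. A sketch assuming the CRD-based configuration; field names are as I understand them from the docs, and the defaults noted in the comments are to the best of my knowledge.

```yaml
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-default-configuration
configuration:
  workers: 8           # left at the default
  resync_period: 5m    # lowered from the 30m default
  repair_period: 1m    # lowered from the 5m default
```

My thinking was that more frequent syncs would surface CR changes sooner, but with 300 clusters it may just add more work per worker.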

Let me know if there is anything I could help with, like contributing code. Thanks!
