Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.0
Where do you run it - cloud or metal? Bare Metal K8s
Are you running Postgres Operator in production? yes
Type of issue? question
Hi,
we use the Postgres operator (v1.8.0) to manage 50 to 200 Postgres clusters per Kubernetes cluster, and it’s working great. Thank you.
In our larger Kubernetes clusters, however, it can take quite some time (5–20 min) until a change to a PostgreSQL CR gets picked up and applied by the operator. This also applies to the creation of new databases.
I don’t know if this kind of behaviour is expected for one operator handling so many Postgres clusters or if it can be improved.
There are some things we already tried that didn’t have much effect on performance:
Adding more resources to the Postgres Operator containers (currently we use a limit of 2 CPUs and 500 MiB memory, and the Prometheus graphs don’t show either CPU or memory being fully utilized).
Doubling the number of workers in the OperatorConfiguration from 8 to 16 (see the sketch below).
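For reference, this is roughly what that change looks like in the OperatorConfiguration CR we apply (a minimal sketch; the metadata name is ours and all other fields are trimmed, so treat it as illustrative rather than a complete manifest):

```yaml
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgres-operator-configuration   # our name; yours will likely differ
configuration:
  # number of parallel sync workers the operator runs (we doubled this from the default of 8)
  workers: 16
  # all other settings omitted here / left at their defaults
```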
We also see some log messages from the operator that might indicate a performance problem on the Kubernetes API side, but that we can’t interpret properly at the moment. They all look more or less like this:
I0907 13:40:40.505552 1 request.go:665] Waited for 1.197988032s due to client-side throttling, not priority and fairness, request: GET:https://xxx:443/api/v1/namespaces/yyy/serviceaccounts/postgres-pod
But this could also be a red herring.
If you have any pointers you can share, that would be really helpful.
If not, that’s totally fine too.
We really appreciate all the work you have put into this operator so far. Thanks.
Hi @gc-jro, this is indeed an issue we face as well for operators managing hundreds of clusters. A bottleneck has been described in good detail in #879. We have to find a way so that sync problems in one cluster do not block other clusters handled by the same worker.
A first step is to improve the logs so we can actually see how long the waiting time has been. Then we should tackle the bottleneck itself. It's on our to-do list, but given the time of year we're very busy with internal operational workload.
I just ran into this issue as well: a CR change took 10 minutes, with 300 Postgres clusters in total. Are there any configuration changes you could share that help speed things up while waiting for #879? I have set repair_period: 1m and resync_period: 5m (shown below) and am unsure whether that makes things better or worse.
Let me know if there is anything I can help with, like contributing code. Thanks!
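For completeness, here is roughly how those settings sit in my OperatorConfiguration (a sketch with everything else omitted; whether these values are sensible is exactly my question):

```yaml
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgres-operator-configuration
configuration:
  resync_period: 5m   # full sync of all clusters roughly every 5 minutes (default is 30m)
  repair_period: 1m   # interval between repair runs (default is 5m)
  workers: 8          # left at the default
```

My (possibly wrong) understanding is that a shorter resync_period makes the operator re-queue every cluster more often, so with ~300 clusters it could add work for the workers rather than reduce the waiting time.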