Prometheus metrics endpoint #1529
base: master
Conversation
The compiled image with this operator is available at the following location: Here are the PodMonitor and PrometheusRule I'm using to integrate the postgres operator with the Prometheus Operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: postgres-operator
  namespace: operators
spec:
  namespaceSelector:
    matchNames:
      - operators
  podMetricsEndpoints:
    - port: "http"
  selector:
    matchLabels:
      app.kubernetes.io/name: postgres-operator
      app.kubernetes.io/instance: postgres-operator
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-operator
  namespace: operators
spec:
  groups:
    - name: postgres-operator.rules
      rules:
        - alert: PostgresOperatorDBSyncStatus
          for: 15m
          expr: pg_sync_status == 0
          annotations:
            summary: "Unable to communicate with postgres DB cluster"
            description: "Postgres operator is unable to communicate directly with the PG cluster. Maybe a network policy is too restrictive."
          labels:
            severity: critical
```

You will also need to define the port inside the operator deployment definition:

```yaml
...
        image: quay.io/yannig/postgres-operator:v1.6.3
        imagePullPolicy: Always
        name: postgres-operator
        ports:
          - containerPort: 8080
            name: http
            protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 250Mi
...
```
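For reference, the endpoint only needs to speak the Prometheus text exposition format. Here is a minimal, hedged sketch in Go using only the standard library; the names (`syncOK`, `renderMetrics`) are illustrative and not taken from the PR, which presumably wires the metric into the operator's own HTTP server:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// syncOK holds 1 while the operator can reach the PG cluster, 0 otherwise.
// The operator's sync loop would update it after each sync attempt.
var syncOK atomic.Int64

// renderMetrics emits the gauge in Prometheus text exposition format,
// with no dependency on the client_golang library.
func renderMetrics() string {
	return fmt.Sprintf(
		"# HELP pg_sync_status Whether the operator can sync with the PG cluster.\n"+
			"# TYPE pg_sync_status gauge\n"+
			"pg_sync_status %d\n", syncOK.Load())
}

// metricsHandler serves the exposition text; register it with
// http.HandleFunc("/metrics", metricsHandler) and listen on :8080
// to match the containerPort declared in the deployment above.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics())
}

func main() {
	syncOK.Store(1)
	fmt.Print(renderMetrics())
}
```

A scrape of this handler yields exactly what the `pg_sync_status == 0` alert expression above matches against.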
Sorry to insist, but are you interested in this feature or not? Do you have any feedback for me on what should be done, at least in addition? We have implemented this in our cluster, and it now allows us to know very easily if we have a communication problem between our databases and the operator.
@Yannig sorry to keep you waiting. I'm not sure we want to add a Prometheus dependency. I thought people were usually solving this via sidecars.
The sidecars are used to monitor postgres itself. This PR is for monitoring the operator and having visibility into some operator events, such as failed syncs.
@MPV sure, I'll do a rebase with the current branch.
Rebase done against the latest version of the master branch.
By the way, a version 1.7.0 with this patch is available using this image: quay.io/yannig/postgres-operator:v1.7.0
Twice now I've discovered that my replicas weren't syncing only when I ran out of space on one of them. I'm sure I made some kind of mistake which led to the problem, but just having something which exported the "lag in MB" would be a huge benefit, as it would allow me to set up monitoring and alerting for the problem.
@taxilian Sure, it could be a good feature. Maybe I can try to implement it.
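If such a lag metric were exported (say a hypothetical `pg_replication_lag_mb` gauge; that name is invented here for illustration and is not part of this PR), alerting on it could reuse the PrometheusRule pattern already posted earlier in the thread, for example:

```yaml
        - alert: PostgresReplicationLagHigh
          for: 10m
          expr: pg_replication_lag_mb > 100
          annotations:
            summary: "Replica is lagging more than 100 MB behind the primary"
          labels:
            severity: warning
```

The 100 MB threshold is an arbitrary example; tune it to your write volume.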
Any news here?
@FxKu can we get this in, please? 🙏🏻 We need a way to monitor sync status from the operator's point of view.
Any news here? There is already 1 approval. I would be glad to see the PR merged 😉
Any chance of getting this finally merged?
Hi! Thanks for contributing, and for a fair share of patience. The idea is good and welcome; everyone wants to monitor things. I might just want to challenge the new database counter: this feels more like an example than real value. Let's maybe agree on what you did in terms of syncs, and expose success and failed sync counts, and maybe a total count of databases observed by the operator at any given time?
I don't think the OP necessarily anticipated that these would be the only metrics collected; rather, they were submitting something to provide a basic framework to start adding some. Personally, the thing I'd want to see most is the "Lag in MB" from patronictl status -- I have had one of the replicas stop syncing correctly a few times, and there is no way to really catch it. Maybe sync status gives you that, but it sounds like it's more about something else, and additionally it's not specific as to what is happening. I'd also want to know if it was just somehow not replicating fast enough, etc.
Take a look at https://github.com/gopaytech/patroni_exporter |
Missing closing parenthesis in var requirePrimaryRestartWhenDecreased?
Any news on that PR?
The purpose of this PR is to set up an entry point for Prometheus. For now, the metrics collected are relatively limited.
The goal is to be able to easily detect that the operator is no longer able to communicate with a PG cluster after setting up Network Policies that are a little too restrictive (any resemblance to real or existing facts is entirely possible).
I'm aware that the number of metrics is relatively limited and that it would be possible to collect a lot more. In fact, I would like to get quick feedback on this feature.