
Prometheus metrics endpoint #1529


Open: Yannig wants to merge 2 commits into master from the prometheus-metrics branch

Conversation


@Yannig Yannig commented Jun 17, 2021

The purpose of this PR is to set up a metrics endpoint for Prometheus. For now, the metrics collected are relatively limited:

  • Number of databases created (pg_new_db)
  • Database synchronization status between Kube and the operator (pg_sync_status)

The goal is to be able to easily detect that the operator is no longer able to communicate with a PG cluster, for example after setting up Network Policies that are a little too restrictive (any resemblance to real or existing facts is entirely possible).

I'm aware that the number of metrics is relatively limited and that a lot more could be exposed. For now, I would mainly like quick feedback on this feature.
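
For illustration, here is a minimal sketch of how such metrics can be registered and exposed with client_golang. It is a simplified sketch rather than the exact code in this PR; the "cluster" label and the listen address are assumptions.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pg_new_db: incremented whenever the operator creates a new database.
var newDB = promauto.NewCounter(prometheus.CounterOpts{
	Name: "pg_new_db",
	Help: "Number of databases created by the operator.",
})

// pg_sync_status: 1 when the operator can talk to the PG cluster, 0 otherwise.
// The "cluster" label is only an assumption for this sketch.
var syncStatus = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "pg_sync_status",
	Help: "Synchronization status between Kubernetes and the operator.",
}, []string{"cluster"})

func main() {
	newDB.Inc()
	syncStatus.WithLabelValues("acid-minimal-cluster").Set(1)

	// Expose /metrics on the same port referenced by the PodMonitor below.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}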

@Yannig
Author

Yannig commented Jun 22, 2021

The compiled image with this operator is available at the following location: quay.io/yannig/postgres-operator:v1.6.3

Here are the PodMonitor and PrometheusRule I'm using to integrate the postgres operator with the Prometheus Operator.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: postgres-operator
  namespace: operators
spec:
  namespaceSelector:
    matchNames:
    - operators
  podMetricsEndpoints:
  - port: "http"
  selector:
    matchLabels:
      app.kubernetes.io/name: postgres-operator
      app.kubernetes.io/instance: postgres-operator
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-operator
  namespace: operators
spec:
  groups:
  - name: postgres-operator.rules
    rules:
    - alert: PostgresOperatorDBSyncStatus
      for: 15m
      expr: pg_sync_status == 0
      annotations:
        summary: "Unable to communicate with postgres DB cluster"
        description: "Postgres operator is unable to communicate directly with the PG cluster. Maybe a network policy is too restrictive."
      labels:
        severity: critical

You will also need to define the port inside the operator deployment definition, so that the port name ("http") referenced by the PodMonitor matches a named container port:

...
        image: quay.io/yannig/postgres-operator:v1.6.3
        imagePullPolicy: Always
        name: postgres-operator
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 250Mi
...

@Yannig
Author

Yannig commented Jun 28, 2021

Sorry to insist, but are you interested in this feature or not? At a minimum, do you have any feedback on what else should be added?

We have implemented this in our cluster and it now allows us to know very easily whether we have a communication problem between our databases and the operator.

@FxKu
Member

FxKu commented Jul 7, 2021

@Yannig sorry to keep you waiting. I'm not sure we want to add a Prometheus dependency. I thought people were usually solving this via sidecars.

@mboutet

mboutet commented Jul 7, 2021

The sidecars are used to monitor Postgres itself. This PR is for monitoring the operator and having visibility into operator events such as failed syncs.

@MPV

MPV commented Sep 16, 2021

@Yannig Looks like there is a conflict in a file, care to take a look?

@FxKu Up for adding this?

@MPV MPV mentioned this pull request Sep 17, 2021
@Yannig
Author

Yannig commented Sep 17, 2021

@MPV sure, I'll do a rebase with the current branch.

@Yannig
Author

Yannig commented Sep 17, 2021

Rebase done onto the latest version of the master branch.

@Yannig
Author

Yannig commented Sep 17, 2021

By the way, a version 1.7.0 with this patch is available using this image: quay.io/yannig/postgres-operator:v1.7.0

@taxilian

taxilian commented Oct 4, 2021

Twice now I've discovered that my replicas weren't syncing only when I ran out of space on one of them; I'm sure I made some kind of mistake that led to the problem, but just having something that exported the "lag in MB" would be a huge benefit, as it would allow me to set up monitoring and alerting for the problem.

@Yannig
Author

Yannig commented Oct 5, 2021

@taxilian Sure, it could be a good feature. Maybe I can try to implement it.

@HaveFun83

any news here?

@Starefossen

Starefossen commented Jan 24, 2022

@FxKu can we get this in, please? 🙏🏻 We need a way to monitor sync status from the operator's point of view.

@Yannig Yannig force-pushed the prometheus-metrics branch from 5339cd5 to e1c171e Compare January 25, 2022 11:07
@sebastiangaiser

sebastiangaiser commented Jun 2, 2022

Any news here? There is already 1 approval. I would be glad to see the PR merged 😉

@stephan2012
Contributor

Any chance of getting this finally merged?

@Jan-M
Member

Jan-M commented Jan 16, 2023

Hi! Thanks for contributing, and for a fair share of patience. The idea is good and welcome; everyone wants to monitor things. I might just want to challenge the new-database counter: it feels more like an example than real value. Let's maybe agree on what you did in terms of syncs, and expose success and failed sync counts, and maybe a total count of databases observed by the operator at any given time?
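
For reference, the metrics suggested here might look something like the sketch below with client_golang. The names are placeholders for discussion, not part of this PR.

// Sketch of the metrics proposed above (placeholder names, not from the PR).
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Incremented from the operator's sync loop on success or failure.
	SyncSuccessTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "pg_sync_success_total",
		Help: "Number of successful cluster syncs performed by the operator.",
	})
	SyncFailedTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "pg_sync_failed_total",
		Help: "Number of failed cluster syncs performed by the operator.",
	})
	// Updated whenever the operator refreshes the list of databases it manages.
	DatabasesObserved = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "pg_databases_observed",
		Help: "Number of databases currently observed by the operator.",
	})
)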

@FxKu FxKu added this to the 2.0 milestone Jan 16, 2023
@taxilian

I don't think the OP necessarily anticipated that these would be the only metrics collected; rather, they were submitting something to provide a basic framework for adding more.

Personally, the thing I'd want to see most is the "Lag in MB" from patronictl status -- I have had one of the replicas stop syncing correctly a few times and there is no real way to catch it. Maybe sync status gives you that, but it sounds like it's about something else, and it's not specific about what is happening. I'd also want to know if it was just somehow not replicating fast enough, etc.

@jurim76

jurim76 commented Feb 5, 2023

Personally the thing I'd want to see most is the "Lag in MB" from patronictl status -- I have had one of the replicas stop syncing correctly a few times and there is no way to really grab it.

Take a look at https://github.com/gopaytech/patroni_exporter
This could be implemented in the Zalando operator as a single sidecar or combined with another one (for example, a custom postgres/patroni exporter image).

missing closing parenthesis in var requirePrimaryRestartWhenDecreased?

@FxKu FxKu modified the milestones: 1.11.0, 2.0 Jan 22, 2024
@teimyBr

teimyBr commented Aug 9, 2024

Any news on this PR?
