
Prometheus metrics endpoint #1529


Open: Yannig wants to merge 2 commits into master from the prometheus-metrics branch

Conversation


@Yannig Yannig commented Jun 17, 2021

The purpose of this PR is to set up a metrics endpoint for Prometheus. For now, the metrics collected are relatively limited:

  • Number of databases created (pg_new_db)
  • Database synchronization status between Kube and the operator (pg_sync_status)

The goal is to be able to easily detect that the operator is no longer able to communicate with a PG cluster, for example after setting up Network Policies that are a little too restrictive (any resemblance to real or existing facts is entirely possible).

I'm aware that the number of metrics is relatively limited and that a lot more could be exposed. For now, I would mainly like quick feedback on this feature.
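
For illustration, here is a minimal sketch of how such metrics can be registered and exposed with client_golang. It is a simplified sketch rather than the exact code in this PR; the "cluster" label and the listen address are assumptions.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pg_new_db: incremented whenever the operator creates a new database.
var newDB = promauto.NewCounter(prometheus.CounterOpts{
	Name: "pg_new_db",
	Help: "Number of databases created by the operator.",
})

// pg_sync_status: 1 when the operator can talk to the PG cluster, 0 otherwise.
// The "cluster" label is only an assumption for this sketch.
var syncStatus = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "pg_sync_status",
	Help: "Synchronization status between Kubernetes and the operator.",
}, []string{"cluster"})

func main() {
	newDB.Inc()
	syncStatus.WithLabelValues("acid-minimal-cluster").Set(1)

	// Expose /metrics on the same port referenced by the PodMonitor below.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}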

@Yannig
Author

Yannig commented Jun 22, 2021

The compiled image with this operator is available at the following location: quay.io/yannig/postgres-operator:v1.6.3

Here are the PodMonitor and PrometheusRule I'm using to integrate the postgres operator with the Prometheus Operator.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: postgres-operator
  namespace: operators
spec:
  namespaceSelector:
    matchNames:
    - operators
  podMetricsEndpoints:
  - port: "http"
  selector:
    matchLabels:
      app.kubernetes.io/name: postgres-operator
      app.kubernetes.io/instance: postgres-operator
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-operator
  namespace: operators
spec:
  groups:
  - name: postgres-operator.rules
    rules:
    - alert: PostgresOperatorDBSyncStatus
      for: 15m
      expr: pg_sync_status == 0
      annotations:
        summary: "Unable to communicate with postgres DB cluster"
        description: "Postgres operator is unable to communicate directly with the PG cluster. Maybe a network policy is too restrictive."
      labels:
        severity: critical

You will also need to define the port inside the operator deployment definition, so that the port name ("http") referenced by the PodMonitor matches a named container port:

...
        image: quay.io/yannig/postgres-operator:v1.6.3
        imagePullPolicy: Always
        name: postgres-operator
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 250Mi
...

@Yannig
Author

Yannig commented Jun 28, 2021

Sorry to insist, but are you interested in this feature or not? At a minimum, do you have any feedback on what else should be added?

We have implemented this in our cluster and it now allows us to know very easily whether we have a communication problem between our databases and the operator.

@FxKu
Member

FxKu commented Jul 7, 2021

@Yannig sorry to keep you waiting. I'm not sure we want to add a Prometheus dependency. I thought people were usually solving this via sidecars.

@mboutet

mboutet commented Jul 7, 2021

The sidecars are used to monitor Postgres itself. This PR is for monitoring the operator and having visibility into operator events such as failed syncs.

@MPV

MPV commented Sep 16, 2021

@Yannig Looks like there is a conflict in a file, care to take a look?

@FxKu Up for adding this?

@MPV MPV mentioned this pull request Sep 17, 2021
@Yannig
Author

Yannig commented Sep 17, 2021

@MPV sure, I'll do a rebase with the current branch.

@Yannig
Author

Yannig commented Sep 17, 2021

Rebase done onto the latest version of the master branch.

@Yannig
Author

Yannig commented Sep 17, 2021

By the way, a version 1.7.0 with this patch is available using this image: quay.io/yannig/postgres-operator:v1.7.0

@taxilian

taxilian commented Oct 4, 2021

Twice now I've discovered that my replicas weren't syncing only when I ran out of space on one of them; I'm sure I made some kind of mistake that led to the problem, but just having something that exported the "lag in MB" would be a huge benefit, as it would allow me to set up monitoring and alerting for the problem.

@Yannig
Author

Yannig commented Oct 5, 2021

@taxilian Sure, it could be a good feature. Maybe I can try to implement it.

@HaveFun83

any news here?

@Starefossen

Starefossen commented Jan 24, 2022

@FxKu can we get this in, please? 🙏🏻 We need a way to monitor sync status from the operator's point of view.

@Yannig Yannig force-pushed the prometheus-metrics branch from 5339cd5 to e1c171e Compare January 25, 2022 11:07
@sebastiangaiser

sebastiangaiser commented Jun 2, 2022

Any news here? There is already 1 approval. I would be glad to see the PR merged 😉

@stephan2012
Contributor

Any chance of getting this finally merged?

@Jan-M
Member

Jan-M commented Jan 16, 2023

Hi! Thanks for contributing, and for a fair share of patience. The idea is good and welcome; everyone wants to monitor things. I might just want to challenge the new-database counter: it feels more like an example than real value. Let's maybe agree on what you did in terms of syncs, and expose success and failed sync counts, and maybe a total count of databases observed by the operator at any given time?
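
For reference, the metrics suggested here might look something like the sketch below with client_golang. The names are placeholders for discussion, not part of this PR.

// Sketch of the metrics proposed above (placeholder names, not from the PR).
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Incremented from the operator's sync loop on success or failure.
	SyncSuccessTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "pg_sync_success_total",
		Help: "Number of successful cluster syncs performed by the operator.",
	})
	SyncFailedTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "pg_sync_failed_total",
		Help: "Number of failed cluster syncs performed by the operator.",
	})
	// Updated whenever the operator refreshes the list of databases it manages.
	DatabasesObserved = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "pg_databases_observed",
		Help: "Number of databases currently observed by the operator.",
	})
)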

@FxKu FxKu added this to the 2.0 milestone Jan 16, 2023
@taxilian

I don't think the OP necessarily anticipated that these would be the only metrics collected; rather, they were submitting something to provide a basic framework for adding more.

Personally, the thing I'd want to see most is the "Lag in MB" from patronictl status -- I have had one of the replicas stop syncing correctly a few times and there is no real way to catch it. Maybe sync status gives you that, but it sounds like it's about something else, and it's not specific about what is happening. I'd also want to know if it was just somehow not replicating fast enough, etc.

@jurim76

jurim76 commented Feb 5, 2023

Personally the thing I'd want to see most is the "Lag in MB" from patronictl status -- I have had one of the replicas stop syncing correctly a few times and there is no way to really grab it.

Take a look at https://github.com/gopaytech/patroni_exporter
This could be implemented in the Zalando operator as a single sidecar or combined with another one (for example, a custom postgres/patroni exporter image).

missing closing parenthesis in var requirePrimaryRestartWhenDecreased?

@FxKu FxKu modified the milestones: 1.11.0, 2.0 Jan 22, 2024
@teimyBr

teimyBr commented Aug 9, 2024

Any news on this PR?
