When a pgBackRest backup takes longer than the time until the next scheduled run, it creates a storm of Failed Pods #3439

Please ensure you do the following when reporting a bug:

- [x] Provide a concise description of what the bug is.
- [x] Provide information about your environment.
- [x] Provide clear steps to reproduce the bug.
- [x] Attach applicable logs. Please do not attach screenshots showing logs unless you are unable to copy and paste the log data.
- [x] Ensure any code / output examples are [properly formatted](https://docs.github.com/en/github/writing-on-github/basic-writing-and-formatting-syntax#quoting-code) for legibility.

## Overview

When pgBackRest backups run on a schedule, PGO creates CronJobs but does not set a [concurrencyPolicy](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/cron-job-v1/), so the Kubernetes default of `Allow` applies. This can result in a large number of Pods being scheduled to run.

## Environment

Please provide the following details:

- Platform: `GKE`
- Platform Version: 1.22
- PGO Image Tag: `ubi8-5.2.0-0`
- Postgres Version: `14`
- Storage: VPC

## Steps to Reproduce

### REPRO

Provide steps to get to the error condition:

1. Configure pgBackRest on a schedule
2. Introduce load on the Postgres system, so that a backup takes longer to complete than the interval between scheduled backups
3. The second Job will attempt to start a Pod but fail, triggering a storm of restarting Pods

### EXPECTED

1. If a CronJob backup is still in progress, do not start another one until it has completed

### ACTUAL

1. Multiple Jobs attempt to start and continue to fail because the first Job is still performing its backup.

## Logs
```
time="2022-10-27T21:15:35Z" level=info msg="crunchy-pgbackrest starts"
time="2022-10-27T21:15:35Z" level=info msg="debug flag set to false"
time="2022-10-27T21:15:35Z" level=info msg="backrest backup command requested"
time="2022-10-27T21:15:35Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=incr]"
time="2022-10-27T21:15:35Z" level=info msg="output=[]"
time="2022-10-27T21:15:35Z" level=info msg="stderr=[ERROR: [050]: unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n       HINT: is another pgBackRest process running?\n]"
time="2022-10-27T21:15:35Z" level=fatal msg="command terminated with exit code 50"
```

## Additional Information

Kubernetes launches the Job as scheduled, but when the Pod for that Job exits with a non-success code, Kubernetes treats it as a failure and attempts to re-launch the Pod. That Pod also fails, so Kubernetes launches another one, and so on.

In this case, the Pods are failing because another Job's Pod is still performing the backup. Since we know this will always result in a failure, we should prevent multiple backups from the same CronJob from executing concurrently.

It may be worthwhile to expose some of the other CronJob settings as well, but setting `concurrencyPolicy` to `Forbid` should eliminate most of the noise.

Is this the only place we'd need to add this to?
https://github.com/CrunchyData/postgres-operator/blob/master/internal/controller/postgrescluster/pgbackrest.go#L2877-L2890
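As a rough sketch of the shape of the proposed change, the snippet below sets the policy when building a CronJob spec. To stay self-contained it uses trimmed stand-in types that mirror the field and constant names in `k8s.io/api/batch/v1` (`CronJobSpec.ConcurrencyPolicy`, `ForbidConcurrent`); the real PGO code would use the upstream types, and `backupCronJobSpec` is a hypothetical helper, not the function at the link above.

```go
package main

import "fmt"

// ConcurrencyPolicy and CronJobSpec are trimmed stand-ins for the types of
// the same name in k8s.io/api/batch/v1; only the relevant fields are kept.
type ConcurrencyPolicy string

const (
	AllowConcurrent   ConcurrencyPolicy = "Allow" // API default when unset
	ForbidConcurrent  ConcurrencyPolicy = "Forbid"
	ReplaceConcurrent ConcurrencyPolicy = "Replace"
)

type CronJobSpec struct {
	Schedule          string
	ConcurrencyPolicy ConcurrencyPolicy
}

// backupCronJobSpec is a hypothetical helper that builds the spec for a
// scheduled backup and forbids overlapping runs.
func backupCronJobSpec(schedule string) CronJobSpec {
	return CronJobSpec{
		Schedule:          schedule,
		ConcurrencyPolicy: ForbidConcurrent,
	}
}

func main() {
	spec := backupCronJobSpec("*/30 * * * *") // example schedule, assumed
	fmt.Printf("schedule=%q concurrencyPolicy=%s\n", spec.Schedule, spec.ConcurrencyPolicy)
}
```

With `Forbid`, a run that comes due while the previous Job is still active is skipped rather than started, which matches the behavior requested under EXPECTED.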
