Description
Overview
When pgBackRest backups run on a schedule, PGO creates CronJobs but does not set a concurrencyPolicy on them, which can result in a large number of Pods being scheduled to run.
Environment
- Platform: GKE
- Platform Version: 1.22
- PGO Image Tag: ubi8-5.2.0-0
- Postgres Version: 14
- Storage: VPC
Steps to Reproduce
REPRO
- Configure pgBackRest backups on a schedule
- Introduce load on the Postgres system, such that a backup takes longer to complete than the interval until the next scheduled backup Job
- The second Job will attempt to start a Pod, but the Pod fails, triggering a storm of restarting Pods
EXPECTED
- If a CronJob backup is still in progress, do not start another one until it has completed
ACTUAL
- Multiple Jobs attempt to start and continue to fail because the first Job is still performing its backup.
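The expected behavior above matches what Kubernetes documents for a CronJob whose concurrencyPolicy is Forbid; when the field is unset, Kubernetes defaults to Allow. The following is only a rough sketch of how the three policies differ at a scheduled tick. The `ConcurrencyPolicy` type, the `Decision` struct, and the `Decide` helper are hypothetical local stand-ins for illustration, not the real CronJob controller code or the `k8s.io/api` types.

```go
package main

import "fmt"

// ConcurrencyPolicy is a local stand-in for the values Kubernetes accepts in
// a CronJob's spec.concurrencyPolicy. It is NOT the real k8s.io/api type.
type ConcurrencyPolicy string

const (
	Allow   ConcurrencyPolicy = "Allow"   // Kubernetes default when the field is unset
	Forbid  ConcurrencyPolicy = "Forbid"  // skip the new run if the previous one is still active
	Replace ConcurrencyPolicy = "Replace" // cancel the active run and start the new one
)

// Decision describes what happens when a scheduled time arrives.
type Decision struct {
	StartNew      bool // a new Job is created for this tick
	CancelRunning bool // the still-running Job is cancelled first
}

// Decide is a hypothetical helper that models the documented policy
// semantics; the real controller logic is more involved.
func Decide(policy ConcurrencyPolicy, previousStillRunning bool) Decision {
	if !previousStillRunning {
		// Nothing is running, so every policy starts the new Job.
		return Decision{StartNew: true}
	}
	switch policy {
	case Forbid:
		return Decision{} // skip this tick entirely
	case Replace:
		return Decision{StartNew: true, CancelRunning: true}
	default: // Allow, or unset
		return Decision{StartNew: true}
	}
}

func main() {
	// Simulate a tick arriving while the previous backup is still running.
	for _, p := range []ConcurrencyPolicy{Allow, Forbid, Replace} {
		d := Decide(p, true)
		fmt.Printf("%-7s start=%t cancel=%t\n", p, d.StartNew, d.CancelRunning)
	}
}
```

Under this model, only Forbid avoids starting a second backup while the first still holds the pgBackRest lock. Replace would avoid the overlap too, but at the cost of cancelling the backup in progress.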
Logs
time="2022-10-27T21:15:35Z" level=info msg="crunchy-pgbackrest starts"
time="2022-10-27T21:15:35Z" level=info msg="debug flag set to false"
time="2022-10-27T21:15:35Z" level=info msg="backrest backup command requested"
time="2022-10-27T21:15:35Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=incr]"
time="2022-10-27T21:15:35Z" level=info msg="output=[]"
time="2022-10-27T21:15:35Z" level=info msg="stderr=[ERROR: [050]: unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n HINT: is another pgBackRest process running?\n]"
time="2022-10-27T21:15:35Z" level=fatal msg="command terminated with exit code 50"
Additional Information
Kubernetes launches the Job as scheduled, but when the Pod for that Job exits with a non-success code, Kubernetes treats it as a failure and attempts to re-launch the Pod. That Pod also fails, so Kubernetes launches another one, and so on.
In this case, the Pods are failing because another Job's Pod is still performing the backup. Since we know this will always result in a failure, we should prevent multiple backups from the same CronJob from executing concurrently.
It may be worthwhile to expose some of the other CronJob settings as well, but setting concurrencyPolicy to Forbid should eliminate most of the noise.
Is this the only place we'd need to add this to?
https://github.com/CrunchyData/postgres-operator/blob/master/internal/controller/postgrescluster/pgbackrest.go#L2877-L2890