WAL-G backup broken since 1.13.0, works in 1.12.2 #2747
Oh no! This doesn't sound nice. Can you share some snippets of your operator configuration and service account so we can try to replicate? Our setup is not that different, but our backups continue to run. Quite a few things have changed, and in your case they may require a different configuration. Spilo has some config means for this, but they are likely not manageable yet by the operator.
@FxKu SA YAML (manually deployed as an additional resource, not part of the Zalando postgres Helm chart)
postgres operator Helm values
postgres cluster YAML
These are the config files of v1.12.2, but as stated earlier, the only thing I then changed is the Helm chart version; I also manually updated the Spilo image and logical backup image, as we use a private repo pull-through. Hope this can help!
What if you change the Docker image to the previous one, ghcr.io/zalando/spilo-16:3.2-p3? Does v1.13.0 continue to work then?
@FxKu I can confirm that the issue lies with the Spilo image: using chart 1.13.0 but Spilo on 3.2-p3, the WAL archiving works correctly.
Sorry for chiming in, my question is not directly related, but the snippets are very useful for it. What is the reason for providing the wal_s3_bucket multiple times, once in the operator config
In this dummy example there is indeed no real benefit to defining it twice, as they are the same. The reason I define it twice is the order of priority: all our Postgres clusters use the S3 path provided in the operator config as a default, but for some clusters (e.g. ones you want to share to create standby replicas) we want to override that S3 location with another bucket or path.
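A minimal sketch of that default-plus-override priority, assuming the Zalando postgres-operator's `configAwsOrGcp.wal_s3_bucket` Helm value as the cluster-wide default and a standby cluster overriding the WAL location via `standby.s3_wal_path` in its manifest (all bucket names and paths below are placeholders):

```yaml
# Operator-wide default (postgres-operator Helm values):
configAwsOrGcp:
  # Every cluster archives WAL here unless its manifest overrides the location.
  wal_s3_bucket: my-default-wal-bucket        # placeholder bucket name

---
# Per-cluster override (postgresql manifest) for a standby cluster
# that replays WAL from a different bucket/path:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-standby-cluster                  # placeholder name
spec:
  standby:
    s3_wal_path: "s3://other-bucket/spilo/source-cluster/wal/"  # placeholder path
```

The per-cluster manifest value takes precedence over the operator-wide default, which is the override behaviour described above.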
Thanks for the quick reply. It makes sense with the override; not sure though why it must be set in the operator config. That seems to be the culprit of the missing
It is caused by wal-g changes: wal-g/wal-g#1377. I'm still investigating the wal-g bug:
Please answer some short questions which should help us to understand your problem / question better.
We create logical backups (pg_dump) and WAL-G+basebackups for our clusters.
We use a k8s service account which is bound to an IAM role for S3 access.
postgres-operator is deployed using helm.
When running operator version 1.12.2 (and Spilo 16:3.2-p3), both the logical backup cronjob and WAL-G+basebackups work as intended.
I validate the WAL backup using:

```shell
PGUSER=postgres
envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
```
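For a quicker check than running a full basebackup, wal-g itself can be queried from the same environment. A sketch, assuming a shell inside the Spilo pod and the Spilo default envdir path (the `grep` pattern is only illustrative):

```shell
# Show the credential-related variables wal-g will see, to verify the
# IAM-role web identity token is actually mounted and exported:
envdir /run/etc/wal-e.d/env env | grep -E 'AWS_|WALG_|WALE_'

# List basebackups already in the bucket; a failure here isolates the
# problem to wal-g/S3 authentication rather than the backup script:
envdir /run/etc/wal-e.d/env wal-g backup-list
```

If `backup-list` fails with the same authentication error while the logical backup job succeeds, that narrows the regression to wal-g's credential handling rather than the IAM role itself.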
When I update the postgres-operator to 1.13.0 (and Spilo to 16:3.3-p1), the logical backups still work, but the WAL+basebackup does not work anymore.
When manually trying to create a basebackup with the same command, I get this error:
It seems to be an error specific to using a service account assuming an IAM role to access S3, specifically when running basebackup.
Logical backups are able to put the pg_dump on S3 via the same authentication method.
No other values were changed other than the Spilo image and the Helm chart version.
Let me know if you need additional information.
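For reference, the service-account-to-IAM-role binding described above is typically the EKS IRSA annotation. A minimal sketch with placeholder names and ARN:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: postgres-pod-sa     # placeholder; must match the operator's pod_service_account_name
  namespace: postgres       # placeholder
  annotations:
    # With this annotation, EKS injects AWS_ROLE_ARN and
    # AWS_WEB_IDENTITY_TOKEN_FILE into pods using this service account:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/wal-backup-role  # placeholder ARN
```

Both the logical backup job and the Spilo pods pick up credentials this way, which is why a wal-g-side regression can break basebackups while pg_dump uploads keep working.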