
WAL-G backup broken since 1.13.0, works in 1.12.2 #2747

Open
olivier-derom opened this issue Sep 2, 2024 · 8 comments
Labels: bug, spilo (Issue more related to Spilo)

Comments

@olivier-derom

Please answer some short questions, which should help us to understand your problem / question better:

  • Which image of the operator are you using? v1.13.0
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? EKS
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug report

We create logical backups (pg_dump) and WAL-G+basebackups for our clusters.
We use a k8s service account which is bound to an IAM role for S3 access.
postgres-operator is deployed using helm.

When running operator version 1.12.2 (and Spilo 16:3.2-p3), both the logical backup cronjob and the WAL-G + basebackups work as intended.
I validate the WAL backup using:
PGUSER=postgres
envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"

When I update the postgres-operator to 1.13.0 (and Spilo to 16:3.3-p1), the logical backups still work, but the WAL + basebackups no longer do.
When I manually try to create a basebackup with the same command, I get this error:

create S3 storage: create new AWS session: configure session: assume role by ARN: InvalidParameter: 1 validation error(s) found.
- minimum field size of 2, AssumeRoleInput.RoleSessionName.

The error seems specific to using a service account that assumes an IAM role to access S3, and it only occurs when running the basebackup.
Logical backups are able to put the pg_dump on S3 via the same authentication method.

No values were changed other than the Spilo image and the Helm chart version.

Let me know if you need additional information.

@FxKu
Member

FxKu commented Sep 3, 2024

Oh no! This doesn't sound nice. Can you share some snippets of your operator configuration and service account so we can try to replicate it? Our setup is not that different, but our backups continue to run.

Quite a few things have changed and in your case may require a different configuration. Spilo has some configuration options, but they are likely not manageable yet through the operator.

FxKu added the bug and spilo (Issue more related to Spilo) labels on Sep 3, 2024
@olivier-derom
Author

@FxKu
Sure! Here are some snippets:

ServiceAccount YAML (manually deployed as an additional resource, not part of the Zalando postgres-operator Helm chart):
apiVersion: v1
automountServiceAccountToken: true
imagePullSecrets:
- name: mysecret
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::0123456789:role/my-IAM-role-w-S3-access
  labels:
    app.kubernetes.io/instance: postgres-operator
  name: postgres-operator
  namespace: dataplatform-prod
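For context, the eks.amazonaws.com/role-arn annotation above makes the EKS pod identity webhook inject web-identity credentials into pods that use this service account; a rough sketch of the injected environment, assuming the default projected token path:
  env:
  # injected by the EKS pod identity webhook (sketch; default token path assumed)
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::0123456789:role/my-IAM-role-w-S3-access
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token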
postgres-operator Helm values:
postgres-operator:
  image:
    registry: privaterepo.com
    repository: zalando/postgres-operator
  imagePullSecrets:
    - name: mysecret

  configGeneral:
    # Spilo docker image, update manually when updating the operator
    # Tag and registry are not split, so we must update this manually and cannot rely on Helm default values
    docker_image: privaterepo.com/zalando/spilo-16:3.2-p3

  configKubernetes:
    app.kubernetes.io/managed-by: postgres-operator
    enable_secrets_deletion: true
    watched_namespace: dataplatform-dev
    cluster_labels:
      application: spilo
    cluster_name_label: cluster-name
    pod_environment_configmap: postgres-pod-config

    enable_pod_antiaffinity: true
    enable_readiness_probe: true

  configPostgresPodResources:
    default_cpu_limit: "1"
    default_cpu_request: 100m
    default_memory_limit: 500Mi
    default_memory_request: 100Mi
    min_cpu_limit: 250m
    min_memory_limit: 250Mi

  configDebug:
    debug_logging: true
    enable_database_access: true

  configAwsOrGcp:
    AWS_REGION: eu-west-1
    WAL_S3_BUCKET: mybucket/postgres-operator/WAL

  configLogicalBackup:
    # prefix for the backup job name
    logical_backup_job_prefix: "logical-backup-"
    logical_backup_provider: "s3"
    logical_backup_s3_region: "eu-west-1"
    logical_backup_s3_sse: "AES256"
    logical_backup_cronjob_environment_secret: ""
    # S3 retention time for stored backups for example "2 week" or "7 days"
    # recommended to also put S3 lifecycle policy on the bucket
    logical_backup_s3_retention_time: ""
    logical_backup_schedule: "30 00 * * *" # daily at 00.30 AM
    # Image for pods of the logical backup job (default pg_dumpall), update manually when updating the operator
    # Tag and registry are not split, so we must update this manually and cannot rely on Helm default values
    logical_backup_docker_image: privaterepo.com/zalando/postgres-operator/logical-backup:v1.12.2
    logical_backup_s3_bucket: mybucket/postgres-operator/logical-backups

  serviceAccount:
    create: false
    # The name of the ServiceAccount to use.
    name: postgres-operator

  podServiceAccount:
    name: postgres-operator
Postgres cluster YAML:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: nessie-metastore
spec:
  postgresql:
    version: "16"
  teamId: "dataplatform"
  volume:
    size: 10Gi
  numberOfInstances: 1
  users:
    nessie:
      - superuser
      - createdb
    nessiegc:
      - superuser
      - createdb
  databases:
    nessiegc: nessiegc
    metastore: nessie

  enableLogicalBackup: true

  env:
  - name: AWS_REGION
    value: eu-west-1
  - name: WAL_S3_BUCKET
    value: mybucket/postgres-operator/WAL
  - name: USE_WALG_BACKUP
    value: "true"
  - name: USE_WALG_RESTORE
    value: "true"
  - name: BACKUP_SCHEDULE
    value: "00 * * * *"
  - name: BACKUP_NUM_TO_RETAIN
    value: "96" # For 1 backup per hour, keep 4 days of base backups

These are the config files for v1.12.2, but as stated earlier, the only things I then changed were the Helm chart version plus the manually updated Spilo and logical backup images, since we use a private repo pull-through cache.

Hope this can help!

@FxKu
Member

FxKu commented Sep 4, 2024

What if you change the Docker image back to the previous one, ghcr.io/zalando/spilo-16:3.2-p3? Does v1.13.0 continue to work then?

@olivier-derom
Author

@FxKu I can confirm that the issue lies with the Spilo image: using chart 1.13.0 but keeping Spilo on 3.2-p3, WAL archiving works correctly.

@nrobert13

Sorry for chiming in; my question is not directly related, but the snippets above are very useful for it. What is the reason for providing the WAL S3 bucket multiple times, once in the operator config (configAwsOrGcp.WAL_S3_BUCKET) and once in the postgresql resource env (WAL_S3_BUCKET)? There would be a third way via pod_environment_secret/configmap as well. I tried providing it only in the pod_environment_secret to keep all the archiving-related S3 config in one place, but /run/etc/wal-e.d/env is missing, so I assume the backup is not working either.

@olivier-derom
Author

olivier-derom commented Sep 25, 2024

In this dummy example there is indeed no real benefit to defining it twice, since both values are the same. The reason I define it twice is the order of priority: all our Postgres clusters use the S3 path provided in the operator config as a default, but for some clusters (e.g. ones you want to share to create standby replicas) we want to override that S3 location with another bucket or path.
This way, if for some reason we want to change the default S3 path, it can be done in a single place, while we still keep a way to override that default per cluster; see the sketch below.
Not sure why your /run/etc/wal-e.d/env is missing.
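A minimal sketch of that override pattern, assuming a hypothetical per-cluster bucket path (the operator-wide default stays in configAwsOrGcp.WAL_S3_BUCKET):
  env:
  # per-cluster override; this hypothetical path takes priority over the operator-wide default
  - name: WAL_S3_BUCKET
    value: mybucket/postgres-operator/WAL-standby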

@nrobert13

Thanks for the quick reply. It makes sense with the override; I'm still not sure why it must be set in the operator config, though. That seems to be the culprit behind the missing /run/etc/wal-e.d/env.

@moss2k13

moss2k13 commented Oct 11, 2024

It is caused by WAL-G changes: wal-g/wal-g#1377
The last working WAL-G version is https://github.com/wal-g/wal-g/releases/tag/v2.0.1
The last working postgres-operator version is indeed https://github.com/zalando/postgres-operator/releases/tag/v1.12.2
The last working Spilo image version is https://github.com/zalando/spilo/releases/tag/3.2-p3

I'm still investigating the WAL-G bug:
it now expects both AWS_ROLE_ARN and AWS_ROLE_SESSION_NAME to be provided, but at the same time it doesn't allow an IAM IRSA session name containing ':' (a possible workaround sketch follows the transcript below):

root@temporal-postgresql-0:/home/postgres# wal-g --version
wal-g version v3.0.3	3f88f3c	2024.08.08_17:53:40	PostgreSQL


root@temporal-postgresql-0:/home/postgres# wal-g-v2.0.1 --version
wal-g version v2.0.1	b7d53dd	2022.08.25_09:34:20	PostgreSQL


root@temporal-postgresql-0:/home/postgres# export AWS_ROLE_SESSION_NAME=system:serviceaccount:automation-service:postgres-pod-sa


root@temporal-postgresql-0:/home/postgres# echo $AWS_ROLE_ARN
arn:aws:iam::111111111111:role/postgres-backup-role


root@temporal-postgresql-0:/home/postgres# envdir /run/etc/wal-e.d/env/ wal-g backup-list
ERROR: 2024/10/11 15:52:43.441470 configure primary storage: configure storage with prefix "s3://postgres-backup/spilo/temporal-postgresql/12075954-67d5-4764-a7ea-df5925ca27fc/wal/15": create S3 storage: create new AWS session: configure session: assume role by ARN: WebIdentityErr: failed to retrieve credentials
caused by: ValidationError: 1 validation error detected: Value 'system:serviceaccount:automation-service:postgres-pod-sa' at 'roleSessionName' failed to satisfy constraint: Member must satisfy regular expression pattern: [\w+=,.@-]*
	status code: 400, request id: ce6c3656-6228-4d5d-94a4-9ea1670d1cf6


root@temporal-postgresql-0:/home/postgres# unset AWS_ROLE_SESSION_NAME


root@temporal-postgresql-0:/home/postgres# envdir /run/etc/wal-e.d/env/ wal-g-v2.0.1 backup-list
name                          modified             wal_segment_backup_start
base_000000010000000000000004 2024-09-20T11:34:19Z 000000010000000000000004
base_000000010000000000000006 2024-09-20T12:00:03Z 000000010000000000000006
base_00000001000000000000001F 2024-09-21T00:00:03Z 00000001000000000000001F
base_000000010000000000000038 2024-09-21T12:00:03Z 000000010000000000000038
base_000000010000000000000051 2024-09-22T00:00:03Z 000000010000000000000051
base_00000001000000000000006A 2024-09-22T12:00:03Z 00000001000000000000006A
base_000000010000000000000083 2024-09-23T00:00:03Z 000000010000000000000083
base_00000001000000000000009C 2024-09-23T12:00:03Z 00000001000000000000009C
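A possible workaround sketch, not verified in this thread: explicitly set AWS_ROLE_SESSION_NAME to a value without colons so it satisfies the [\w+=,.@-]* pattern, for example via the cluster manifest's env section; the session name below is illustrative only.
  env:
  # hypothetical colon-free session name matching [\w+=,.@-]*
  - name: AWS_ROLE_SESSION_NAME
    value: postgres-pod-sa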
