This page contains general Postgres troubleshooting topics.
- The
patronictl
tool - Database unavailable
- Database disk full
- Replication lagging
- Postgres status
SyncFailed
- Cluster member missing
- Postgres leader missing
The patronictl
tool is used to call a REST API that interacts with Postgres databases. It handles a variety of tasks, such as listing cluster members and the replication
status, configuring and restarting databases, and more.
(ncn-mw#
) The tool is installed in the database containers:
kubectl exec -it -n services keycloak-postgres-0 -c postgres -- su postgres
(postgres-container#
) Use the following command for more information on the patronictl
command:
patronictl --help
If there are no endpoints for the main service, then Patroni will mark the database as unavailable.
(ncn-mw#
) The following is an example for keycloak-postgres
where no endpoints are listed, which means the database is unavailable.
kubectl get endpoints keycloak-postgres -n services
Example output:
NAME ENDPOINTS AGE
keycloak-postgres <none> 3d22h
If the database is unavailable, then check if Database disk full is the cause of the issue.
(ncn-mw#
) Otherwise, check the postgres-operator
logs for errors.
kubectl logs -l app.kubernetes.io/name=postgres-operator -n services
(ncn-mw#
) The following is an example for keycloak-postgres
. One cluster member is failing to start because of a full pgdata
disk. This was likely due to
replication issues, which caused the pg_wal
files to grow.
POSTGRESQL=keycloak-postgres
NAMESPACE=services
kubectl exec "${POSTGRESQL}-1" -c postgres -it -n ${NAMESPACE} -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+--------------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+--------+--------------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.0.11 | | start failed | | unknown |
| keycloak-postgres | keycloak-postgres-1 | 10.44.0.7 | | running | 4 | 0 |
| keycloak-postgres | keycloak-postgres-2 | 10.36.0.40 | Leader | running | 4 | |
+-------------------+---------------------+------------+--------+--------------+----+-----------+
for i in {0..2}; do echo "${POSTGRESQL}-${i}:"; kubectl exec "${POSTGRESQL}-${i}" -n ${NAMESPACE} -c postgres -- df -h pgdata; done
Example output:
keycloak-postgres-0:
Filesystem Size Used Avail Use% Mounted on
/dev/sde 976M 960M 0 100% /home/postgres/pgdata
keycloak-postgres-1:
Filesystem Size Used Avail Use% Mounted on
/dev/rbd12 976M 152M 809M 16% /home/postgres/pgdata
keycloak-postgres-2:
Filesystem Size Used Avail Use% Mounted on
/dev/rbd3 976M 136M 825M 15% /home/postgres/pgdata
kubectl logs "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres | grep FATAL
Example output:
2021-07-14 17:52:48 UTC [30495]: [1-1] 60ef2470.771f 0 FATAL: could not write lock file "postmaster.pid": No space left on device
kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- du -h --max-depth 1 /home/postgres/pgdata/pgroot/data/pg_wal
To recover the cluster member that had failed to start because of disk pressure, attempt to reclaim some space on the pgdata
disk.
kubectl exec
into that pod, copy the logs off (optional), and then clear the logs in order to recover some disk space. Finally, restart the Postgres cluster and postgres-operator
.
-
(
ncn-mw#
) Copy off the logs (optional).kubectl cp "${POSTGRESQL}-1":/home/postgres/pgdata/pgroot/pg_log /tmp -c postgres -n ${NAMESPACE}
-
Clear the logs.
-
(
ncn-mw#
) Open a shell in the database container.kubectl exec "${POSTGRESQL}-1" -n ${NAMESPACE} -c postgres -it -- bash
-
(
postgres-container#
) Clear the logs and exit the container.for i in {0..7}; do > /home/postgres/pgdata/pgroot/pg_log/postgresql-$i.csv; done exit
-
-
(
ncn-mw#
) Restart the pods by deleting them.kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE}
-
(
ncn-mw#
) Restart the operator by deleting it.kubectl delete pod -l app.kubernetes.io/name=postgres-operator -n services
If disk issues persist or exist on multiple nodes and the above does not resolve the issue, then see the Recover from Postgres WAL Event procedure.
Postgres replication lag can be detected with Prometheus alerts and alert notifications (See
Configure Prometheus Email Alert Notifications). If replication lag is not caught early, it
can cause the disk mounted on /home/postgres/pgdata
to fill up and the database to stop running. If this issue is caught before the database stops, it can be easily remediated
using a patronictl
command to reinitialize the lagging cluster member.
When services have a Postgres cluster of pods, they need to be able to replicate data between them. When the pods are not able to replicate data, the database will become full.
The patronictl list
command will show the status of replication.
(ncn-mw#
) The following is an example where replication is working:
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+------------+--------+---------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.40.0.23 | Leader | running | 1 | |
| keycloak-postgres | keycloak-postgres-1 | 10.42.0.25 | | running | 1 | 0 |
| keycloak-postgres | keycloak-postgres-2 | 10.42.0.29 | | running | 1 | 0 |
+-------------------+---------------------+------------+--------+---------+----+-----------+
The following is an example where replication is broken:
+-------------------+---------------------+--------------+--------+----------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+-------------------+---------------------+--------------+--------+----------+----+-----------+
| keycloak-postgres | keycloak-postgres-0 | 10.42.10.22 | | starting | | unknown |
| keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 | |
| keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 | | running | 14 | 608 |
+-------------------+---------------------+--------------+--------+----------+----+-----------+
In the event that a state of broken Postgres replication persists and the space allocated for the WAL files fills up, the affected database will likely shut down and create a
state where it can be very difficult to recover. This can impact the reliability of the related service and may require that it be redeployed with data repopulation procedures.
If replication lag is caught and remediated before the database shuts down, replication can be recovered using patronictl reinit
.
A reinitialize will get the lagging replica member re-synced and replicating again. This should be done as soon as replication lag is detected. In the preceding example,
keycloak-postgres-0
and keycloak-postgres-2
were not replicating properly (unknown
or non-zero lag). To remediate, kubectl exec
into the leader pod and use
patronictl reinit <cluster> <lagging cluster member>
to reinitialize the lagging member(s).
For example:
-
(
ncn-mw#
) Open a shell into the Postgres container.kubectl exec keycloak-postgres-1 -c postgres -n services -it -- bash
-
(
postgres-container#
) Reinitialize the first lagging replica member.patronictl reinit keycloak-postgres keycloak-postgres-0
Example output:
Are you sure you want to reinitialize members keycloak-postgres-0? [y/N]: y Failed: reinitialize for member keycloak-postgres-0, status code=503, (restarting after failure already in progress) Do you want to cancel it and reinitialize anyway? [y/N]: y Success: reinitialize for member keycloak-postgres-0
-
(
postgres-container#
) Reinitialize the next lagging replica member.patronictl reinit keycloak-postgres keycloak-postgres-2
Example output:
Are you sure you want to reinitialize members keycloak-postgres-2? [y/N]: y Failed: reinitialize for member keycloak-postgres-2, status code=503, (restarting after failure already in progress) Do you want to cancel it and reinitialize anyway? [y/N]: y Success: reinitialize for member keycloak-postgres-2
-
(
ncn-mw#
) Verify that replication has recovered.kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+-------------------+---------------------+--------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-------------------+---------------------+--------------+--------+---------+----+-----------+ | keycloak-postgres | keycloak-postgres-0 | 10.42.10.22 | | running | 47 | 0 | | keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 | | | keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 | | running | 47 | 0 | +-------------------+---------------------+--------------+--------+---------+----+-----------+
-
If
patronictl reinit
fails withFailed: reinitialize for member
...status code=503, (Cluster has no leader, can not reinitialize)
:(
ncn-mw#
) For example:kubectl exec -it cray-console-data-postgres-0 -c postgres -n services -- patronictl reinit cray-console-data-postgres cray-console-data-postgres-1
Example output:
+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +------------------------------+------------+--------+---------+----+-----------+ | cray-console-data-postgres-0 | 10.39.0.74 | Leader | running | 1 | | | cray-console-data-postgres-1 | 10.36.0.37 | | running | 1 | 16 | | cray-console-data-postgres-2 | 10.32.0.1 | | running | 1 | 16 | +------------------------------+------------+--------+---------+----+-----------+ Are you sure you want to reinitialize members cray-console-data-postgres-1? [y/N]: y Failed: reinitialize for member cray-console-data-postgres-1, status code=503, (Cluster has no leader, can not reinitialize)
-
Delete the Postgres leader pod and wait for the leader to restart.
-
(
ncn-mw#
) Delete the leader pod.kubectl delete pod cray-console-data-postgres-0 -n services
-
(
ncn-mw#
) Wait for the leader to restart.Re-run the following command until it succeeds and reports that the leader pod is
running
.kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +------------------------------+------------+--------+---------+----+-----------+ | cray-console-data-postgres-0 | 10.39.0.80 | Leader | running | 2 | | | cray-console-data-postgres-1 | 10.36.0.37 | | running | 1 | 49 | | cray-console-data-postgres-2 | 10.32.0.1 | | running | 1 | 49 | +------------------------------+------------+--------+---------+----+-----------+
-
-
Re-run Recover replication to
reinit
any lagging members. -
If the
reinit
still fails, then delete member pods that are still reporting lag. This should clear up any remaining lag.-
(
ncn-mw#
) Determine which pods are reporting lag.kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +------------------------------+------------+--------+---------+----+-----------+ | cray-console-data-postgres-0 | 10.39.0.80 | Leader | running | 2 | | | cray-console-data-postgres-1 | 10.36.0.37 | | running | 1 | 49 | | cray-console-data-postgres-2 | 10.32.0.1 | | running | 1 | 49 | +------------------------------+------------+--------+---------+----+-----------+
-
(
ncn-mw#
) Delete the pods that are still reporting lag.kubectl delete pod cray-console-data-postgres-1 cray-console-data-postgres-2 -n services
-
(
ncn-mw#
) Once the pods restart, verify that the lag has resolved.kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+ Cluster: cray-console-data-postgres (7072784871993835594) ---+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +------------------------------+------------+--------+---------+----+-----------+ | cray-console-data-postgres-0 | 10.39.0.80 | Leader | running | 2 | | | cray-console-data-postgres-1 | 10.36.0.37 | | running | 2 | 0 | | cray-console-data-postgres-2 | 10.32.0.1 | | running | 2 | 0 | +------------------------------+------------+--------+---------+----+-----------+
-
-
-
If a cluster member is
stopped
after a successful reinitialization, check forpg_internal.init.*
files that may need to be cleaned up. This can occur if thepgdata
disk was full prior to the reinitialization, leaving truncatedpg_internal.init.*
files in thepgdata
directory.-
(
ncn-mw#
) Determine if any pods arestopped
.kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+-------------------+---------------------+--------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-------------------+---------------------+--------------+--------+---------+----+-----------+ | keycloak-postgres | keycloak-postgres-0 | 10.42.10.22 | | running | 47 | 0 | | keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 | | | keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 | | stopped | | unknown | +-------------------+---------------------+--------------+--------+---------+----+-----------+
-
Check the most recent Postgres log in the
stopped
pod.-
(
ncn-mw#
) Open a shell into that pod that isstopped
.kubectl exec keycloak-postgres-2 -c postgres -n services -it -- bash
-
(
postgres-container#
) Check the most recent Postgres log for anyinvalid segment number 0
errors relating topg_internal.init.*
files.LOG=`ls -t /home/postgres/pgdata/pgroot/pg_log/*.csv | head -1` grep pg_internal.init "$LOG" | grep "invalid segment number 0" | tail -1
Example output:
2022-02-01 16:59:35.529 UTC,"standby","",227600,"127.0.0.1:42264",61f966f7.37910,3,"sending backup ""pg_basebackup base backup""",2022-02-01 16:59:35 UTC,7/0,0,ERROR,XX000,"invalid segment number 0 in file ""pg_internal.init.2239188""",,,,,,,,,"pg_basebackup"
-
-
(
postgres-container#
) If the check in the previous step finds such files, then first find any zero lengthpg_internal.init.*
files.This command should be run inside the
kubectl exec
session from the previous steps.find /home/postgres/pgdata -name pg_internal.init.* -size 0
Example output:
./pgroot/data/base/16622/pg_internal.init.2239004 ... ./pgroot/data/base/16622/pg_internal.init.2239010
-
(
postgres-container#
) Delete the zero lengthpg_internal.init.*
files.This command should be run inside the
kubectl exec
session from the previous steps. Double check the syntax of the command in this step before executing it.find /home/postgres/pgdata -name pg_internal.init.* -size 0 -exec rm {} \;
-
(
postgres-container#
) Find any non-zero lengthpg_internal.init.*
files that were truncated when the file system filled up.This command should be run inside the
kubectl exec
session from the previous step.grep pg_internal.init $LOG | grep "invalid segment number 0" | tail -1
Example output:
2022-02-01 16:59:35.529 UTC,"standby","",227600,"127.0.0.1:42264",61f966f7.37910,3,"sending backup ""pg_basebackup base backup""",2022-02-01 16:59:35 UTC,7/0,0,ERROR,XX000,"invalid segment number 0 in file ""pg_internal.init.2239188""",,,,,,,,,"pg_basebackup"
-
(
postgres-container#
) Locate the non-zero lengthpg_internal.init.*
file.This command should be run inside the
kubectl exec
session from the previous step.find ~/pgdata -name pg_internal.init.2239188
Example output:
/home/postgres/pgdata/pgroot/data/base/16622/pg_internal.init.2239188
-
(
postgres-container#
) Delete (or move to a different location) the non-zero lengthpg_internal.init.*
file.This command should be run inside the
kubectl exec
session from the previous step.rm -v /home/postgres/pgdata/pgroot/data/base/16622/pg_internal.init.2239188
-
Repeat the above steps to find, locate, and delete non-zero length
pg_internal.init.*
files until there are no more newinvalid segment number 0
messages. -
(
ncn-mw#
) Verify that the cluster member has started.kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
Example output:
+-------------------+---------------------+--------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-------------------+---------------------+--------------+--------+---------+----+-----------+ | keycloak-postgres | keycloak-postgres-0 | 10.42.10.22 | | running | 47 | 0 | | keycloak-postgres | keycloak-postgres-1 | 10.40.11.191 | Leader | running | 47 | | | keycloak-postgres | keycloak-postgres-2 | 10.40.11.190 | | running | 47 | 0 | +-------------------+---------------------+--------------+--------+---------+----+-----------+
-
Alerts exist in Prometheus for the following:
PostgresqlReplicationLagSMA
PostgresqlReplicationLagServices
PostgresqlFollowerReplicationLagSMA
PostgresqlFollowerReplicationLagServices
When alert notifications are configured, replication issues can be detected quickly. If the replication issue persists such that the database becomes unavailable, recovery will likely be much more involved. Catching such issues as soon as possible is desired. See Configure Prometheus Email Alert Notifications.
Check for any postgresql
resource that has a STATUS
of SyncFailed
. SyncFailed
generally means that there is something between the postgres-operator
and the Postgres
cluster that is out of sync. This does not always mean that the cluster is unhealthy. Check the postgres-operator
logs for messages in order to further determine the root
cause of the issue.
Other STATUS
values such as Updating
are a non-issue. It is expected that this will eventually change to Running
or possibly SyncFailed
, if the postgres-operator
encounters issues syncing updates to the postgresql
cluster.
-
(
ncn-mw#
) Check for anypostgresql
resource that has aSTATUS
ofSyncFailed
.kubectl get postgresql -A
Example output:
NAMESPACE NAME TEAM VERSION PODS VOLUME CPU-REQUEST MEMORY-REQUEST AGE STATUS services cray-console-data-postgres cray-console-data 11 3 2Gi 4h10m Running services cray-sls-postgres cray-sls 11 3 1Gi 4h12m SyncFailed services cray-smd-postgres cray-smd 11 3 30Gi 500m 8Gi 4h12m Updating services gitea-vcs-postgres gitea-vcs 11 3 50Gi 4h11m Running services keycloak-postgres keycloak 11 3 1Gi 4h13m Running spire spire-postgres spire 11 3 20Gi 1 4Gi 4h10m Running
-
(
ncn-mw#
) Find thepostgres-operator
pod name.kubectl get pods -l app.kubernetes.io/name=postgres-operator -n services
Example output:
NAME READY STATUS RESTARTS AGE cray-postgres-operator-6fffc48b4c-mqz7z 2/2 Running 0 5h26m
-
(
ncn-mw#
) Check the logs for thepostgres-operator
.kubectl logs cray-postgres-operator-6fffc48b4c-mqz7z -n services \ -c postgres-operator | grep -i sync | grep -i msg
msg="could not sync cluster: could not sync persistent volumes: could not sync volumes: could not resize EBS volumes: some persistent volumes are not compatible with existing resizing providers"
This generally means that the postgresql
resource was updated to change the volume size from the Postgres operator's perspective, but the additional step to resize the
actual PVCs was not done so the operator and the Postgres cluster are not able to sync the resize change. The cluster is still healthy, but to complete the resize of the
underlying Postgres PVCs, additional steps are needed.
The following example assumes that cray-smd-postgres
is in SyncFailed
and the volume size was recently increased to 100Gi
(possibly by editing the volume size of
postgresql
cray-smd-postgres
resource), but the pgdata-cray-smd-postgres
PVC's storage capacity was not updated to align with the change.
(ncn-mw#
) To confirm that this is the case:
kubectl get postgresql cray-smd-postgres -n services -o jsonpath="{.spec.volume.size}"
Example output:
100Gi
kubectl get pvc -n services -l application=spilo,cluster-name=cray-smd-postgres
Example output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
pgdata-cray-smd-postgres-0 Bound pvc-020cf339-e372-46ae-bc37-de2b55320e88 30Gi RWO k8s-block-replicated 70m
pgdata-cray-smd-postgres-1 Bound pvc-3d42598a-188e-4301-a58e-0f0ce3944c89 30Gi RWO k8s-block-replicated 27m
pgdata-cray-smd-postgres-2 Bound pvc-0d659080-7d39-409a-9ee5-1a1806971054 30Gi RWO k8s-block-replicated 27m
(ncn-mw#
) To resolve this SyncFailed
case, resize the pgdata
PVCs for the selected Postgres cluster. Create the following function in the shell and execute the function by
calling it with the appropriate arguments. For this example the pgdata-cray-smd-postgres
PVCs will be resized to 100Gi
to match that of the postgresql
cray-smd-postgres
volume size.
function resize-postgresql-pvc
{
POSTGRESQL=$1
PGDATA=$2
NAMESPACE=$3
PGRESIZE=$4
# Check for required arguments
if [ $# -ne 4 ]; then
echo "Illegal number of parameters ($#). Function requires exactly 4 arguments."
exit 2
fi
## Check that PGRESIZE matches current postgresql volume size
postgresql_volume_size=$(kubectl get postgresql "${POSTGRESQL}" -n "${NAMESPACE}" -o jsonpath="{.spec.volume.size}")
if [ "${postgresql_volume_size}" != "${PGRESIZE}" ]; then
echo "Invalid resize ${PGRESIZE}, expected ${postgresql_volume_size}"
exit 2
fi
## Scale the postgres cluster to 1 member
kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 1}]'
while [ $(kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -v NAME | wc -l) != 1 ] ; do
echo " waiting for pods to terminate"
sleep 2
done
## Delete the inactive PVCs, resize the active PVC, and wait for the resize to complete
kubectl delete pvc "${PGDATA}-1" "${PGDATA}-2" -n "${NAMESPACE}"
kubectl patch -p '{"spec": {"resources": {"requests": {"storage": "'${PGRESIZE}'"}}}}' "pvc/${PGDATA}-0" -n "${NAMESPACE}"
while [ -z '$(kubectl describe pvc "{PGDATA}-0" -n "${NAMESPACE}" | grep FileSystemResizeSuccessful' ] ; do
echo " waiting for PVC to resize"
sleep 2
done
## Scale the postgres cluster back to 3 members
kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
while [ $(kubectl get pods -l "application=spilo,cluster-name=${POSTGRESQL}" -n "${NAMESPACE}" | grep -v NAME | grep -c "Running") != 3 ] ; do
echo " waiting for pods to restart"
sleep 2
done
}
resize-postgresql-pvc cray-smd-postgres pgdata-cray-smd-postgres services 100Gi
In order to persist any Postgres PVC storage volume size changes, it is necessary that this change also be made to the customer-managed customizations.yaml
file.
See the Postgres PVC Resize information in the Post Install Customizations document.
msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries"
This generally means that some state in the Postgres operator is out of sync with that of the postgresql
cluster, resulting in database connection issues.
To resolve this SyncFailed
case, restarting the Postgres operator by deleting the pod may clear up the issue.
-
(
ncn-mw#
) Delete the pod.kubectl delete pod -l app.kubernetes.io/name=postgres-operator -n services
-
(
ncn-mw#
) Wait for thepostgres-operator
to restart.kubectl get pods -l app.kubernetes.io/name=postgres-operator -n services
Example output of restarted pod:
NAME READY STATUS RESTARTS AGE cray-postgres-operator-6fffc48b4c-mqz7z 2/2 Running 0 6m
If the database connection has been down for a long period of time and the SyncFailed
persists after the above steps, a restart of
the cluster and the postgres-operator
may be needed for the service to reconnect to the Postgres cluster. For example, if the
cray-gitea
service is not able to connect to the Postgres database and the connection has been failing for many hours, restart
the cluster and operator.
-
(
ncn-mw#
) Set necessary variables.Replace the values of these variables for the appropriate ones for the particular cluster being remediated.
CLIENT=gitea-vcs POSTGRESQL=gitea-vcs-postgres NAMESPACE=services
-
(
ncn-mw#
) Scale the service to 0.kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=0
-
(
ncn-mw#
) Restart the Postgres cluster and thepostgres-operator
.kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE} kubectl delete pods -n services -lapp.kubernetes.io/name=postgres-operator while [ $(kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE} -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]; do echo "waiting for ${POSTGRESQL} to start running"; sleep 2 done
-
(
ncn-mw#
) Scale the service back to 1 (for different services this may be to 3).kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=1
msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: pq: password authentication failed for user \<username\>"
This generally means that the password for the given user is not the same as that specified in the Kubernetes secret. This can occur if the postgresql
cluster was rebuilt and the data was restored, leaving the Kubernetes secrets out of sync with the Postgres cluster. To resolve this SyncFailed
case,
gather the username and password for the credential from Kubernetes, and update the database with these values. For example, if the user postgres
is
failing to authenticate between the cray-smd
services and the cray-smd-postgres
cluster, then get the password for the postgres
user from the
Kubernetes secret and update the password in the database.
-
(
ncn-mw#
) Set necessary variables.CLIENT=cray-smd POSTGRESQL=cray-smd-postgres NAMESPACE=services
-
(
ncn-mw#
) Scale the service to 0.kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=0 while [ $(kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name="${CLIENT}" | grep -v NAME | wc -l) != 0 ] ; do echo " waiting for pods to terminate" sleep 2 done
-
(
ncn-mw#
) Determine what secrets are associated with thepostgresql
credentials.kubectl get secrets -n ${NAMESPACE} | grep "${POSTGRESQL}.credentials"
Example output:
services hmsdsuser.cray-smd-postgres.credentials Opaque 2 31m services postgres.cray-smd-postgres.credentials Opaque 2 31m services service-account.cray-smd-postgres.credentials Opaque 2 31m services standby.cray-smd-postgres.credentials Opaque 2 31m
-
Gather the decoded username and password for the user that is failing to authenticate.
-
(
ncn-mw#
) Save the name of the secret with the failing authentication in a variable.Replace the secret name in this command with the secret determined in the previous step.
SECRET=postgres.cray-smd-postgres.credentials
-
(
ncn-mw#
) Decode the username.kubectl get secret ${SECRET} -n ${NAMESPACE} -ojsonpath='{.data.username}' | base64 -d
Example output:
postgres
-
(
ncn-mw#
) Decode the password.kubectl get secret ${SECRET} -n ${NAMESPACE} -ojsonpath='{.data.password}'| base64 -d
Example output:
ABCXYZ
-
-
Update the username and password in the database.
-
(
ncn-mw#
) Determine which pod is the leader.kubectl exec "${POSTGRESQL}-0" -n ${NAMESPACE} -c postgres -it -- patronictl list
Example output:
+-------------------+---------------------+------------+--------+---------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-------------------+---------------------+------------+--------+---------+----+-----------+ | cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 | Leader | running | 1 | | | cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 | | running | | 0 | | cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 | | running | | 0 | +-------------------+---------------------+------------+--------+---------+----+-----------+
-
Update the username and password in the database in the Postgres leader container.
-
(
ncn-mw#
) Open a shell into the leader container.POSTGRES_LEADER=cray-smd-postgres-0 kubectl exec ${POSTGRES_LEADER} -n ${NAMESPACE} -c postgres -it -- bash
-
(
postgres-container#
) Open Postgres./usr/bin/psql postgres postgres
-
(
postgres#
) Update the username and password in the database.ALTER USER postgres WITH PASSWORD 'ABCXYZ'; ALTER ROLE
-
-
-
(
ncn-mw#
) Restart thepostgresql
cluster.kubectl delete pod "${POSTGRESQL}-0" "${POSTGRESQL}-1" "${POSTGRESQL}-2" -n ${NAMESPACE} while [ $(kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE} -o json | jq -r '.status.PostgresClusterStatus') != "Running" ]; do echo "waiting for ${POSTGRESQL} to start running" sleep 2 done
-
(
ncn-mw#
) Scale the service back to 3.kubectl scale deployment ${CLIENT} -n ${NAMESPACE} --replicas=3 while [ $(kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name="${CLIENT}" | grep -v NAME | wc -l) != 3 ] ; do echo " waiting for pods to start running" sleep 2 done
Most services expect to maintain a Postgres cluster consisting of three pods for resiliency (SMA is one exception where only two pods are expected to exist).
(ncn-mw#
) For a given Postgres cluster, check how many pods are running.
POSTGRESQL=keycloak-postgres
NAMESPACE=services
kubectl get pods -A -l "application=spilo,cluster-name=${POSTGRESQL}"
If the number of Postgres pods for the given cluster is more or less than expected, increase or decrease as needed. This example will patch
the keycloak-postgres
cluster resource so that three pods are running.
-
(
ncn-mw#
) Set thePOSTGRESQL
andNAMESPACE
variables.POSTGRESQL=keycloak-postgres NAMESPACE=services
-
(
ncn-mw#
) Patch thekeycloak-postgres
cluster resource to ensure three pods are running.kubectl patch postgresql "${POSTGRESQL}" -n "${NAMESPACE}" --type='json' \ -p='[{"op" : "replace", "path":"/spec/numberOfInstances", "value" : 3}]'
-
(
ncn-mw#
) Confirm the number of cluster members, otherwise known as pods, by checking thepostgresql
resource.kubectl get postgresql ${POSTGRESQL} -n ${NAMESPACE}
Example output:
NAME TEAM VERSION PODS VOLUME CPU-REQUEST MEMORY-REQUEST AGE STATUS keycloak-postgres keycloak 11 3 10Gi 29m Running
-
If a pod is starting but remains in
Pending
,CrashLoopBackOff
,ImagePullBackOff
, or other non-Running
states, then describe the pod and get logs from the pod for further analysis.-
(
ncn-mw#
) Find the pod name.kubectl get pods -A -l "application=spilo,cluster-name=${POSTGRESQL}"
Example output:
NAMESPACE NAME READY STATUS RESTARTS AGE services keycloak-postgres-0 0/3 Pending 0 36m services keycloak-postgres-1 3/3 Running 0 35m services keycloak-postgres-2 3/3 Running 0 34m
-
(
ncn-mw#
) Describe the pod.kubectl describe pod "${POSTGRESQL}-0" -n ${NAMESPACE}
-
(
ncn-mw#
) View the pod logs.kubectl logs "${POSTGRESQL}-0" -c postgres -n ${NAMESPACE}
-
If a Postgres cluster no longer has a leader, the database will need to be recovered.
-
(
ncn-mw#
) Set thePOSTGRESQL
andNAMESPACE
variables.POSTGRESQL=cray-smd-postgres NAMESPACE=services
-
(
ncn-mw#
) Check if the leader is missing.kubectl exec ${POSTGRESQL}-0 -n ${NAMESPACE} -c postgres -- patronictl list
Example output:
+-------------------+---------------------+------------+------+--------------+----+-----------+ | Cluster | Member | Host | Role | State | TL | Lag in MB | +-------------------+---------------------+------------+------+--------------+----+-----------+ | cray-smd-postgres | cray-smd-postgres-0 | 10.42.0.25 | | running | | unknown | | cray-smd-postgres | cray-smd-postgres-1 | 10.44.0.34 | | start failed | | unknown | | cray-smd-postgres | cray-smd-postgres-2 | 10.36.0.44 | | start failed | | unknown | +-------------------+---------------------+------------+------+--------------+----+-----------+
If the output does not list a leader, then proceed to Recover from a missing Postgres leader.
-
Check if there is conflicting leader information.
It sometimes happen that the above check reports a leader, but other checks report no leader, or report conflicting leader information. The following steps show the status reported by each member of the cluster.
-
(
ncn-mw#
) Make a list of the Kubernetes pods of the cluster members.PODS=$(kubectl get pods -n ${NAMESPACE} --no-headers -o custom-columns=:.metadata.name | grep "^${POSTGRESQL}-[0-9]$") ; echo ${PODS}
Example output:
cray-smd-postgres-0 cray-smd-postgres-1 cray-smd-postgres-2
-
(
ncn-mw#
) Query each pod about the status of the cluster.This script reports the cluster status as perceived by each member of the cluster.
for POD in ${PODS} ; do echo "Checking ${POD}..." kubectl exec ${POD} -n ${NAMESPACE} -c postgres -- curl -s http://localhost:8008/cluster | jq ; echo done
Example output:
Checking cray-smd-postgres-0...
{ "members": [ { "name": "cray-smd-postgres-0", "role": "leader", "state": "running", "api_url": "http://10.32.0.33:8008/patroni", "host": "10.32.0.33", "port": 5432, "timeline": 1 }, { "name": "cray-smd-postgres-1", "role": "replica", "state": "running", "api_url": "http://10.44.0.30:8008/patroni", "host": "10.44.0.30", "port": 5432, "timeline": 1, "lag": 0 }, { "name": "cray-smd-postgres-2", "role": "replica", "state": "running", "api_url": "http://10.47.0.33:8008/patroni", "host": "10.47.0.33", "port": 5432, "timeline": 1, "lag": 0 } ] }
Checking cray-smd-postgres-1...
{ "members": [ { "name": "cray-smd-postgres-0", "role": "leader", "state": "running", "api_url": "http://10.32.0.33:8008/patroni", "host": "10.32.0.33", "port": 5432, "timeline": 1 }, { "name": "cray-smd-postgres-1", "role": "replica", "state": "running", "api_url": "http://10.44.0.30:8008/patroni", "host": "10.44.0.30", "port": 5432, "timeline": 1, "lag": 0 }, { "name": "cray-smd-postgres-2", "role": "replica", "state": "running", "api_url": "http://10.47.0.33:8008/patroni", "host": "10.47.0.33", "port": 5432, "timeline": 1, "lag": 0 } ] }
And so on for every member of the cluster. This script does not do any checking -- it only displays the information.
-
Check the output for errors or inconsistencies.
In particular, validate the following:
- Every cluster member reports exactly one leader.
- Every cluster member reports the same leader.
- Every cluster member reports the same states for each member.
If any of the above are not true, this indicates that the cluster members are no longer properly synchronized. In this case, attempt the Recover replication remediation procedures.
-
See the Recover from Postgres WAL Event procedure.