Description
We have the problem that all DB cluster replicas are constantly recreated by the operator after a certain amount of time. We figured out that the cause is a changing pod spec within the DB cluster StatefulSets:
time="2020-04-21T06:22:17Z" level=debug msg="spec diff between old and new statefulsets:
Template.Spec.Containers[0].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[0].TerminationMessagePolicy: \"File\" != \"\"
[!!!] Template.Spec.Containers[1].Name: \"postgres-exporter\" != \"filebeat\"
[!!!] Template.Spec.Containers[1].Image: \"our.registry.com/pg-exporter:latest-60eaf1c8\" != \"our.registry.com/filebeat:7.5.1-60eaf1c8\"
Template.Spec.Containers[1].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[1].TerminationMessagePolicy: \"File\" != \"\"
[!!!] Template.Spec.Containers[2].Name: \"filebeat\" != \"postgres-exporter\"
[!!!] Template.Spec.Containers[2].Image: \"our.registry.com/filebeat:7.5.1-60eaf1c8\" != \"our.registry.com/pg-exporter:latest-60eaf1c8\"
Template.Spec.Containers[2].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[2].TerminationMessagePolicy: \"File\" != \"\"
Template.Spec.RestartPolicy: \"Always\" != \"\"
Template.Spec.DNSPolicy: \"ClusterFirst\" != \"\"
Template.Spec.DeprecatedServiceAccount: \"postgres-pod\" != \"\"
Template.Spec.SchedulerName: \"default-scheduler\" != \"\"
Template.Spec.Tolerations: []v1.Toleration(nil) != []v1.Toleration{}
VolumeClaimTemplates[0].Status.Phase: \"Pending\" != \"\"
RevisionHistoryLimit: &int32(10) != nil
" cluster-name=postgres-sandbox/acid-minimal-cluster pkg=cluster worker=1
In concrete terms, the order of the sidecars we configure globally via `sidecar_docker_images` (in our case filebeat + postgres-exporter) keeps changing within the pod spec.
We spent some time analyzing https://github.com/zalando/postgres-operator/blob/master/pkg/cluster/k8sres.go, and our assumption is that the `Sidecars` map (filled from `sidecar_docker_images` in the OperatorConfiguration CR) is the problem here:

```go
// pkg/util/config/config.go, line 114 (commit a1f2bd0)
Sidecars map[string]string `name:"sidecar_docker_images"`
```

We are no Go experts, but we figured out that the merging of global and cluster-specific sidecars by the function mergeSidecars() happens in a random order. Because we have no cluster-specific sidecars configured in our cluster manifests, we could pin this behavior down to the for loop that iterates over the global sidecars (OpConfig.Sidecars):

```go
// pkg/cluster/k8sres.go, line 1236 (commit 3c91bde)
for name, dockerImage := range c.OpConfig.Sidecars {
```
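To illustrate the non-determinism, here is a minimal standalone sketch (the map literal just mirrors our two globally configured sidecars; it is not operator code):

```go
package main

import "fmt"

func main() {
	// Stand-in for OpConfig.Sidecars as filled from sidecar_docker_images.
	sidecars := map[string]string{
		"filebeat":          "our.registry.com/filebeat:7.5.1-60eaf1c8",
		"postgres-exporter": "our.registry.com/pg-exporter:latest-60eaf1c8",
	}
	// Go intentionally randomizes map iteration order, so repeated runs
	// (or repeated sync loops in the operator) can emit the two
	// containers in a different order.
	for name, image := range sidecars {
		fmt.Println(name, image)
	}
}
```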
We have temporarily hot-fixed the issue by expanding the function mergeSidecars()

```go
// pkg/cluster/k8sres.go, line 1220 (commit 3c91bde)
func (c *Cluster) mergeSidecars(sidecars []acidv1.Sidecar) []acidv1.Sidecar {
```

so that it sorts the result before returning it. If interested, I can share the fix.
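The idea behind our hot fix, as a minimal standalone sketch (the `Sidecar` struct and `sortedSidecars` helper are stand-ins for illustration, not the actual patch against acidv1.Sidecar):

```go
package main

import (
	"fmt"
	"sort"
)

// Sidecar stands in for acidv1.Sidecar; only Name matters for ordering.
type Sidecar struct {
	Name        string
	DockerImage string
}

// sortedSidecars sorts the merged sidecar list by name, so a function
// like mergeSidecars() would return a stable, deterministic order.
func sortedSidecars(result []Sidecar) []Sidecar {
	sort.Slice(result, func(i, j int) bool {
		return result[i].Name < result[j].Name
	})
	return result
}

func main() {
	merged := []Sidecar{
		{Name: "postgres-exporter", DockerImage: "our.registry.com/pg-exporter:latest-60eaf1c8"},
		{Name: "filebeat", DockerImage: "our.registry.com/filebeat:7.5.1-60eaf1c8"},
	}
	for _, s := range sortedSidecars(merged) {
		fmt.Println(s.Name, s.DockerImage)
	}
}
```

With a deterministic order, the generated pod spec no longer flip-flops between sync loops.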
Is our assumption correct? What would be a more elegant solution here?