
Permanent recreation of db clusters due to changing sidecar order #924

@siku4

Description


We have the problem that all db cluster replicas are repeatedly recreated by the operator after a certain amount of time. We have traced this back to a changing pod spec within the db cluster statefulsets:

time="2020-04-21T06:22:17Z" level=debug msg="spec diff between old and new statefulsets: 
Template.Spec.Containers[0].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[0].TerminationMessagePolicy: \"File\" != \"\"
[!!!] Template.Spec.Containers[1].Name: \"postgres-exporter\" != \"filebeat\"
[!!!] Template.Spec.Containers[1].Image: \"our.registry.com/pg-exporter:latest-60eaf1c8\" != \"our.registry.com/filebeat:7.5.1-60eaf1c8\"
Template.Spec.Containers[1].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[1].TerminationMessagePolicy: \"File\" != \"\"
[!!!] Template.Spec.Containers[2].Name: \"filebeat\" != \"postgres-exporter\"
[!!!] Template.Spec.Containers[2].Image: \"our.registry.com/filebeat:7.5.1-60eaf1c8\" != \"our.registry.com/pg-exporter:latest-60eaf1c8\"
Template.Spec.Containers[2].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[2].TerminationMessagePolicy: \"File\" != \"\"
Template.Spec.RestartPolicy: \"Always\" != \"\"
Template.Spec.DNSPolicy: \"ClusterFirst\" != \"\"
Template.Spec.DeprecatedServiceAccount: \"postgres-pod\" != \"\"
Template.Spec.SchedulerName: \"default-scheduler\" != \"\"
Template.Spec.Tolerations: []v1.Toleration(nil) != []v1.Toleration{}
VolumeClaimTemplates[0].Status.Phase: \"Pending\" != \"\"
RevisionHistoryLimit: &int32(10) != nil
" cluster-name=postgres-sandbox/acid-minimal-cluster pkg=cluster worker=1

Concretely, the order of the sidecars configured globally via sidecar_docker_images (in our case filebeat and postgres-exporter) keeps changing within the pod spec.

We spent some time analyzing https://github.com/zalando/postgres-operator/blob/master/pkg/cluster/k8sres.go and our assumption is that the map behind sidecar_docker_images in the operatorconfiguration CR is the problem here:

Sidecars map[string]string `name:"sidecar_docker_images"`

We are no Go experts, but we figured out that the merging of global and cluster-specific sidecars by the function mergeSidecars() happens in a non-deterministic order. Since we have no cluster-specific sidecars configured in our cluster manifests, we could narrow the behavior down to the for loop that iterates over the global sidecars (OpConfig.Sidecars):

for name, dockerImage := range c.OpConfig.Sidecars {

To our knowledge, the iteration order over a Go map is not guaranteed to be reproducible.
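
To illustrate what we mean, here is a minimal, self-contained sketch (not the operator's actual code) showing why ranging over a map directly can yield a different container order on every sync, and how sorting the keys first makes the order reproducible. The map contents just mirror our two sidecars:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Global sidecar config analogous to OpConfig.Sidecars (name -> image).
	sidecars := map[string]string{
		"filebeat":          "our.registry.com/filebeat:7.5.1-60eaf1c8",
		"postgres-exporter": "our.registry.com/pg-exporter:latest-60eaf1c8",
	}

	// Ranging over the map directly may produce a different order on each run.
	for name, image := range sidecars {
		fmt.Println("unordered:", name, image)
	}

	// Sorting the keys first makes the iteration order deterministic.
	names := make([]string, 0, len(sidecars))
	for name := range sidecars {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println("ordered:", name, sidecars[name])
	}
}
```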

We have temporarily hot-fixed the issue by extending the function mergeSidecars()

func (c *Cluster) mergeSidecars(sidecars []acidv1.Sidecar) []acidv1.Sidecar {

so that it sorts the sidecar objects in result alphabetically before returning them. If interested, we can share the fix; a sketch of the idea follows below.
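
The following is only a simplified, self-contained sketch of that sort, not the actual patch; the Sidecar struct here is a stand-in for acidv1.Sidecar with just the fields needed for the example:

```go
package main

import (
	"fmt"
	"sort"
)

// Sidecar is a minimal stand-in for acidv1.Sidecar; the real type has more fields.
type Sidecar struct {
	Name        string
	DockerImage string
}

// sortSidecarsByName mirrors the idea of our hot fix: sort the merged sidecars
// alphabetically by name before returning them, so the container order in the
// generated pod spec stays stable across syncs.
func sortSidecarsByName(result []Sidecar) []Sidecar {
	sort.Slice(result, func(i, j int) bool {
		return result[i].Name < result[j].Name
	})
	return result
}

func main() {
	merged := []Sidecar{
		{Name: "postgres-exporter", DockerImage: "our.registry.com/pg-exporter:latest-60eaf1c8"},
		{Name: "filebeat", DockerImage: "our.registry.com/filebeat:7.5.1-60eaf1c8"},
	}
	for _, s := range sortSidecarsByName(merged) {
		fmt.Println(s.Name, s.DockerImage)
	}
}
```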

Is our assumption correct? What would be a more elegant solution here?
