Description
We have the problem that all DB cluster replicas are constantly recreated by the operator after a certain amount of time. We figured out that the cause is a changing pod spec within the DB cluster StatefulSets:
time="2020-04-21T06:22:17Z" level=debug msg="spec diff between old and new statefulsets:
Template.Spec.Containers[0].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[0].TerminationMessagePolicy: \"File\" != \"\"
[!!!] Template.Spec.Containers[1].Name: \"postgres-exporter\" != \"filebeat\"
[!!!] Template.Spec.Containers[1].Image: \"our.registry.com/pg-exporter:latest-60eaf1c8\" != \"our.registry.com/filebeat:7.5.1-60eaf1c8\"
Template.Spec.Containers[1].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[1].TerminationMessagePolicy: \"File\" != \"\"
[!!!] Template.Spec.Containers[2].Name: \"filebeat\" != \"postgres-exporter\"
[!!!] Template.Spec.Containers[2].Image: \"our.registry.com/filebeat:7.5.1-60eaf1c8\" != \"our.registry.com/pg-exporter:latest-60eaf1c8\"
Template.Spec.Containers[2].TerminationMessagePath: \"/dev/termination-log\" != \"\"
Template.Spec.Containers[2].TerminationMessagePolicy: \"File\" != \"\"
Template.Spec.RestartPolicy: \"Always\" != \"\"
Template.Spec.DNSPolicy: \"ClusterFirst\" != \"\"
Template.Spec.DeprecatedServiceAccount: \"postgres-pod\" != \"\"
Template.Spec.SchedulerName: \"default-scheduler\" != \"\"
Template.Spec.Tolerations: []v1.Toleration(nil) != []v1.Toleration{}
VolumeClaimTemplates[0].Status.Phase: \"Pending\" != \"\"
RevisionHistoryLimit: &int32(10) != nil
" cluster-name=postgres-sandbox/acid-minimal-cluster pkg=cluster worker=1
In concrete terms, the order of the sidecars we configure globally via `sidecar_docker_images` (in our case filebeat + postgres-exporter) keeps changing within the pod spec.
We spent some time analyzing https://github.com/zalando/postgres-operator/blob/master/pkg/cluster/k8sres.go, and our assumption is that the `Sidecars` map (filled from `sidecar_docker_images` in the OperatorConfiguration CR) is the problem here:

```go
// pkg/util/config/config.go, line 114 (commit a1f2bd0)
Sidecars map[string]string `name:"sidecar_docker_images"`
```

We are no Go experts, but we figured out that the merging of global and cluster-specific sidecars by the function mergeSidecars() happens in a random order. Because we have no cluster-specific sidecars configured in our cluster manifests, we could pin this behavior down to the for loop that iterates over the global sidecars (OpConfig.Sidecars):

```go
// pkg/cluster/k8sres.go, line 1236 (commit 3c91bde)
for name, dockerImage := range c.OpConfig.Sidecars {
```
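To illustrate the non-determinism, here is a minimal standalone sketch (the map literal just mirrors our two globally configured sidecars; it is not operator code):

```go
package main

import "fmt"

func main() {
	// Stand-in for OpConfig.Sidecars as filled from sidecar_docker_images.
	sidecars := map[string]string{
		"filebeat":          "our.registry.com/filebeat:7.5.1-60eaf1c8",
		"postgres-exporter": "our.registry.com/pg-exporter:latest-60eaf1c8",
	}
	// Go intentionally randomizes map iteration order, so repeated runs
	// (or repeated sync loops in the operator) can emit the two
	// containers in a different order.
	for name, image := range sidecars {
		fmt.Println(name, image)
	}
}
```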
We have temporarily hot-fixed the issue by expanding the function mergeSidecars()

```go
// pkg/cluster/k8sres.go, line 1220 (commit 3c91bde)
func (c *Cluster) mergeSidecars(sidecars []acidv1.Sidecar) []acidv1.Sidecar {
```

so that it sorts the result before returning it. If interested, I can share the fix.
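The idea behind our hot fix, as a minimal standalone sketch (the `Sidecar` struct and `sortedSidecars` helper are stand-ins for illustration, not the actual patch against acidv1.Sidecar):

```go
package main

import (
	"fmt"
	"sort"
)

// Sidecar stands in for acidv1.Sidecar; only Name matters for ordering.
type Sidecar struct {
	Name        string
	DockerImage string
}

// sortedSidecars sorts the merged sidecar list by name, so a function
// like mergeSidecars() would return a stable, deterministic order.
func sortedSidecars(result []Sidecar) []Sidecar {
	sort.Slice(result, func(i, j int) bool {
		return result[i].Name < result[j].Name
	})
	return result
}

func main() {
	merged := []Sidecar{
		{Name: "postgres-exporter", DockerImage: "our.registry.com/pg-exporter:latest-60eaf1c8"},
		{Name: "filebeat", DockerImage: "our.registry.com/filebeat:7.5.1-60eaf1c8"},
	}
	for _, s := range sortedSidecars(merged) {
		fmt.Println(s.Name, s.DockerImage)
	}
}
```

With a deterministic order, the generated pod spec no longer flip-flops between sync loops.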
Is our assumption correct? What would be a more elegant solution here?