Problem with overriding statefulset readiness probe #1698

Open
bkelava opened this issue Aug 13, 2024 · 5 comments
Labels
bug (Something isn't working), stale (Issue or PR with long period of inactivity)

Comments

bkelava commented Aug 13, 2024

Describe the bug

Overriding the StatefulSet readiness probe from tcpSocket to exec leaves the tcpSocket handler in the resulting config.

To Reproduce

kubectl apply -f cluster-test.yml

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-test
spec:
  replicas: 5
  image: 172.17.12.132:9110/rabbitmq/rabbitmq:3.13.4-management
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi
  persistence:
    storageClassName: nfs-rabbitmq-test-storage
    storage: "10Gi"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - servisi-0023
            - servisi-0024
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                livenessProbe:
                  exec:
                    command:
                      - rabbitmq-diagnostics
                      - status
                  initialDelaySeconds: 60
                  periodSeconds: 60
                  timeoutSeconds: 15
                readinessProbe:
                  exec:
                    command:
                    - rabbitmq-diagnostics
                    - ping
                  initialDelaySeconds: 20
                  periodSeconds: 60
                  timeoutSeconds: 10
                securityContext:
                  allowPrivilegeEscalation: false
                  capabilities:
                    add:
                      - CHOWN
                  privileged: false
                  procMount: Default
                  readOnlyRootFilesystem: false
                  runAsNonRoot: false
                  runAsUser: 999
                  runAsGroup: 100
                volumeMounts:
                  - name: definitions-json
                    mountPath: /etc/rabbitmq/definitions.json
                    subPath: definitions.json
                  - name: rabbitmq-conf
                    mountPath: /etc/rabbitmq/rabbitmq.conf
                    subPath: rabbitmq.conf
                      #- name: rabbitmq-data
                      #mountPath: /var/lib/rabbitmq
            securityContext:
              fsGroup: 100
              runAsNonRoot: true
              runAsUser: 999
              runAsGroup: 100
            volumes:
              - name: definitions-json
                configMap:
                  name: rabbitmq-configmap
                  items:
                    - key: definitions.json
                      path: definitions.json
              - name: rabbitmq-conf
                configMap:
                  name: rabbitmq-configmap
                  items:
                    - key: rabbitmq.conf
                      path: rabbitmq.conf
                        #volumeClaimTemplates:
                        #- metadata:
                        #name: rabbitmq-data
                        #  annotations:
                        # volume.alpha.kubernetes.io/storage-class: nfs-rabbitmq-test-storage
                        #      spec:
                        #     accessModes:
                        #  - ReadWriteOnce
                        #  storageClassName: nfs-rabbitmq-test-storage
                        #  resources:
                        #       requests:
                        #        storage: 10Gi

kubectl get statefulset rabbitmq-test-server -o yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    rabbitmq.com/createdAt: "2024-08-13T09:42:31Z"
  creationTimestamp: "2024-08-13T09:42:31Z"
  generation: 1
  labels:
    app.kubernetes.io/component: rabbitmq
    app.kubernetes.io/name: rabbitmq-test
    app.kubernetes.io/part-of: rabbitmq
  name: rabbitmq-test-server
  namespace: rabbitmq-test
  ownerReferences:
  - apiVersion: rabbitmq.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: RabbitmqCluster
    name: rabbitmq-test
    uid: 073ca32b-3fb0-4c92-a0b5-b840c679e36a
  resourceVersion: "23728935"
  uid: 704acd08-39cd-4507-b731-9d4f66c1813c
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: rabbitmq-test
  serviceName: rabbitmq-test-nodes
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: rabbitmq
        app.kubernetes.io/name: rabbitmq-test
        app.kubernetes.io/part-of: rabbitmq
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - servisi-0023
                - servisi-0024
      automountServiceAccountToken: true
      containers:
      - env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: K8S_SERVICE_NAME
          value: rabbitmq-test-nodes
        - name: RABBITMQ_ENABLED_PLUGINS_FILE
          value: /operator/enabled_plugins
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: RABBITMQ_NODENAME
          value: rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE)
        - name: K8S_HOSTNAME_SUFFIX
          value: .$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE)
        image: 172.17.12.132:9110/rabbitmq/rabbitmq:3.13.4-management
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - if [ ! -z "$(cat /etc/pod-info/skipPreStopChecks)" ]; then exit 0;
                fi; rabbitmq-upgrade await_online_quorum_plus_one -t 604800 && rabbitmq-upgrade
                await_online_synchronized_mirror -t 604800 && rabbitmq-upgrade drain
                -t 604800
        livenessProbe:
          exec:
            command:
            - rabbitmq-diagnostics
            - status
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 15
        name: rabbitmq
        ports:
        - containerPort: 4369
          name: epmd
          protocol: TCP
        - containerPort: 5672
          name: amqp
          protocol: TCP
        - containerPort: 15672
          name: management
          protocol: TCP
        - containerPort: 15692
          name: prometheus
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - rabbitmq-diagnostics
            - ping
          failureThreshold: 3
          initialDelaySeconds: 20
          periodSeconds: 60
          successThreshold: 1
          tcpSocket:
            port: amqp
          timeoutSeconds: 10
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 1Gi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - CHOWN
          privileged: false
          procMount: Default
          readOnlyRootFilesystem: false
          runAsGroup: 100
          runAsNonRoot: false
          runAsUser: 999
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/rabbitmq/
          name: rabbitmq-erlang-cookie
        - mountPath: /var/lib/rabbitmq/mnesia/
          name: persistence
        - mountPath: /etc/rabbitmq/definitions.json
          name: definitions-json
          subPath: definitions.json
        - mountPath: /etc/rabbitmq/rabbitmq.conf
          name: rabbitmq-conf
          subPath: rabbitmq.conf
        - mountPath: /operator
          name: rabbitmq-plugins
        - mountPath: /etc/rabbitmq/conf.d/10-operatorDefaults.conf
          name: rabbitmq-confd
          subPath: operatorDefaults.conf
        - mountPath: /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf
          name: rabbitmq-confd
          subPath: userDefinedConfiguration.conf
        - mountPath: /etc/pod-info/
          name: pod-info
        - mountPath: /etc/rabbitmq/conf.d/11-default_user.conf
          name: rabbitmq-confd
          subPath: default_user.conf
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sh
        - -c
        - cp /tmp/erlang-cookie-secret/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
          && chmod 600 /var/lib/rabbitmq/.erlang.cookie ; cp /tmp/rabbitmq-plugins/enabled_plugins
          /operator/enabled_plugins ; echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf
          && sed -e 's/default_user/username/' -e 's/default_pass/password/' /tmp/default_user.conf
          >> /var/lib/rabbitmq/.rabbitmqadmin.conf && chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf
          ; sleep 30
        image: 172.17.12.132:9110/rabbitmq/rabbitmq:3.13.4-management
        imagePullPolicy: IfNotPresent
        name: setup-container
        resources:
          limits:
            cpu: 100m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp/rabbitmq-plugins/
          name: plugins-conf
        - mountPath: /var/lib/rabbitmq/
          name: rabbitmq-erlang-cookie
        - mountPath: /tmp/erlang-cookie-secret/
          name: erlang-cookie-secret
        - mountPath: /operator
          name: rabbitmq-plugins
        - mountPath: /var/lib/rabbitmq/mnesia/
          name: persistence
        - mountPath: /tmp/default_user.conf
          name: rabbitmq-confd
          subPath: default_user.conf
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 100
        runAsGroup: 100
        runAsNonRoot: true
        runAsUser: 999
      serviceAccount: rabbitmq-test-server
      serviceAccountName: rabbitmq-test-server
      terminationGracePeriodSeconds: 604800
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: rabbitmq-test
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: definitions.json
            path: definitions.json
          name: rabbitmq-configmap
        name: definitions-json
      - configMap:
          defaultMode: 420
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          name: rabbitmq-configmap
        name: rabbitmq-conf
      - configMap:
          defaultMode: 420
          name: rabbitmq-test-plugins-conf
        name: plugins-conf
      - name: rabbitmq-confd
        projected:
          defaultMode: 420
          sources:
          - configMap:
              items:
              - key: operatorDefaults.conf
                path: operatorDefaults.conf
              - key: userDefinedConfiguration.conf
                path: userDefinedConfiguration.conf
              name: rabbitmq-test-server-conf
          - secret:
              items:
              - key: default_user.conf
                path: default_user.conf
              name: rabbitmq-test-default-user
      - emptyDir: {}
        name: rabbitmq-erlang-cookie
      - name: erlang-cookie-secret
        secret:
          defaultMode: 420
          secretName: rabbitmq-test-erlang-cookie
      - emptyDir: {}
        name: rabbitmq-plugins
      - downwardAPI:
          defaultMode: 420
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['skipPreStopChecks']
            path: skipPreStopChecks
        name: pod-info
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: rabbitmq
        app.kubernetes.io/name: rabbitmq-test
        app.kubernetes.io/part-of: rabbitmq
      name: persistence
      namespace: rabbitmq-test
      ownerReferences:
      - apiVersion: rabbitmq.com/v1beta1
        blockOwnerDeletion: false
        controller: true
        kind: RabbitmqCluster
        name: rabbitmq-test
        uid: 073ca32b-3fb0-4c92-a0b5-b840c679e36a
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: nfs-rabbitmq-test-storage
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 0
  collisionCount: 0
  currentRevision: rabbitmq-test-server-5b4fd5484d
  observedGeneration: 1
  replicas: 0
  updateRevision: rabbitmq-test-server-5b4fd5484d

The StatefulSet did not replace the readiness probe. Instead, it keeps both the exec and tcpSocket handlers, as follows:

        readinessProbe:
          exec:
            command:
            - rabbitmq-diagnostics
            - ping
          failureThreshold: 3
          initialDelaySeconds: 20
          periodSeconds: 60
          successThreshold: 1
          tcpSocket:
            port: amqp
          timeoutSeconds: 10

which results in the following error:

Events:
  Type     Reason            Age                    From                    Message
  ----     ------            ----                   ----                    -------
  Normal   SuccessfulCreate  9m31s                  statefulset-controller  create Claim persistence-rabbitmq-test-server-0 Pod rabbitmq-test-server-0 in StatefulSet rabbitmq-test-server success
  Warning  FailedCreate      4m4s (x17 over 9m31s)  statefulset-controller  create Pod rabbitmq-test-server-0 in StatefulSet rabbitmq-test-server failed error: Pod "rabbitmq-test-server-0" is invalid: spec.containers[0].readinessProbe.tcpSocket: Forbidden: may not specify more than 1 handler type
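
One way to picture the symptom above is a key-by-key merge of the override into the operator's default probe. The sketch below is only an illustration under that assumption; it does not reproduce the operator's actual merge code, and `deep_merge`, `handlers`, and `default_probe` are hypothetical names. The default values are taken from this thread (tcpSocket on the amqp port, initialDelaySeconds of 10).

```python
from copy import deepcopy

# Handler keys a Kubernetes probe may carry; a valid probe sets exactly one.
HANDLER_KEYS = ("exec", "httpGet", "tcpSocket", "grpc")


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base, key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def handlers(probe: dict) -> list:
    """Return the handler types present in a probe, sorted by name."""
    return sorted(k for k in HANDLER_KEYS if k in probe)


# Assumed operator default, based on the generated StatefulSet in this issue.
default_probe = {"tcpSocket": {"port": "amqp"}, "initialDelaySeconds": 10}

# readinessProbe override from the RabbitmqCluster manifest in this issue.
override_probe = {
    "exec": {"command": ["rabbitmq-diagnostics", "ping"]},
    "initialDelaySeconds": 20,
    "periodSeconds": 60,
    "timeoutSeconds": 10,
}

merged_probe = deep_merge(default_probe, override_probe)
print(handlers(merged_probe))  # ['exec', 'tcpSocket']
```

Scalar fields such as initialDelaySeconds are overwritten, but nothing in a merge like this removes the tcpSocket key, so the merged probe carries two handlers, which matches the "may not specify more than 1 handler type" rejection.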

Patching the StatefulSet is one way to fix this, but it is not ideal. Please help.
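
For reference, the patching workaround mentioned here could look roughly like the JSON Patch built below. This is a sketch, not a tested procedure: the container index 0 comes from the error message (`spec.containers[0]`), `apply_remove` is a hypothetical helper used only to preview the effect on a trimmed copy of the object, and the thread does not say whether the operator restores the field on its next reconcile.

```python
import json
from copy import deepcopy

# JSON Patch (RFC 6902) that removes the leftover tcpSocket handler.
# Index 0 matches "spec.containers[0]" in the error message above.
patch = [
    {
        "op": "remove",
        "path": "/spec/template/spec/containers/0/readinessProbe/tcpSocket",
    }
]


def apply_remove(doc: dict, path: str) -> dict:
    """Apply a single JSON Patch 'remove' operation to a copy of doc."""
    result = deepcopy(doc)
    *parents, last = path.strip("/").split("/")
    node = result
    for part in parents:
        node = node[int(part)] if isinstance(node, list) else node[part]
    if isinstance(node, list):
        del node[int(last)]
    else:
        del node[last]
    return result


# Trimmed stand-in for the StatefulSet shown in this issue.
statefulset = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "rabbitmq",
        "readinessProbe": {
            "exec": {"command": ["rabbitmq-diagnostics", "ping"]},
            "tcpSocket": {"port": "amqp"},
        },
    }]}}}
}

patched = apply_remove(statefulset, patch[0]["path"])
probe = patched["spec"]["template"]["spec"]["containers"][0]["readinessProbe"]
print(sorted(probe))      # ['exec']
print(json.dumps(patch))  # body for: kubectl patch ... --type=json -p '<body>'
```

The printed JSON is the kind of body that would be passed to `kubectl patch statefulset rabbitmq-test-server -n rabbitmq-test --type=json -p '...'`.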

bkelava added the bug label Aug 13, 2024
mkuratczyk (Collaborator) commented

While allowing the probe to be overridden is something we can consider, can you explain what you are trying to accomplish here? Why do you expect rabbitmq-diagnostics ping to be a better readiness probe? In which situations would it be better?

sudhirjena commented Sep 11, 2024

@mkuratczyk, we are facing the same issue when overriding readinessProbe.initialDelaySeconds. We are deploying RabbitMQ on an EKS + Fargate cluster, where the intrinsic scheduling takes about 100 seconds. With the default readinessProbe.initialDelaySeconds of 10s, we hit this error every time the RabbitMQ pod is scheduled:

Readiness probe failed: dial tcp 10.35.177.155:5672: connect: connection refused

bkelava (Author) commented Sep 11, 2024

@sudhirjena

I've temporarily fixed the error by commenting out readinessProbe as follows:

...
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                livenessProbe:
                  exec:
                    command:
                      - rabbitmq-diagnostics
                      - status
                  initialDelaySeconds: 60
                  periodSeconds: 60
                  timeoutSeconds: 15
                # readinessProbe:
                #   tcpSocket:
                #     port: 22
                #   # exec:
                #   #   command:
                #   #   - rabbitmq-diagnostics
                #   #   - ping
                #   initialDelaySeconds: 20
                #   periodSeconds: 60
                #   timeoutSeconds: 10
                securityContext:
                  allowPrivilegeEscalation: false
                  capabilities:
                    add:
                      - CHOWN
                  privileged: false
                  procMount: Default
                  readOnlyRootFilesystem: false
                  runAsNonRoot: false
                  runAsUser: 999
                  runAsGroup: 100
...

The cluster started without errors:

NAME                         READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
pod/rabbitmq-test-server-0   1/1     Running   0          7d13h   10.33.128.2   servisi-0023   <none>           <none>
pod/rabbitmq-test-server-1   1/1     Running   0          7d13h   10.33.128.3   servisi-0023   <none>           <none>
pod/rabbitmq-test-server-2   1/1     Running   0          7d13h   10.35.128.3   servisi-0024   <none>           <none>
pod/rabbitmq-test-server-3   1/1     Running   0          7d13h   10.33.128.4   servisi-0023   <none>           <none>
pod/rabbitmq-test-server-4   1/1     Running   0          7d13h   10.35.128.2   servisi-0024   <none>           <none>

NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                        AGE   SELECTOR
service/rabbitmq-test         ClusterIP   10.245.98.250   <none>        5672/TCP,15672/TCP,15692/TCP   27d   app.kubernetes.io/name=rabbitmq-test
service/rabbitmq-test-nodes   ClusterIP   None            <none>        4369/TCP,25672/TCP             27d   app.kubernetes.io/name=rabbitmq-test

But as always, a temporary solution might become a permanent one 🥇

mkuratczyk (Collaborator) commented

We are not against the idea, so PRs welcome. This is an open source project, you don't have to wait for us to get around to implementing this.

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

github-actions bot added the stale label Nov 11, 2024