
Vault Secrets Operator Manager being OOM killed on busy OpenShift cluster #949

Open
@kennedn

Description

Describe the bug
We are currently using the Vault Secrets Operator in our clusters. One cluster in particular receives more customer volume than the others, and we have recently noticed that its vault-secrets-operator-manager pod is being OOM killed after reaching the memory limit defined in the operator's CSV.

Snippet from the .status key of the OOMKilled pod's YAML:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T10:54:46Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T13:24:28Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T13:24:28Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T10:54:46Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://6ce4d44fa30e22966dbb10cc3ae1dc0df05daf5ea76942d225300b2c9fc2b982
    image: registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:8ae1e417a40fb2df575e170128267a4399f56b6bac6db8b30c5b5e2698d0e6f2
    imageID: registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:34402817de5c30fb0a2ae0055abce343bd9f84d37ad6cd4dd62820a54aeabfef
    lastState: {}
    name: kube-rbac-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-10-09T10:55:38Z"
  - containerID: cri-o://d80dfb0ca666279c66e96062bace1353ec58ea4ebc4285ba9d7bd96b3ca2ef2f
    image: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
    imageID: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
    lastState:
      terminated:
        containerID: cri-o://d80dfb0ca666279c66e96062bace1353ec58ea4ebc4285ba9d7bd96b3ca2ef2f
        exitCode: 137
        finishedAt: "2024-10-09T13:24:27Z"
        reason: OOMKilled
        startedAt: "2024-10-09T13:24:10Z"
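
For reference, container-level memory usage leading up to the kill can be checked with something like the sketch below (the namespace is an assumption; adjust it to wherever the operator is installed):

# Report per-container memory usage for the operator pod (namespace assumed).
oc -n vault-secrets-operator adm top pod --containers | grep manager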

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the Vault Secrets Operator in OpenShift.
  2. Make heavy use of the operator (we currently have 361 static secrets being synced via the operator in this cluster; a representative resource is sketched after this list).
  3. The vault-secrets-operator-manager pod begins to enter a crash loop, and the pod's YAML indicates the reason is OOMKilled.
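
For context, each synced static secret is a VaultStaticSecret resource along the lines of this sketch (the names, namespace, mount, path, and refresh interval are placeholders, not our real values):

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: example-app-secret        # placeholder name
  namespace: example-namespace    # placeholder namespace
spec:
  vaultAuthRef: example-auth      # placeholder VaultAuth reference
  type: kv-v2                     # KV v2 secrets engine
  mount: kv                       # placeholder mount
  path: apps/example-app          # placeholder secret path
  refreshAfter: 60s               # illustrative refresh interval
  destination:
    name: example-app-secret      # Kubernetes Secret to create/update
    create: true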

Application deployment:
N/A

Expected behavior
The operator's CSV has enough headroom in its memory limits to avoid out-of-memory kills of the manager pod.

Environment

  • Kubernetes version:
    • OpenShift 4.14.10
  • vault-secrets-operator version:
    • v0.5.1

Additional context
We have been able to work around this temporarily by manually doubling the memory limit for the manager container in the CSV (from 256Mi to 512Mi) at the key .spec.install.spec.deployments[].spec.template.spec.containers[]:

- args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=127.0.0.1:8080
    - --leader-elect
  command:
    - /vault-secrets-operator
  env:
    - name: OPERATOR_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: OPERATOR_POD_UID
      valueFrom:
        fieldRef:
          fieldPath: metadata.uid
  image: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
  imagePullPolicy: IfNotPresent
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8081
    initialDelaySeconds: 15
    periodSeconds: 20
  name: manager
  readinessProbe:
    httpGet:
      path: /readyz
      port: 8081
    initialDelaySeconds: 5
    periodSeconds: 10
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 10m
      memory: 128Mi
  securityContext:
    allowPrivilegeEscalation: false
  volumeMounts:
    - mountPath: /var/run/podinfo
      name: podinfo

This is not a permanent fix, though, since re-installing or upgrading the operator reinstates the original memory value. We install via OperatorHub in OpenShift, so we have no way to persist this change.
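
For reference, the manual edit amounts to something like the sketch below (the namespace, CSV name, and deployment/container indices are assumptions and need to be adjusted to the actual install):

# Sketch of the manual workaround; namespace, CSV name, and indices are assumptions.
oc -n vault-secrets-operator patch csv vault-secrets-operator.v0.5.1 \
  --type=json \
  -p '[{"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/1/resources/limits/memory", "value": "512Mi"}]'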


Labels

bug (Something isn't working), memory usage (Issues with memory consumption by the operator Pod)
