
Vault Secrets Operator Manager being OOM killed on busy OpenShift cluster #949

Open
@kennedn

Description

Describe the bug
We are currently using the Vault Secrets Operator in our clusters. One cluster in particular receives more customer volume than the others, and we have recently noticed that its vault-secrets-operator-manager pod is being OOM killed after reaching the memory limit defined in the operator's CSV.

Snippet from the .status key of the OOMKilled pod's YAML:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T10:54:46Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T13:24:28Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T13:24:28Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T10:54:46Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://6ce4d44fa30e22966dbb10cc3ae1dc0df05daf5ea76942d225300b2c9fc2b982
    image: registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:8ae1e417a40fb2df575e170128267a4399f56b6bac6db8b30c5b5e2698d0e6f2
    imageID: registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:34402817de5c30fb0a2ae0055abce343bd9f84d37ad6cd4dd62820a54aeabfef
    lastState: {}
    name: kube-rbac-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-10-09T10:55:38Z"
  - containerID: cri-o://d80dfb0ca666279c66e96062bace1353ec58ea4ebc4285ba9d7bd96b3ca2ef2f
    image: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
    imageID: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
    lastState:
      terminated:
        containerID: cri-o://d80dfb0ca666279c66e96062bace1353ec58ea4ebc4285ba9d7bd96b3ca2ef2f
        exitCode: 137
        finishedAt: "2024-10-09T13:24:27Z"
        reason: OOMKilled
        startedAt: "2024-10-09T13:24:10Z"
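
For reference, container-level memory usage leading up to the kill can be checked with something like the sketch below (the namespace is an assumption; adjust it to wherever the operator is installed):

# Report per-container memory usage for the operator pod (namespace assumed).
oc -n vault-secrets-operator adm top pod --containers | grep manager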

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the Vault Secrets Operator in OpenShift.
  2. Make heavy use of the operator (we currently have 361 static secrets being synced via the operator in this cluster; a representative resource is sketched after this list).
  3. The vault-secrets-operator-manager pod begins to enter a crash loop, and the pod's YAML indicates the reason is OOMKilled.
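
For context, each synced static secret is a VaultStaticSecret resource along the lines of this sketch (the names, namespace, mount, path, and refresh interval are placeholders, not our real values):

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: example-app-secret        # placeholder name
  namespace: example-namespace    # placeholder namespace
spec:
  vaultAuthRef: example-auth      # placeholder VaultAuth reference
  type: kv-v2                     # KV v2 secrets engine
  mount: kv                       # placeholder mount
  path: apps/example-app          # placeholder secret path
  refreshAfter: 60s               # illustrative refresh interval
  destination:
    name: example-app-secret      # Kubernetes Secret to create/update
    create: true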

Application deployment:
N/A

Expected behavior
The operator's CSV has enough headroom in its memory limits to avoid out-of-memory kills of the manager pod.

Environment

  • Kubernetes version:
    • OpenShift 4.14.10
  • vault-secrets-operator version:
    • v0.5.1

Additional context
We have been able to work around this temporarily by manually doubling the memory limit for the manager container in the CSV (from 256Mi to 512Mi) at the key .spec.install.spec.deployments[].spec.template.spec.containers[]:

- args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=127.0.0.1:8080
    - --leader-elect
  command:
    - /vault-secrets-operator
  env:
    - name: OPERATOR_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: OPERATOR_POD_UID
      valueFrom:
        fieldRef:
          fieldPath: metadata.uid
  image: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
  imagePullPolicy: IfNotPresent
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8081
    initialDelaySeconds: 15
    periodSeconds: 20
  name: manager
  readinessProbe:
    httpGet:
      path: /readyz
      port: 8081
    initialDelaySeconds: 5
    periodSeconds: 10
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 10m
      memory: 128Mi
  securityContext:
    allowPrivilegeEscalation: false
  volumeMounts:
    - mountPath: /var/run/podinfo
      name: podinfo

This is not a permanent fix, though, since re-installing or upgrading the operator reinstates the original memory value. We install via OperatorHub in OpenShift, so we have no way to persist this change.
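
For reference, the manual edit amounts to something like the sketch below (the namespace, CSV name, and deployment/container indices are assumptions and need to be adjusted to the actual install):

# Sketch of the manual workaround; namespace, CSV name, and indices are assumptions.
oc -n vault-secrets-operator patch csv vault-secrets-operator.v0.5.1 \
  --type=json \
  -p '[{"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/1/resources/limits/memory", "value": "512Mi"}]'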


Labels

bug (Something isn't working), memory usage (Issues with memory consumption by the operator Pod)
