Description
Describe the bug
We are currently using Vault Secrets Operator in our clusters. One cluster handles more customer volume than the others, and we have recently noticed that its vault-secrets-operator-manager pod is being OOM killed after reaching the memory limit set in the operator's CSV (ClusterServiceVersion).
Snippet from the `.status` key of the OOMKilled pod's YAML:
```yaml
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T10:54:46Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T13:24:28Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T13:24:28Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T10:54:46Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://6ce4d44fa30e22966dbb10cc3ae1dc0df05daf5ea76942d225300b2c9fc2b982
    image: registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:8ae1e417a40fb2df575e170128267a4399f56b6bac6db8b30c5b5e2698d0e6f2
    imageID: registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:34402817de5c30fb0a2ae0055abce343bd9f84d37ad6cd4dd62820a54aeabfef
    lastState: {}
    name: kube-rbac-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-10-09T10:55:38Z"
  - containerID: cri-o://d80dfb0ca666279c66e96062bace1353ec58ea4ebc4285ba9d7bd96b3ca2ef2f
    image: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
    imageID: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
    lastState:
      terminated:
        containerID: cri-o://d80dfb0ca666279c66e96062bace1353ec58ea4ebc4285ba9d7bd96b3ca2ef2f
        exitCode: 137
        finishedAt: "2024-10-09T13:24:27Z"
        reason: OOMKilled
        startedAt: "2024-10-09T13:24:10Z"
```
To Reproduce
Steps to reproduce the behavior:
- Deploy Vault Secrets Operator in OpenShift.
- Make heavy use of the operator (we currently have 361 static secrets being synced via the operator in this cluster; see the sketch after this list).
- The Vault Secrets Operator manager pod enters a crash loop, and the pod's YAML indicates the reason is OOMKilled.
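For reference, each of those static secrets corresponds to a VaultStaticSecret custom resource roughly like the minimal sketch below. The names, mount, path, and refresh interval are illustrative placeholders, not our actual manifests:

```yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: example-app-config        # illustrative name
  namespace: example-namespace    # illustrative namespace
spec:
  vaultAuthRef: vault-auth        # illustrative VaultAuth reference
  mount: kv                       # illustrative KV mount
  type: kv-v2
  path: example-app/config        # illustrative secret path
  refreshAfter: 60s               # illustrative refresh interval
  destination:
    name: example-app-config      # Kubernetes Secret created/updated by the operator
    create: true
```

Our assumption is that the manager's memory footprint grows with the number of these resources it watches and syncs, which is what eventually pushes it past the default 256Mi limit.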
Application deployment:
N/A
Expected behavior
The operator's CSV has enough headroom in its memory limits to avoid out-of-memory kills of the manager pod.
Environment
- Kubernetes version:
  - OpenShift 4.14.10
- vault-secrets-operator version:
  - v0.5.1
Additional context
We have been able to temporarily work around this issue by manually doubling the memory limit for the manager container in the CSV (from 256Mi to 512Mi) at key `.spec.install.spec.deployments[].spec.template.spec.containers[]`:
```yaml
- args:
  - --health-probe-bind-address=:8081
  - --metrics-bind-address=127.0.0.1:8080
  - --leader-elect
  command:
  - /vault-secrets-operator
  env:
  - name: OPERATOR_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: OPERATOR_POD_UID
    valueFrom:
      fieldRef:
        fieldPath: metadata.uid
  image: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
  imagePullPolicy: IfNotPresent
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8081
    initialDelaySeconds: 15
    periodSeconds: 20
  name: manager
  readinessProbe:
    httpGet:
      path: /readyz
      port: 8081
    initialDelaySeconds: 5
    periodSeconds: 10
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 10m
      memory: 128Mi
  securityContext:
    allowPrivilegeEscalation: false
  volumeMounts:
  - mountPath: /var/run/podinfo
    name: podinfo
```
This is not a permanent fix, though, since reinstalling or upgrading the operator reinstates the original memory value. We install via OperatorHub in OpenShift, so we do not have a way to make this change persistent.
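If OLM's Subscription `config` stanza honors resource overrides for this operator (we have not verified this, and it may apply to every container in the pod rather than just manager), something like the following might persist the higher limit across upgrades. The Subscription name, namespace, channel, and catalog source below are assumptions based on a typical OperatorHub install, not our exact values:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: vault-secrets-operator        # assumed Subscription name
  namespace: openshift-operators      # assumed install namespace
spec:
  channel: stable                     # assumed channel
  name: vault-secrets-operator
  source: certified-operators         # assumed catalog source
  sourceNamespace: openshift-marketplace
  config:
    resources:                        # assumed to override the operator deployment's containers
      requests:
        memory: 128Mi
      limits:
        memory: 512Mi
```

Even if that works, the default limit shipped in the CSV still seems too low for clusters syncing this many secrets.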