v3.5.3-v3.5.10: Controller memory keeps going up due to Workflows stuck in Running
#13505
Open
3 of 4 tasks
Labels
area/controller: Controller issues, panics
solution/duplicate: This issue or PR is a duplicate of an existing one
type/bug
type/regression: Regression from previous behavior (a specific type of bug)
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
Issue Description
After upgrading the Argo Workflows controller and server from v3.5.2 to v3.5.10, we observed a significant increase in resource utilization, particularly memory. The controller's memory usage grew without bound until it ran out of memory (OOM), even when tested with up to 100GiB.
Environment Details
Root Cause
The root cause was identified as 42 workflows that were stuck in the following state:
The issue may be related to a known problem discussed in #12993.
Additional information:
Success state and Archived successfully.
"Zombie" Workflows Cleanup
Running kubectl delete on these workflows didn't work, because the ArtifactGC prevents their deletion (#10840). After manually deleting the ArtifactGC, the "zombie" workflows were deleted and the memory started to decrease.
Version(s)
v3.5.3, v3.5.4, v3.5.5, v3.5.6, v3.5.7, v3.5.8, v3.5.9, v3.5.10
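For anyone repeating the "zombie" workflow cleanup described above, here is a minimal sketch of how the blocked workflows might be located and unblocked. It rests on assumptions that are not stated in this report: that "manually deleting the ArtifactGC" means removing the artifact GC finalizer from each workflow, and that the finalizer is named workflows.argoproj.io/artifact-gc (please verify this on your own cluster). The helper names are hypothetical. The script only prints kubectl patch commands for review and does not touch the cluster itself.

```python
import json
import shlex

# Assumption: name of the artifact GC finalizer; check your workflows' metadata.
ARTIFACT_GC_FINALIZER = "workflows.argoproj.io/artifact-gc"


def blocked_workflows(workflow_list, finalizer=ARTIFACT_GC_FINALIZER):
    """Return workflows that are marked for deletion but still hold `finalizer`.

    `workflow_list` is the parsed JSON of `kubectl get workflows -A -o json`.
    """
    blocked = []
    for wf in workflow_list.get("items", []):
        meta = wf.get("metadata", {})
        finalizers = meta.get("finalizers") or []
        if meta.get("deletionTimestamp") and finalizer in finalizers:
            blocked.append(wf)
    return blocked


def finalizer_patch(workflow, finalizer=ARTIFACT_GC_FINALIZER):
    """Build a JSON merge patch that keeps every finalizer except `finalizer`."""
    current = workflow["metadata"].get("finalizers") or []
    remaining = [f for f in current if f != finalizer]
    return {"metadata": {"finalizers": remaining}}


def patch_command(workflow, finalizer=ARTIFACT_GC_FINALIZER):
    """Render a `kubectl patch` command line to review before running it."""
    meta = workflow["metadata"]
    patch = json.dumps(finalizer_patch(workflow, finalizer), separators=(",", ":"))
    return "kubectl patch workflow {} -n {} --type merge -p {}".format(
        shlex.quote(meta["name"]),
        shlex.quote(meta["namespace"]),
        shlex.quote(patch),
    )
```

To use it, parse the output of kubectl get workflows -A -o json with json.load, pass the result to blocked_workflows, and print patch_command for each returned item. Removing a finalizer skips whatever artifact cleanup it was guarding, so any artifacts left behind would need to be cleaned up separately.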