v3.5.3-v3.5.10: Controller memory keeps going up due to Workflows stuck in Running #13505

Open · 3 of 4 tasks
romanglo opened this issue Aug 26, 2024 · 3 comments
Labels
area/controller · solution/duplicate · type/bug · type/regression

Comments


romanglo commented Aug 26, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Issue Description

After upgrading the Argo Workflows controller and server from v3.5.2 to v3.5.10, we observed a significant increase in resource utilization, particularly memory usage. The controller's memory usage grew without bound until it reached an Out of Memory (OOM) state; this was tested with up to 100 GiB of memory.

[Two screenshots from 2024-08-13 showing the controller's memory usage graphs]

Environment Details

  • Argo Workflows version: Upgraded from v3.5.2 to v3.5.10
  • Persistence layer: PostgreSQL
  • Configuration (see the sketch after this list):
    • nodeStatusOffLoad: true
    • archive: true
    • workflowWorkers: 128
    • workflowTTLWorkers: 16
    • podCleanupWorkers: 32
    • DEFAULT_REQUEUE_TIME: 1m
    • QPS: 100
    • Burst: 150
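
For context, a minimal sketch of how the settings above are typically expressed in a default Argo Workflows install is shown below. The resource name, namespace, and PostgreSQL connection details are assumptions/placeholders, not taken from the reporter's cluster; the worker counts, QPS/Burst, and requeue time are normally set as workflow-controller flags and environment variables rather than ConfigMap keys (noted in the trailing comments).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap   # default name in an Argo install; assumed here
  namespace: argo                        # assumed namespace
data:
  persistence: |
    nodeStatusOffLoad: true    # offload large node statuses to the database
    archive: true              # archive completed Workflows
    postgresql:
      host: postgres.example.svc    # placeholder connection details
      port: 5432
      database: argo
      tableName: argo_workflows
# Worker counts and client-side throttling are controller container args / env, e.g.:
#   --workflow-workers=128 --workflow-ttl-workers=16 --pod-cleanup-workers=32
#   --qps=100 --burst=150
#   env: DEFAULT_REQUEUE_TIME=1m
```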

Root Cause

The root cause was identified as 42 Workflows that were stuck, for which the controller logged the following error:

```
level=error msg="Unable to set ExecWorkflow" error="failed to set global parameter cool-configs from configmap with name convert and key configs.json: ConfigMap 'configmap.convert' does not exist. Please make sure it has the label workflows.argoproj.io/configmap-type: Parameter to be detectable by the controller" namespace=default workflow=cool-workflow-5726g
```
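
For reference, this is the shape of ConfigMap the controller is looking for. Only the name (convert), namespace (default), key (configs.json), and the required label come from the error message above; the data content is a made-up placeholder.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: convert
  namespace: default
  labels:
    # Without this label the controller does not pick up the ConfigMap and
    # reports it as non-existent, as in the error above.
    workflows.argoproj.io/configmap-type: Parameter
data:
  configs.json: |
    {"example": "placeholder"}
```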

The issue may be related to a known problem discussed in #12993.

Additional information:
  • These Workflows were in the Succeeded state and had been archived successfully.
  • Most of them are over a month old.
  • The issue appears to have been introduced in v3.5.3, as v3.5.2 handled these same Workflows without the runaway memory growth.
  • Nothing unusual showed up in the metrics; these Workflows did not appear in the queue metrics either.

"Zombie" Workflows Cleanup

Running kubectl delete on these Workflows didn't work, because the ArtifactGC finalizer prevents their deletion (#10840). After manually removing the finalizer, the "zombie" Workflows were deleted and the controller's memory started to decrease.
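
As a hedged sketch of that cleanup, assuming the stuck Workflow lives in the default namespace (the name below is taken from the error log above) and that the blocking finalizer is Argo's artifact GC finalizer per #10840:

```bash
# Inspect which finalizers are blocking deletion
# (expected, as an assumption: workflows.argoproj.io/artifact-gc).
kubectl get workflow cool-workflow-5726g -n default \
  -o jsonpath='{.metadata.finalizers}'

# Strip the finalizers so the pending delete can complete.
kubectl patch workflow cool-workflow-5726g -n default \
  --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
```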

Version(s)

v3.5.3, v3.5.4, v3.5.5, v3.5.6, v3.5.7, v3.5.8, v3.5.9, v3.5.10

@romanglo added the type/bug and type/regression labels on Aug 26, 2024

agilgur5 commented Aug 26, 2024

This is a follow-up from Slack, but I think my directions got a little mixed or forgotten in the delay 😅

> The issue may be related to a known problem discussed in #12993.

Per Slack, yes, that would be the root cause, and the memory going up due to stuck Workflows is a symptom. It's also not a memory leak in the Controller; there are actually more Workflows, so the memory goes up. The stuck Workflows are due to a reconciliation bug rather than a memory leak.

> The root cause was identified as 42 Workflows that were stuck, for which the controller logged the following error:

I had said to post this error message in that issue, as the fix would need to catch it. A reproducible Workflow could help as a regression test, but that part is actually missing in this issue 😅

Since that issue has now been closed, we can leave this one open until the fix is confirmed on :latest, but we will need a repro for that.

@agilgur5 added the area/controller and solution/duplicate labels on Aug 26, 2024
@agilgur5 changed the title from "Potential memory leak issue starting from v3.5.2 (tested up to v3.5.10, inclusive)" to "v3.5.3+: Controller memory keeps going up due to Workflows stuck in Running" on Aug 26, 2024

alexec commented Sep 8, 2024

The Workflow Controller should not leak memory, even if there's something weird going on with running Workflows.

The best thing to do is to use pprof. This has helped fix similar issues in the past.

https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/

Enable pprof using env PPROF=true. The controller will start pprof on port 6060. You can then capture diagnostics that should point to the culprit.

This is covered in the stress testing doc.
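
For anyone reproducing this, a minimal sketch of that capture, assuming a default install (the argo namespace, Deployment name, and local port-forward are assumptions; PPROF=true and port 6060 come from the comment above):

```bash
# Turn on pprof in the controller (it then serves pprof on :6060).
kubectl set env deployment/workflow-controller -n argo PPROF=true

# Expose the pprof endpoint locally.
kubectl port-forward deployment/workflow-controller -n argo 6060:6060 &

# Summarize the current heap; repeat over time to see what keeps growing.
go tool pprof -top http://localhost:6060/debug/pprof/heap

# Or save raw profiles for later comparison (e.g. with `go tool pprof -diff_base`).
curl -s -o heap-$(date +%s).pprof http://localhost:6060/debug/pprof/heap
```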


alexec commented Sep 8, 2024

50Gi is a crazy amount of memory for the controller to use.

@agilgur5 changed the title from "v3.5.3+: Controller memory keeps going up due to Workflows stuck in Running" to "v3.5.3: Controller memory keeps going up due to Workflows stuck in Running" on Oct 8, 2024
@agilgur5 changed the title from "v3.5.3: Controller memory keeps going up due to Workflows stuck in Running" to "v3.5.10: Controller memory keeps going up due to Workflows stuck in Running" on Oct 8, 2024
@agilgur5 changed the title from "v3.5.10: Controller memory keeps going up due to Workflows stuck in Running" to "v3.5.3-v3.5.10: Controller memory keeps going up due to Workflows stuck in Running" on Oct 8, 2024