Need to better handle large numbers of workflows and warn users to clean them up #2190
Comments
Interesting problem. One solution, obviously, is to delete old workflows. We could always increase timeouts or change code to deal with this - but then we get to 20k workflows and the problem re-appears. Any solution therefore must involve deleting old workflows.
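For anyone landing here, one way to keep old workflows from piling up is to give them a TTL so the controller deletes them some time after they finish. A minimal sketch follows, with the caveat that the exact field depends on the Argo version in use (older 2.x releases use spec.ttlSecondsAfterFinished, newer ones use spec.ttlStrategy), and the names and values below are just placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ttl-demo-
  namespace: workflows
spec:
  entrypoint: main
  # Ask the controller to delete the Workflow object a day after it
  # completes, so finished workflows don't accumulate indefinitely.
  # (Older releases: spec.ttlSecondsAfterFinished: 86400 instead.)
  ttlStrategy:
    secondsAfterCompletion: 86400
  templates:
    - name: main
      container:
        image: alpine:3.11
        command: [echo, "hello"]
```

For existing deployments that can't change every workflow spec, an external cron job that deletes completed workflows older than some age accomplishes the same thing.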
I don't think 10k+ workflows is healthy; it's just the behavior that occurs when the person who sets up Argo in pipeline automation moves to another team/project before configuring cleanup. Our complaint is more about the way it breaks when it gets there, as the log messages don't tell the user they need to delete workflows. It took us several hours to figure out this was the problem, and we would have expected argo list to fail gracefully and warn the user there were too many workflows to work properly.
We ran into this issue too. We now must have … Even worse, because we weren't cleaning up our …
Hi, I am experiencing an issue where the workflow-controller eats up too much memory (up to 5 GB), reaching the node's full capacity and causing it to be evicted in a loop. I am running Argo as part of a Kubeflow deployment, and the number of pipelines running is much smaller (tens, not thousands).
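Not a fix for the root cause, but if the controller keeps getting evicted under node memory pressure, one common mitigation is to give it explicit resource requests and limits so it lands on a node with enough headroom and fails in a more contained way. A rough sketch of a fragment to merge into the controller's Deployment; the name, namespace, and sizes here are assumptions, so adjust them to your Kubeflow install:

```yaml
# Fragment to merge into the existing workflow-controller Deployment
# (e.g. with kubectl patch). Name, namespace, and sizes are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              memory: 2Gi
```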
Would a warning in the user interface be useful?
Hi alexec, not sure. Is there a way to estimate the required memory usage per KF pipeline / Argo workflow?
I might have picked the wrong issue, but my issue happened while initiating 10 workflows together, each one containing about a hundred pods or so.
Have you enabled podGC on your workflows?
If by that you mean garbage-collecting finished pods, yes. There were no pods up when this happened. I think I can reproduce this easily.
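For context, "enabling podGC" usually means setting a pod garbage-collection strategy on the workflow spec, roughly like the sketch below (the strategy shown is one of several options, and the image/values are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: podgc-demo-
spec:
  entrypoint: main
  # Delete pods as soon as they finish; other strategies include
  # OnPodSuccess, OnWorkflowCompletion and OnWorkflowSuccess.
  podGC:
    strategy: OnPodCompletion
  templates:
    - name: main
      container:
        image: alpine:3.11
        command: [echo, "hello"]
```

Note that podGC only removes the pods; the Workflow objects themselves still accumulate unless a TTL or an external cleanup deletes them, which is why the controller can still end up listing thousands of workflows.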
Fixed in #3089. |
What happened:
On one of our K8s clusters (not all), we are getting repeated errors in the workflow controller of:
Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request
which maps to https://github.com/argoproj/argo/blob/v2.4.3/workflow/controller/controller.go#L156
It turns out we had 13,450 workflows built up over the last year sitting around in the 'workflows' namespace, because there is still no automatic garbage collection in Argo. Running argo list -n workflows would actually crash the workflow-controller pod (causing it to go into an Evicted state) after it returned:
macbook-pro-2:argo lindj@us.ibm.com$ argo list -n workflows
2020/02/06 18:43:48 rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2151705869 vs. 2147483647)
I'm assuming the logic inside the workflow controller is trying to fetch all of the workflows at once, rather than fetching them in chunks or applying an upper bound, which is causing the timeout and crashes.
What you expected to happen:
I expected the processing by the wfInformer to report it was encountering too many workflows and recommend manual pruning of them in the controller logs. Similarly, in the error path that returns "Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request", I would expect it to report what URL/REST-request it was issuing.
How to reproduce it (as minimally and precisely as possible):
Just create roughly 13,000+ workflows and leave them sitting in the 'workflows' namespace. The failure happens with both v2.3.0 and v2.4.3, just with a different error line number for v2.3.0, of course.
Anything else we need to know?:
Environment:
Other debugging information (if applicable):
Logs
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.