
Need to better handle large numbers of workflows and warn users to clean them up #2190

Closed
jimlindeman opened this issue Feb 7, 2020 · 11 comments

@jimlindeman commented Feb 7, 2020

Checklist:

  • [x] I've included the version.
  • [x] I've included reproduction steps.
  • [ ] I've included the workflow YAML.
  • [x] I've included the logs.

What happened:
On one of our K8s clusters, not all, we are getting repeated errors in the workflow controller of:

E0206 22:46:53.953386       1 reflector.go:126] github.com/argoproj/argo/workflow/controller/controller.go:156: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request

which maps to https://github.com/argoproj/argo/blob/v2.4.3/workflow/controller/controller.go#L156

Turns out we had 13,450 workflows built up over the last year sitting around in the 'workflows' namespace, because there is still no automatic garbage collection in Argo. Running argo list -n workflows would actually crash the workflow-controller pod (causing it to go into an evicted state) after it returned:
macbook-pro-2:argo lindj@us.ibm.com$ argo list -n workflows
2020/02/06 18:43:48 rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2151705869 vs. 2147483647)

I'm assuming the logic inside the workflow controller is just trying to fully fetch all the workflows at once rather than in chunks or applying a maximum upper bound, which is causing the timeout and crashes.

What you expected to happen:
I expected the wfInformer processing to report in the controller logs that it was encountering too many workflows and to recommend manual pruning. Similarly, in the error path that returns "Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request", I would expect it to report what URL/REST request it was issuing.

How to reproduce it (as minimally and precisely as possible):
Just need to create roughly 13,000+ workflows and leave them sitting in the 'workflows' namespace. The failure happens with both v2.3.0 and v2.4.3, just with a different error line number for v2.3.0, of course.

Anything else we need to know?:

Environment:

  • Argo version: v2.3.0 and v2.4.3
  • Kubernetes version:
clientVersion:
  buildDate: "2019-12-07T21:20:10Z"
  compiler: gc
  gitCommit: 70132b0f130acc0bed193d9ba59dd186f0e634cf
  gitTreeState: clean
  gitVersion: v1.17.0
  goVersion: go1.13.4
  major: "1"
  minor: "17"
  platform: darwin/amd64
serverVersion:
  buildDate: "2020-01-16T04:08:27Z"
  compiler: gc
  gitCommit: 18e8565daf60eb3a20c0ac29a7d3a93622659e4d
  gitTreeState: clean
  gitVersion: v1.14.10+IKS
  goVersion: go1.12.12
  major: "1"
  minor: "14"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
argo get <workflowname>
  • executor logs:
kubectl logs <failedpodname> -c init
kubectl logs <failedpodname> -c wait
  • workflow-controller logs:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)


Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@alexec (Contributor) commented Feb 7, 2020

Interesting problem.

One solution, obviously, is to delete old workflows. We could always increase timeouts or change code to deal with this - but then we get to 20k workflows and the problem re-appears.

Any solution therefore must involve deleting old workflows.

  • What is the use case for 10k+ workflows in your system?
  • Have you considered trying out the workflow archive feature in v2.5?
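
For reference, the workflow archive is configured through the workflow-controller-configmap. A minimal sketch, assuming a PostgreSQL backend; the host, database, and secret names below are placeholders to adapt to your environment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    persistence:
      archive: true                    # record completed workflows in the relational archive
      postgresql:
        host: postgres                 # placeholder
        port: 5432
        database: postgres             # placeholder
        tableName: argo_workflows
        userNameSecret:
          name: argo-postgres-config   # placeholder secret
          key: username
        passwordSecret:
          name: argo-postgres-config   # placeholder secret
          key: password

With archiving enabled, completed workflows are written to the database, so they can then be garbage-collected from the cluster (for example via a TTL) without losing their history.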

@jimlindeman (Author)

I don't think 10k+ workflows is healthy; it's just what happens when the person who sets up Argo for pipeline automation moves to another team/project before configuring cleanup.

Our complaint is more about the way it breaks when it gets there, as the log messages don't tell the user that they need to delete workflows. It took us several hours to figure out this was the problem, and we would have expected "argo list" to fail gracefully and warn the user that there were too many workflows for it to work properly.

@thundergolfer commented Feb 20, 2020

We ran into this issue too. We now must have ttlSecondsAfterFinished: X on every workflow we run, otherwise the workflow objects build up and you eventually can't get info out of the argo CLI.

Even worse, because we weren't cleaning up our workflow objects, any PersistentVolumes associated with them weren't getting deleted in AWS. At one point we had ~1000 EBS volumes sitting around doing nothing, their workflows having terminated long ago 😳.
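
A minimal sketch of what that looks like on a workflow spec (the template name and image are placeholders; newer Argo versions replace this field with spec.ttlStrategy):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cleanup-example-
spec:
  entrypoint: main
  ttlSecondsAfterFinished: 86400   # delete this Workflow object 24 hours after it finishes
  templates:
    - name: main
      container:
        image: alpine:3.11         # placeholder image
        command: [echo, hello]

The exact template doesn't matter; the point is that without a TTL (or manual deletion), completed Workflow objects accumulate indefinitely.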

@ghost commented May 22, 2020

Hi,
+1 this.

I am experiencing an issue where the workflow-controller eats up too much memory (up to 5GB), reaching the node's full capacity and causing it to be evicted in a loop.

I am running Argo as part of a Kubeflow deployment, and the number of pipelines running is not large (tens, not thousands).

@alexec (Contributor) commented May 22, 2020

Would a warning in the user interface be useful?

@ghost commented May 22, 2020

Hi alexec, not sure. Is there a way to estimate the required memory usage per KF pipeline/Argo workflow?

@alexec (Contributor) commented May 22, 2020

@ghost commented May 22, 2020

I might have latched onto the wrong issue, but my issue happened while initiating 10 workflows together, each one containing about a hundred pods or so.

@alexec (Contributor) commented May 22, 2020

Have you enabled podGC on your workflows?

@ghost commented May 22, 2020

If by that you mean garbage collecting finished pods, yes.

There were no pods up when this happened; I think I can reproduce this easily.
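
For context, podGC is set on the workflow spec; a minimal sketch with a placeholder template (strategy names per the Argo docs):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: podgc-example-
spec:
  entrypoint: main
  podGC:
    strategy: OnPodCompletion      # delete each pod as soon as it completes
  templates:
    - name: main
      container:
        image: alpine:3.11         # placeholder image
        command: [echo, hello]

Note that podGC only removes pods; the Workflow objects themselves still need ttlSecondsAfterFinished or manual deletion, which may explain memory growth even when no pods are up.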

@alexec (Contributor) commented Jun 30, 2020

Fixed in #3089.

alexec closed this as completed on Jun 30, 2020