
Need to better handle large numbers of workflows and warn users to clean them up #2190

Closed
jimlindeman opened this issue Feb 7, 2020 · 11 comments

@jimlindeman commented Feb 7, 2020

Checklist:

  • [x] I've included the version.
  • [x] I've included reproduction steps.
  • [ ] I've included the workflow YAML.
  • [x] I've included the logs.

What happened:
On one of our K8s clusters, not all, we are getting repeated errors in the workflow controller of:

E0206 22:46:53.953386       1 reflector.go:126] github.com/argoproj/argo/workflow/controller/controller.go:156: Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request

which maps to https://github.com/argoproj/argo/blob/v2.4.3/workflow/controller/controller.go#L156

Turns out we had 13,450 workflows built up over the last year sitting around in the 'workflows' namespace, because there is still no automatic garbage collection in Argo. Running argo list -n workflows would actually crash the workflow-controller pod (causing it to go into an evicted state) after it returned:
macbook-pro-2:argo lindj@us.ibm.com$ argo list -n workflows
2020/02/06 18:43:48 rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2151705869 vs. 2147483647)

I'm assuming the logic inside the workflow controller is just trying to fully fetch all the workflows at once rather than in chunks or applying a maximum upper bound, which is causing the timeout and crashes.

What you expected to happen:
I expected the wfInformer processing to report in the controller logs that it was encountering too many workflows and to recommend manual pruning. Similarly, in the error path that returns "Failed to list *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request", I would expect it to report what URL/REST request it was issuing.

How to reproduce it (as minimally and precisely as possible):
Just need to create roughly 13,000+ workflows and leave them sitting in the 'workflows' namespace. The failure happens with both v2.3.0 and v2.4.3, just with a different error line number for v2.3.0, of course.

Anything else we need to know?:

Environment:

  • Argo version: v2.3.0 and v2.4.3
  • Kubernetes version:
clientVersion:
  buildDate: "2019-12-07T21:20:10Z"
  compiler: gc
  gitCommit: 70132b0f130acc0bed193d9ba59dd186f0e634cf
  gitTreeState: clean
  gitVersion: v1.17.0
  goVersion: go1.13.4
  major: "1"
  minor: "17"
  platform: darwin/amd64
serverVersion:
  buildDate: "2020-01-16T04:08:27Z"
  compiler: gc
  gitCommit: 18e8565daf60eb3a20c0ac29a7d3a93622659e4d
  gitTreeState: clean
  gitVersion: v1.14.10+IKS
  goVersion: go1.12.12
  major: "1"
  minor: "14"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
argo get <workflowname>
  • executor logs:
kubectl logs <failedpodname> -c init
kubectl logs <failedpodname> -c wait
  • workflow-controller logs:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)


Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@alexec (Contributor) commented Feb 7, 2020

Interesting problem.

One solution, obviously, is to delete old workflows. We could always increase timeouts or change code to deal with this - but then we get to 20k workflows and the problem re-appears.

Any solution therefore must involve deleting old workflows.

  • What is the use case for 10k+ workflows in your system?
  • Have you considered trying out the workflow archive feature in v2.5?
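
For reference, the workflow archive is configured through the workflow-controller-configmap. A minimal sketch, assuming a PostgreSQL backend; the host, database, and secret names below are placeholders to adapt to your environment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    persistence:
      archive: true                    # record completed workflows in the relational archive
      postgresql:
        host: postgres                 # placeholder
        port: 5432
        database: postgres             # placeholder
        tableName: argo_workflows
        userNameSecret:
          name: argo-postgres-config   # placeholder secret
          key: username
        passwordSecret:
          name: argo-postgres-config   # placeholder secret
          key: password

With archiving enabled, completed workflows are written to the database, so they can then be garbage-collected from the cluster (for example via a TTL) without losing their history.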

@jimlindeman (Author)

I don't think 10k+ workflows is healthy; it's just what happens when the person who sets up Argo for pipeline automation moves to another team/project before configuring cleanup.

Our complaint is more about the way it breaks when it gets there, as the log messages don't tell the user that they need to delete workflows. It took us several hours to figure out this was the problem, and we would have expected "argo list" to fail gracefully and warn the user that there were too many workflows for it to work properly.

@thundergolfer commented Feb 20, 2020

We ran into this issue too. We now must have ttlSecondsAfterFinished: X on every workflow we run, otherwise the workflow objects build up and you eventually can't get info out of the argo CLI.

Even worse, because we weren't cleaning up our workflow objects, any PersistentVolumes associated with them weren't getting deleted in AWS. At one point we had ~1000 EBS volumes sitting around doing nothing, their workflows having terminated long ago 😳.
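
A minimal sketch of what that looks like on a workflow spec (the template name and image are placeholders; newer Argo versions replace this field with spec.ttlStrategy):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cleanup-example-
spec:
  entrypoint: main
  ttlSecondsAfterFinished: 86400   # delete this Workflow object 24 hours after it finishes
  templates:
    - name: main
      container:
        image: alpine:3.11         # placeholder image
        command: [echo, hello]

The exact template doesn't matter; the point is that without a TTL (or manual deletion), completed Workflow objects accumulate indefinitely.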

@ghost commented May 22, 2020

Hi,
+1 this.

I am experiencing an issue where the workflow-controller eats up too much memory (up to 5GB), reaching the node's full capacity and causing it to be evicted in a loop.

I am running Argo as part of a Kubeflow deployment, and the number of pipelines running is not large (tens, not thousands).

@alexec (Contributor) commented May 22, 2020

Would a warning in the user interface be useful?

@ghost commented May 22, 2020

Hi alexec, not sure. Is there a way to estimate the required memory usage per KF pipeline/Argo workflow?

@alexec (Contributor) commented May 22, 2020

@ghost commented May 22, 2020

I might have latched onto the wrong issue, but my issue happened while initiating 10 workflows together, each one containing about a hundred pods or so.

@alexec (Contributor) commented May 22, 2020

Have you enabled podGC on your workflows?

@ghost commented May 22, 2020

If by that you mean garbage collecting finished pods, yes.

There were no pods up when this happened; I think I can reproduce this easily.
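
For context, podGC is set on the workflow spec; a minimal sketch with a placeholder template (strategy names per the Argo docs):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: podgc-example-
spec:
  entrypoint: main
  podGC:
    strategy: OnPodCompletion      # delete each pod as soon as it completes
  templates:
    - name: main
      container:
        image: alpine:3.11         # placeholder image
        command: [echo, hello]

Note that podGC only removes pods; the Workflow objects themselves still need ttlSecondsAfterFinished or manual deletion, which may explain memory growth even when no pods are up.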

@alexec (Contributor) commented Jun 30, 2020

Fixed in #3089.

alexec closed this as completed on Jun 30, 2020