
Customise PodGC time to delete in workflow-controller-configmap #10501

Closed
CosyOranges opened this issue Feb 10, 2023 · 2 comments
Labels
area/controller (Controller issues, panics) · area/gc (Garbage collection, such as TTLs, retentionPolicy, delays, and more) · solution/duplicate (This issue or PR is a duplicate of an existing one) · type/feature (Feature request)

Comments

@CosyOranges

Summary

In the workflow-controller-configmap it would be nice to add a time-based option to the PodGC strategy, e.g. a configurable delay for OnPodSuccess and the other strategies.

  • This seems like useful flexibility, giving users more say over the lifetime of completed workflow pods in their clusters (a sketch of the current configuration follows below).
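
For context, a minimal sketch of how PodGC can be set cluster-wide today, assuming the `workflowDefaults` key in the workflow-controller-configmap is used to supply spec-level defaults; the commented-out delay field is the hypothetical addition this issue is asking for:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodSuccess
        # hypothetical: a configurable delay before completed pods are deleted, e.g.
        # deleteDelayDuration: 30s
```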

Use Cases

When would you use this?

As background, we are currently running argo-workflows 3.3.8 as part of Kubeflow 1.6.1.
We regularly run extremely large swarms of pipelines in our clusters and have to choose between PodGC OnPodSuccess and leaving cleanup entirely to the Workflow TTL.

  • For the most part this has been fine, but there is an interesting interaction with some other deployments that we maintain, like the cache-server that is part of Kubeflow v1.6.1:
    • If we run OnPodSuccess, pods are cleaned up too quickly to be entered into our cache database.
    • We therefore resorted to using the workflow TTL config instead. This works fine until we run large swarms of pipelines, which leave ~12k pods hanging around on the cluster in a completed state and cause serious delays in the control plane.
    • Being able to have more fine-grained control over PodGC would be a huge benefit to us (see the sketch after this list).

I'd be happy to try to come up with a potential implementation for this (if it's something you would want to see in mainstream argo-workflows), but I would probably need guidance 😅


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@CosyOranges CosyOranges added the type/feature Feature request label Feb 10, 2023
@saraangelmurphy

This is also a huge issue for my organization. We have very short-lived pods, and running with OnPodCompletion results in deleting pods too quickly for our logging agent to query the Kubernetes API server and enrich log data with Kubernetes metadata.

However, we run into trouble with IP space exhaustion when using OnWorkflowCompletion, because deletion of a given pod is then deferred for 2-10 minutes, which is too long, and we have the same issue with large batch workloads that spawn many templates at once. Even a delay of 5s after pod completion would be sufficient here, but ideally it would be a customizable value, similar to the workflow TTLStrategy.

@agilgur5 agilgur5 added area/controller Controller issues, panics area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more labels Oct 15, 2024
@agilgur5
Member

It looks like this was a duplicate of #8539 which was actually implemented in #6168 and documented in #11297.

Workflow-level deleteDelayDuration was also added in #11325
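
A minimal sketch of what the workflow-level field looks like in a Workflow spec, assuming the `deleteDelayDuration` field named above (the value is illustrative):

```yaml
spec:
  podGC:
    strategy: OnPodSuccess
    # delay deletion of each completed pod, giving log/cache agents time to catch up
    deleteDelayDuration: 30s
```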

@agilgur5 agilgur5 added the solution/duplicate This issue or PR is a duplicate of an existing one label Oct 15, 2024