
Customise PodGC time to delete in workflow-controller-configmap #10501

Closed
CosyOranges opened this issue Feb 10, 2023 · 2 comments
Labels
area/controller (Controller issues, panics) · area/gc (Garbage collection, such as TTLs, retentionPolicy, delays, and more) · solution/duplicate (This issue or PR is a duplicate of an existing one) · type/feature (Feature request)

Comments

@CosyOranges

Summary

In the workflow-controller-configmap it would be nice to add a time-based option to the PodGC strategy, e.g. a configurable delay for OnPodSuccess and the other strategies.

  • This seems like useful flexibility, giving users more say over the lifetime of completed workflow pods in their clusters (a sketch of the current configuration follows below).
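
For context, a minimal sketch of how PodGC can be set cluster-wide today, assuming the `workflowDefaults` key in the workflow-controller-configmap is used to supply spec-level defaults; the commented-out delay field is the hypothetical addition this issue is asking for:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodSuccess
        # hypothetical: a configurable delay before completed pods are deleted, e.g.
        # deleteDelayDuration: 30s
```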

Use Cases

When would you use this?

As background, we are currently running argo-workflows 3.3.8 as part of Kubeflow 1.6.1.
We regularly run extremely large swarms of pipelines in our clusters and have to choose between PodGC OnPodSuccess and leaving cleanup entirely to the Workflow TTL.

  • For the most part this has been fine, but there is an interesting interaction with some other deployments that we maintain, like the cache-server that is part of Kubeflow v1.6.1:
    • If we run OnPodSuccess, pods are cleaned up too quickly to be entered into our cache database.
    • We therefore resorted to using the workflow TTL config instead. This works fine until we run large swarms of pipelines, which leave ~12k pods hanging around on the cluster in a completed state and cause serious delays in the control plane.
    • Being able to have more fine-grained control over PodGC would be a huge benefit to us (see the sketch after this list).

I'd be happy to try to come up with a potential implementation for this (if it's something you would want to see in mainstream argo-workflows), but I would probably need guidance 😅


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@CosyOranges CosyOranges added the type/feature Feature request label Feb 10, 2023
@saraangelmurphy

This is also a huge issue for my organization. We have very short-lived pods, and running with OnPodCompletion results in deleting pods too quickly for our logging agent to query the Kubernetes API server and enrich log data with Kubernetes metadata.

However, we run into trouble with IP space exhaustion when using OnWorkflowCompletion, because deletion of a given pod is then deferred for 2-10 minutes, which is too long, and we have the same issue with large batch workloads that spawn many templates at once. Even a delay of 5s after pod completion would be sufficient here, but ideally it would be a customizable value, similar to the workflow TTLStrategy.

@agilgur5 agilgur5 added area/controller Controller issues, panics area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more labels Oct 15, 2024
@agilgur5
Member

It looks like this was a duplicate of #8539 which was actually implemented in #6168 and documented in #11297.

Workflow-level deleteDelayDuration was also added in #11325
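
A minimal sketch of what the workflow-level field looks like in a Workflow spec, assuming the `deleteDelayDuration` field named above (the value is illustrative):

```yaml
spec:
  podGC:
    strategy: OnPodSuccess
    # delay deletion of each completed pod, giving log/cache agents time to catch up
    deleteDelayDuration: 30s
```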

@agilgur5 agilgur5 added the solution/duplicate This issue or PR is a duplicate of an existing one label Oct 15, 2024