Fix Scheduler restarting due to too many completed pods in cluster #40183

Merged: 5 commits merged on Jun 13, 2024

ephraimbuddy
Contributor

Currently, when a pod completes and is not deleted because of the user's configuration, the watcher keeps listing the pod and checking its status. We should instead stop watching a pod once it succeeds. To do that, pods are created with the executor done label set to False, and the label is changed to True when the pod completes. The watcher then watches only pods whose executor done label is False.
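
The labelling scheme described above can be sketched without a cluster. This is a minimal, hypothetical model and not the provider's actual code: the label key `executor_done`, the `Pod` dataclass, and the helper names are assumptions made for illustration, and `matches` implements only the equality form (`key=value`) of Kubernetes label selectors.

```python
from dataclasses import dataclass, field

# Hypothetical label key; the key used by the provider may differ.
EXECUTOR_DONE_KEY = "executor_done"


@dataclass
class Pod:
    """Tiny stand-in for a Kubernetes pod: just a name and its labels."""

    name: str
    labels: dict = field(default_factory=dict)


def new_pod(name: str) -> Pod:
    """Create a pod with the done label set to "False" so the watcher tracks it."""
    return Pod(name=name, labels={EXECUTOR_DONE_KEY: "False"})


def mark_done(pod: Pod) -> None:
    """Flip the done label to "True" once the pod has completed."""
    pod.labels[EXECUTOR_DONE_KEY] = "True"


def watch_selector() -> str:
    """Build an equality-based label selector for pods that are not done."""
    return f"{EXECUTOR_DONE_KEY}=False"


def matches(pod: Pod, selector: str) -> bool:
    """Evaluate a comma-separated list of key=value clauses against a pod."""
    for clause in selector.split(","):
        key, _, value = clause.partition("=")
        if pod.labels.get(key) != value:
            return False
    return True


pods = [new_pod("task-a"), new_pod("task-b")]
mark_done(pods[0])  # task-a completed but was not deleted

# Only task-b is still visible to the watcher.
watched = [p.name for p in pods if matches(p, watch_selector())]
```

In a real cluster the selector string would be passed to the API server, so the filtering happens server-side and completed pods are never returned to the watcher.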

closes: #22612

@ephraimbuddy
Contributor Author

This runs on every watcher loop, and the result can grow large when many completed pods are left undeleted.
[Screenshot 2024-06-11 at 16:38:20]
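
To make the growth concrete, here is an illustrative back-of-the-envelope model, not a measurement. It assumes one running pod per loop, that completed pods are never deleted, and that a fixed number of pods complete between loops; the function name and the numbers are invented for the example.

```python
def pods_listed_per_loop(completed_per_loop: int, loops: int, filter_done: bool) -> list:
    """Return how many pods each watcher loop would have to list.

    Assumes one running pod per loop and that completed pods stay in the cluster.
    """
    listed = []
    completed = 0
    for _ in range(loops):
        running = 1
        # Without filtering, every retained completed pod is listed again.
        listed.append(running if filter_done else running + completed)
        completed += completed_per_loop
    return listed


unfiltered = pods_listed_per_loop(completed_per_loop=100, loops=4, filter_done=False)
filtered = pods_listed_per_loop(completed_per_loop=100, loops=4, filter_done=True)
# unfiltered grows linearly: [1, 101, 201, 301]
# filtered stays flat:       [1, 1, 1, 1]
```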

@ephraimbuddy ephraimbuddy force-pushed the dont-watch-completed-pods branch 2 times, most recently from a1474d8 to d70cded, on June 12, 2024 09:51
@ephraimbuddy ephraimbuddy changed the title from "Don't watch completed pods in k8s executor" to "Fix Scheduler restarting due to too many completed pods in cluster" on Jun 12, 2024
ephraimbuddy and others added 5 commits June 13, 2024 17:20
Currently, when a pod completes and is not deleted due to the user's configuration,
the watcher keeps listing these pods and checking their status. We should instead stop
watching the pod once it succeeds. To do that, pods are created with the executor done
label set to False and changed to True when the pod completes. The watcher then watches
only pods whose executor done label is False.

closes: apache#22612
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
…r_utils.py

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
@ephraimbuddy ephraimbuddy merged commit 67798b2 into apache:main Jun 13, 2024
59 checks passed
@ephraimbuddy ephraimbuddy deleted the dont-watch-completed-pods branch June 13, 2024 20:00
jannisko pushed a commit to jannisko/airflow that referenced this pull request Jun 15, 2024
…pache#40183)

* Fix Scheduler restarting due to too many completed pods in cluster

Currently, when a pod completes and is not deleted due to the user's configuration,
the watcher keeps listing these pods and checking their status. We should instead stop
watching the pod once it succeeds. To do that, pods are created with the executor done
label set to False and changed to True when the pod completes. The watcher then watches
only pods whose executor done label is False.

closes: apache#22612

* Update airflow/providers/cncf/kubernetes/pod_generator.py

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* Add back removed section

* Don't add pod key label from get go

* Update airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
…pache#40183)
Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues
Development

Successfully merging this pull request may close these issues.

Schedular going down for 1-2 minute on every 10 minute as increase completed pods in EKS
5 participants