Fix Scheduler restarting due to too many completed pods in cluster #40183

ephraimbuddy · 2024-06-11T21:15:50Z

Currently, when a pod completes and is not deleted because of the user's configuration, the watcher keeps listing it and checking its status. Instead, we should stop watching a pod once it succeeds. To do that, pods are created with the executor done label set to False, and the label is changed to True when the pod completes. The watcher then watches only pods whose executor done label is False.

closes: #22612
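
For readers skimming the thread, here is a minimal sketch of the labeling idea described above. It is not the code from this PR. The label key `airflow_executor_done`, the base label `airflow-worker`, and all function names are assumptions chosen for illustration, and `matches` is a tiny stand-in for the API server's label-selector evaluation. The sketch uses a `!=True` selector because a Kubernetes inequality selector also matches pods that lack the label entirely, so pods created before the label existed would still be watched.

```python
from typing import Dict

# Assumed label key, for illustration only.
EXECUTOR_DONE_KEY = "airflow_executor_done"


def new_pod_labels(base: Dict[str, str]) -> Dict[str, str]:
    """Labels for a freshly created pod: the executor is not done with it yet."""
    return {**base, EXECUTOR_DONE_KEY: "False"}


def done_patch() -> dict:
    """Patch body that flips the label once the pod has completed."""
    return {"metadata": {"labels": {EXECUTOR_DONE_KEY: "True"}}}


def watch_selector(base_selector: str) -> str:
    """Label selector that excludes pods already marked done."""
    return f"{base_selector},{EXECUTOR_DONE_KEY}!=True"


def matches(labels: Dict[str, str], selector: str) -> bool:
    """Evaluate comma-separated 'k=v' and 'k!=v' terms (stand-in for the API server)."""
    for term in selector.split(","):
        if "!=" in term:
            key, value = term.split("!=", 1)
            if labels.get(key) == value:
                return False
        else:
            key, value = term.split("=", 1)
            if labels.get(key) != value:
                return False
    return True


# Hypothetical pods: one still running, one completed, one without the label.
selector = watch_selector("airflow-worker=1")
running = new_pod_labels({"airflow-worker": "1"})
finished = {**running, **done_patch()["metadata"]["labels"]}
legacy = {"airflow-worker": "1"}

pods = {"running": running, "finished": finished, "legacy": legacy}
watched = [name for name, labels in pods.items() if matches(labels, selector)]
print(watched)  # ['running', 'legacy']
```

With a selector like this, completed pods that are kept around no longer appear in what the watcher lists, so the amount of work per watcher loop stops growing with the number of retained pods.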

tests/providers/cncf/kubernetes/executors/test_kubernetes_executor.py

airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py

ephraimbuddy · 2024-06-11T21:38:20Z

This runs on every watcher loop, and the result can grow large when many completed pods are left undeleted.

…pache#40183)

* Fix Scheduler restarting due to too many completed pods in cluster

  Currently, when a pod completes and is not deleted due to the user's configuration, the watcher keeps listing these pods and checking their status. We should instead stop watching the pod once it succeeds. To do that, pods are created with the executor done label set to False and changed to True when the pod completes. The watcher then watches only those pods that the pod executor done label is False

  closes: apache#22612

* Update airflow/providers/cncf/kubernetes/pod_generator.py

  Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* Add back removed section

* Don't add pod key label from get go

* Update airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py

  Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

ephraimbuddy requested review from jedcunningham and hussein-awala as code owners June 11, 2024 21:15

boring-cyborg bot added the area:providers and provider:cncf-kubernetes labels Jun 11, 2024

ephraimbuddy commented Jun 11, 2024

tests/providers/cncf/kubernetes/executors/test_kubernetes_executor.py (resolved)

ephraimbuddy commented Jun 11, 2024

airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py (resolved)

ephraimbuddy force-pushed the dont-watch-completed-pods branch from 860bcb5 to 625170d Compare June 11, 2024 21:36

ephraimbuddy force-pushed the dont-watch-completed-pods branch 2 times, most recently from a1474d8 to d70cded Compare June 12, 2024 09:51

ephraimbuddy changed the title ~~Don't watch completed pods in k8s executor~~ Fix Scheduler restarting due to too many completed pods in cluster Jun 12, 2024

jedcunningham reviewed Jun 12, 2024

airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py (outdated, resolved)

airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py (resolved)

jedcunningham reviewed Jun 12, 2024

airflow/providers/cncf/kubernetes/pod_generator.py (outdated, resolved)

airflow/providers/cncf/kubernetes/pod_generator.py (outdated, resolved)

ephraimbuddy mentioned this pull request Jun 13, 2024

Tasks are in queued state for a longer time and executor slots are exhausted often #38968

Open

2 tasks

ephraimbuddy force-pushed the dont-watch-completed-pods branch from 7c32d7f to 87c4ba1 Compare June 13, 2024 09:59

eladkal requested review from romsharon98 and amoghrajesh June 13, 2024 11:42

eladkal approved these changes Jun 13, 2024

romsharon98 approved these changes Jun 13, 2024

amoghrajesh approved these changes Jun 13, 2024

jedcunningham approved these changes Jun 13, 2024

airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py (outdated, resolved)

ephraimbuddy and others added 5 commits June 13, 2024 17:20

Update airflow/providers/cncf/kubernetes/pod_generator.py

c449ca5

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

Add back removed section

258d4dc

Don't add pod key label from get go

a791c85

Update airflow/providers/cncf/kubernetes/executors/kubernetes_executo…

171a655

…r_utils.py Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

ephraimbuddy force-pushed the dont-watch-completed-pods branch from 37ac60e to 171a655 Compare June 13, 2024 16:20

ephraimbuddy merged commit 67798b2 into apache:main Jun 13, 2024
59 checks passed

ephraimbuddy deleted the dont-watch-completed-pods branch June 13, 2024 20:00

eladkal mentioned this pull request Jun 22, 2024

Status of testing Providers that were prepared on June 22, 2024 #40382

Closed

96 tasks
