@dirrao
Contributor

@dirrao dirrao commented Aug 2, 2024

Problem: Airflow runs the cleanup_stuck_queued_tasks function periodically. On a large Kubernetes cluster (more than 5K pods), this becomes a bottleneck. Internally, the function loops through each queued task that has exceeded the task queued timeout and checks whether the corresponding worker pod exists in the cluster. Currently, this existence check uses the Kubernetes list pods API, and each call takes more than 1s. With 120 queued tasks, the checks take about 120 seconds (1s * 120). As a result, the scheduler spends most of its time in this function rather than scheduling tasks, so scheduling either degrades or stops entirely.

Solution: Use a single batched Kubernetes list pods API call to fetch all worker pods owned by the scheduler. Build a set of searchable strings from the pod annotations, then use that set to determine whether the pod associated with each task exists. This significantly reduces the number of Kubernetes API server calls.

String format of the set elements:
dag_id=<dag_id>,task_id=<task_id>[,map_index=<map_index>][,run_id=<run_id>]
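A minimal sketch of the set-based lookup, for illustration only and not the code in this PR. It assumes each worker pod's annotations are available as a plain dict with `dag_id`, `task_id`, `run_id`, and `map_index` keys, and that `map_index` and `run_id` are appended only when present. The helper names `search_key` and `keys_from_annotations` are hypothetical.

```python
from typing import Iterable, Mapping, Optional, Set


def search_key(dag_id: str, task_id: str, run_id: Optional[str] = None, map_index: int = -1) -> str:
    """Build the lookup string for one task instance (hypothetical helper)."""
    key = f"dag_id={dag_id},task_id={task_id}"
    if map_index >= 0:
        # Only mapped tasks carry a map_index component.
        key += f",map_index={map_index}"
    if run_id is not None:
        key += f",run_id={run_id}"
    return key


def keys_from_annotations(pods: Iterable[Mapping[str, str]]) -> Set[str]:
    """Turn the annotations of every listed pod into a set of lookup strings."""
    keys: Set[str] = set()
    for annotations in pods:
        keys.add(
            search_key(
                annotations["dag_id"],
                annotations["task_id"],
                annotations.get("run_id"),
                # Annotation values are strings, so convert before comparing.
                int(annotations.get("map_index", "-1")),
            )
        )
    return keys


# Assumed result of ONE list-pods call, reduced to the annotations of each pod.
listed_pods = [
    {"dag_id": "etl", "task_id": "load", "run_id": "r1", "map_index": "-1"},
    {"dag_id": "etl", "task_id": "fanout", "run_id": "r1", "map_index": "3"},
]
existing = keys_from_annotations(listed_pods)

# Each stuck task is now a constant-time set lookup instead of an API call.
pod_exists = search_key("etl", "fanout", "r1", 3) in existing
pod_missing = search_key("etl", "fanout", "r1", 4) not in existing
```

With this shape, the cost of checking N queued tasks is one list call plus N set lookups, rather than N list calls.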

Note: Switch the worker pod and task comparison from labels to annotations to avoid extra processing (make_safe_label_value) and ensure more accurate comparisons, as annotations have no value restrictions.
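To illustrate why label comparison needs the extra sanitizing step, here is a small sketch that checks a value against the Kubernetes label value syntax (at most 63 characters; alphanumerics, `-`, `_`, and `.`; must start and end with an alphanumeric). The helper `is_valid_label_value` is hypothetical and is not part of Airflow; it only shows that a typical run_id cannot be stored verbatim as a label value.

```python
import re

# Kubernetes label value syntax: empty, or alphanumeric at both ends
# with '-', '_', '.' and alphanumerics in between.
_LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Return True if value could be used unchanged as a label value (hypothetical helper)."""
    return len(value) <= 63 and _LABEL_VALUE_RE.fullmatch(value) is not None


# A run_id containing ':' and '+' is not a legal label value, so a label-based
# comparison has to transform it first; an annotation can hold it as-is.
run_id = "scheduled__2024-08-02T00:00:00+00:00"
run_id_fits_in_label = is_valid_label_value(run_id)
task_id_fits_in_label = is_valid_label_value("load_data")
```

Comparing against annotations avoids that transformation, so the string built from the task instance can be matched directly.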

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Aug 2, 2024
@dirrao dirrao requested a review from potiuk August 2, 2024 14:26
@eladkal eladkal requested a review from romsharon98 August 2, 2024 14:32
@dirrao dirrao force-pushed the k8s_cleanup_stuck_queued_tasks_optimization branch from 0a03529 to bef1e02 Compare August 3, 2024 06:44
@dirrao dirrao closed this Aug 3, 2024
@dirrao dirrao reopened this Aug 3, 2024
@dirrao dirrao force-pushed the k8s_cleanup_stuck_queued_tasks_optimization branch from b10bf25 to 4459e0f Compare August 4, 2024 05:22
@dirrao
Contributor Author

dirrao commented Aug 6, 2024

@jedcunningham / @hussein-awala
Can you review it whenever you are free?

@dirrao dirrao force-pushed the k8s_cleanup_stuck_queued_tasks_optimization branch from 691f142 to 5b8e059 Compare August 7, 2024 12:16
@dirrao
Contributor Author

dirrao commented Aug 10, 2024

@jedcunningham / @hussein-awala
Can you review it whenever you are free?

@dirrao dirrao requested a review from eladkal August 10, 2024 07:00
@dirrao
Contributor Author

dirrao commented Aug 15, 2024

@potiuk / @eladkal
Can someone review this MR?

@eladkal eladkal force-pushed the k8s_cleanup_stuck_queued_tasks_optimization branch from 4538ec4 to 0b1d4a5 Compare August 15, 2024 10:17
@dirrao dirrao requested a review from uranusjr August 16, 2024 09:36
@potiuk
Member

potiuk commented Aug 21, 2024

@dirrao I have very little knowledge of this area, but maybe look at the history of the relevant code and ping someone who actively worked on it before? That's a better way to find a good reviewer than putting it on my and @eladkal's shoulders.

@jedcunningham
Member

@dirrao can you add some details in the description? Just repeating the commit message/title isn't very useful, and having to go grok 100+ lines of change to know what the goal is isn't great for reviewing now nor next year when someone is doing git blame :)

e.g. things like what is done now, what you are doing instead, expected impact.

@dirrao
Contributor Author

dirrao commented Aug 23, 2024

> @dirrao can you add some details in the description? Just repeating the commit message/title isn't very useful, and having to go grok 100+ lines of change to know what the goal is isn't great for reviewing now nor next year when someone is doing git blame :)
>
> e.g. things like what is done now, what you are doing instead, expected impact.

Sorry for not including details about the problem. I have updated the PR description with them.

@dirrao dirrao requested a review from jedcunningham September 5, 2024 06:34
@dirrao dirrao self-assigned this Sep 17, 2024
@eladkal
Contributor

eladkal commented Oct 4, 2024

@dirrao can you rebase and resolve conflicts?

@dirrao
Contributor Author

dirrao commented Oct 4, 2024

> @dirrao can you rebase and resolve conflicts?

Done.

@potiuk potiuk merged commit e5a474b into apache:main Oct 7, 2024
kunaljubce pushed a commit to kunaljubce/airflow that referenced this pull request Oct 13, 2024
…1220)

* kubernetes executor cleanup_stuck_queued_tasks optimization

* kubernetes executor cleanup_stuck_queued_tasks optimization

* kubernetes executor cleanup_stuck_queued_tasks optimization

* kubernetes executor cleanup_stuck_queued_tasks optimization

* Updated comment

* Provider change log and version updated

* Update the worker pod and task comparison from labels to annotations
joaopamaral pushed a commit to joaopamaral/airflow that referenced this pull request Oct 21, 2024
harjeevanmaan pushed a commit to harjeevanmaan/airflow that referenced this pull request Oct 23, 2024
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024