-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Avoid resetting adopted task instances when retrying for kubernetes executor #39406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid resetting adopted task instances when retrying for kubernetes executor #39406
Conversation
|
Tested on my environment, confirmed that in the second attempt, the task instances whose pods have been adopted are not flushed anymore |
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py
Outdated
Show resolved
Hide resolved
|
@jedcunningham @hussein-awala |
|
@jedcunningham @hussein-awala @potiuk @eladkal |
dc5da55 to
862d30e
Compare
eladkal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I would prefer someone else to take a look on this
romsharon98
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py
Outdated
Show resolved
Hide resolved
Lee-W
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible for us to add a test to it?
|
Thank you all for your reviews! |
|
@tanvn is it possible to cover this fix with unit tests to avoid regression? |
@eladkal Let me take a look! |
…xecutor (apache#39406) * Avoid resetting adopted task instances when retrying * Stop using f-string when logging * Address comment * Remove return type of generator func * Add unit test * Add comment and fix linter error
Closes: #39088
As described in #39088 (comment),
when an OperationalError happens (in my case, it was MySQLdb.OperationalError),
the method
executor.try_adopt_task_instances(tis_to_adopt_or_reset)is retried,and in the first attempt, all running pods have been adopted successfully, so
adopt_launched_taskwill not be called even once which leads to a situation thattis_to_flush_by_keyin the second attempt contains TIs that are already adopted.The TIs then are flushed unnecessarily, and could be annoying as some heavy tasks get terminated and retried.
Notes
Please note that this PR only fixes the issue happening with kubernetes executor.
(It might happen with other executors as well but I am only using kubernetes executor at the present and not familiar with the others)
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in newsfragments.