Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Kubernetes Executor & Scheduler Resilient to error during PMH execution #27611

Merged
merged 4 commits into from
Nov 11, 2022

Conversation

XD-DENG
Copy link
Member

@XD-DENG XD-DENG commented Nov 11, 2022

Currently there is no try-catch logic when pod_mutation_hook is called in the Kubernetes Executor. So if there is any error/exception during the execution of the pod_mutation_hook user created, the whole executor/scheduler will crash.

Such error may happen:

  • user who authored the pod_mutation_hook made an error inside.
  • if there is any code inside the pod_mutation_hook encounters transient error (e.g. it queries DB and the DB gives a timeout error).
  • ... you name it. It can simply go wrong.

This PR aims to make Kubernetes Executor & Scheduler more resilient to such error during PMH execution (current behavior, directly crashing, is a bit scary)


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:Scheduler including HA (high availability) scheduler labels Nov 11, 2022
@XD-DENG XD-DENG requested review from uranusjr, ashb and kaxil November 11, 2022 00:01
@XD-DENG XD-DENG force-pushed the resilient-scheduler-to-pmh-error branch from f80ed32 to e46a992 Compare November 11, 2022 00:23
@XD-DENG XD-DENG merged commit 16b1001 into apache:main Nov 11, 2022
@XD-DENG XD-DENG modified the milestones: Airflow 2.4.3, Airflow 2.4.4 Nov 11, 2022
Adityamalik123 pushed a commit to Adityamalik123/airflow that referenced this pull request Nov 12, 2022
…ecution (apache#27611)

Earlier there is no try-catch logic when pod_mutation_hook is called in the Kubernetes Executor. So if there is any error/exception during the execution of the pod_mutation_hook user created, the whole executor/scheduler will crash.

This change makes Kubernetes Executor & Scheduler more resilient to such errors during PMH execution
@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Nov 16, 2022
@ephraimbuddy ephraimbuddy modified the milestones: Airflow 2.4.4, Airflow 2.5.0 Nov 16, 2022
try:
pod_mutation_hook(pod)
except Exception as e:
raise PodMutationHookException(e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should be raise PodMutationHookException() from e ? Otherwise it will be logged as an exception happening while handling another exception.

Copy link
Member Author

@XD-DENG XD-DENG Dec 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hterik , thanks for the feedback!

I tested the two different ways below. I find the difference is minor, and I personally prefer the current way where the detailed exception reason appears in the same line as the PodMutationHookException, so it's clearer to the log reader (this is also the original reason for what I wrote it this way). WDYT?

Case -1

from airflow.exceptions import PodMutationHookException

try:
    int("a")
except Exception as e:
    raise PodMutationHookException(e)

We get

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError: invalid literal for int() with base 10: 'a'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
airflow.exceptions.PodMutationHookException: invalid literal for int() with base 10: 'a'

Case -2

If we use the way you suggested

from airflow.exceptions import PodMutationHookException

try:
    int("a")
except Exception as e:
    raise PodMutationHookException() from e

We get

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError: invalid literal for int() with base 10: 'a'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
airflow.exceptions.PodMutationHookException

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better indeed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, maybe raise PodMutationHookException(e) from e in that case ;) That will have best of both worlds. But yeah, i think the difference is minor and i don't think it has any logical difference other than the log message in between the two tracebacks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler provider:cncf-kubernetes Kubernetes provider related issues type:improvement Changelog: Improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants