Catch arbitrary exception from run_job to prevent zombie scheduler #32707
Conversation
Force-pushed from 0944f08 to c5f860f
This needs a test case.

Thank you for the feedback. I have added a test case that mocked …
Force-pushed from b3d9fd9 to b5554b5
Suggested change:

```diff
- log = logging.getLogger(__name__)
+ logger = logging.getLogger(__name__)
```

Maybe this? Usually I see it called `logger`.
Could you please point me to where the logger is named `logger`? The modules I've touched so far all name it `log` (example).
Yes. The example you shared uses `self.log`, not `log`. I believe that comes from extending the `LoggingMixin` class, which has the `log` property.
What I meant is that, in general, when using `logging.getLogger(__name__)`, Python modules call it `logger`; that is what I have seen across projects so far. One example in our project is here:

```python
logger = logging.getLogger(__name__)
```

But I have another suggestion: since this is a CLI command, maybe we could use the `AirflowConsole` to print the message, e.g. https://github.com/apache/airflow/blob/main/airflow/cli/commands/connection_command.py#L360
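For illustration, a minimal sketch of that idea. It assumes `AirflowConsole` lives in `airflow.cli.simple_table` and behaves like a rich `Console`; the message text is arbitrary and this is not the PR's actual code:

```python
from airflow.cli.simple_table import AirflowConsole

# Assumption: AirflowConsole extends rich's Console, so a plain print() call works.
console = AirflowConsole()
console.print("Scheduler job failed; check the scheduler logs for details.")
```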
Naming the variable one way or another seems ultimately trivial, so I am perfectly okay with renaming it to `logger`.
On the other hand, I am not sure about using `AirflowConsole` for logging purposes. The code changes live under the `cli` module, but they are ultimately meant to catch exceptions raised by the inner logic of the scheduler, so I think they should go through a normal logger.
Another situation where a logger seems more appropriate is when the scheduler runs as a daemon (`airflow scheduler -D`), in which case printing via `AirflowConsole` would not record the exception.
Naming on this is unfortunately pretty inconsistent in Airflow. Personally I much prefer logger but 🤷
Should we also log what the exception message was?
https://docs.python.org/3/library/logging.html#logging.exception
Calling log.exception will automatically include the exception information, so there is no need to include it separately.
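A small self-contained illustration of that point (the logger name and message are arbitrary):

```python
import logging

log = logging.getLogger(__name__)

try:
    raise RuntimeError("scheduler loop crashed")
except Exception:
    # Logger.exception logs at ERROR level and appends the current traceback,
    # so the exception details do not need to be passed explicitly.
    log.exception("Exception when running the scheduler job")
```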
Yes, sorry, I missed the logging level.
Force-pushed from ebf06f3 to c927469
Force-pushed from ff7b2d5 to 2883da3
Force-pushed from 991df0c to 6897d23
This PR attempts to address issue #32706.
When `run_job` raises an unhandled exception, the scheduler process becomes a zombie. Placing `run_job` inside a try/except block allows `serve_logs` and `health_check` to properly terminate their sub-processes.
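A minimal self-contained sketch of the idea, assuming `run_job` may raise any exception. The stub functions and the sub-process handling here are simplified stand-ins for the real `run_job` and `serve_logs`/`health_check` helpers, not the actual Airflow code:

```python
import logging
import multiprocessing
import time

log = logging.getLogger(__name__)


def run_job() -> None:
    # Stand-in for the real run_job call; assume it can raise anything.
    raise RuntimeError("scheduler loop crashed")


def _serve_logs_stub() -> None:
    # Stand-in for the serve_logs / health_check helper sub-processes.
    while True:
        time.sleep(1)


def run_scheduler() -> None:
    serve_logs = multiprocessing.Process(target=_serve_logs_stub)
    serve_logs.start()
    try:
        run_job()
    except Exception:
        # Log the traceback instead of letting the exception escape while the
        # helper sub-process keeps the parent hanging around as a zombie.
        log.exception("Exception when executing the scheduler job")
    finally:
        # Always terminate the helper sub-process so the command can exit.
        serve_logs.terminate()
        serve_logs.join()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run_scheduler()
```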