Skip to content

Conversation

@dominikhei
Copy link
Contributor


closes: #42401

I have introduced a cancel_job method to the EMRServerlessHook, which wraps the cancel_job_run method from boto3.

In cases of a non deferrable job run, if an Exception that waiter_max_attempts has been reached is thrown, cancel_job is executed. If deferrable is set to True, the cancellation logic is placed inside execute_complete, as this method evaluates the job state in this case.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Jun 18, 2025
@dominikhei dominikhei marked this pull request as ready for review June 18, 2025 12:55
Copy link
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is a very opinionated decision. I am wondering if this is not something the user should set by using on_failure_callback and not us to take this decision.

I'd like to hear more thoughts on that from others.

@dominikhei
Copy link
Contributor Author

dominikhei commented Jun 18, 2025

I feel like this is a very opinionated decision. I am wondering if this is not something the user should set by using on_failure_callback and not us to take this decision.

I'd like to hear more thoughts on that from others.

Apologies if there is an obvious answer, but is there a use case where you would want the job to not be cancelled in EMR if a new one is created due to retries, now running / pending concurrently?

@vincbeck
Copy link
Contributor

I feel like this is a very opinionated decision. I am wondering if this is not something the user should set by using on_failure_callback and not us to take this decision.
I'd like to hear more thoughts on that from others.

Apologies if this is an obvious question, but is there a use case where you would want the job to not be cancelled in EMR if a new one is created due to retries, now running / pending concurrently?

Hard to know all the different user use cases but I think you're correct, I do not see any, so I am probably wrong in my perception :)

@dominikhei
Copy link
Contributor Author

dominikhei commented Jun 18, 2025

I feel like this is a very opinionated decision. I am wondering if this is not something the user should set by using on_failure_callback and not us to take this decision.
I'd like to hear more thoughts on that from others.

Apologies if this is an obvious question, but is there a use case where you would want the job to not be cancelled in EMR if a new one is created due to retries, now running / pending concurrently?

Hard to know all the different user use cases but I think you're correct, I do not see any, so I am probably wrong in my perception :)

That’s true, there’s definetly a point in letting the user decide. As you said lets wait on other opinions :)

@dominikhei
Copy link
Contributor Author

@o-nikolas What is your take on this? Having thought about it again, this would change the standard behavior that also comes with other AWS operators (e. g. EMR cancels the job on failure, Glue doesn't), speaking more for using on_failure_callback.

@o-nikolas
Copy link
Contributor

@o-nikolas What is your take on this? Having thought about it again, this would change the standard behavior that also comes with other AWS operators (e. g. EMR cancels the job on failure, Glue doesn't), speaking more for using on_failure_callback.

It would be a change of standard behaviour and also be a breaking change (the behaviour that the user sees will be noticeably different), so if we went that route we'd need to do a deprecation process. We could argue that it's a bug fix (as you describe, no one would really want the default behaviour we have) and then that would allow us to not have to go through the deprecation process. Or as Vincent said, we could avoid all that and just document a way around this with callbacks.

I personally don't feel too strongly about it and would be okay with either of the three above.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 19, 2025
@github-actions github-actions bot closed this Aug 25, 2025
@vincbeck
Copy link
Contributor

Hey @dominikhei, I might have changed my mind and I think your changes make sense. I agree with you, there is more chances that a user wants their job to be cancelled if the timer times out than not. Are you still around and if so, would you be interested to continue working on this PR?

@dominikhei
Copy link
Contributor Author

@vincbeck Took some time off due to a new job but wanted to get back into contributing regularly so this might be a good start :)

@vincbeck
Copy link
Contributor

Please do :) I reopen this PR

@vincbeck vincbeck reopened this Nov 28, 2025
@vincbeck
Copy link
Contributor

But please rebase your PR so that it uses the latest up to date code

@vincbeck
Copy link
Contributor

vincbeck commented Dec 9, 2025

Are you still planning to work on it?

@dominikhei
Copy link
Contributor Author

Are you still planning to work on it?

Yes, however I can not start before this Sunday, should have mentioned that beforehand, sorry.
If this timeline is too slow, please feel free to reassign it.

@vincbeck
Copy link
Contributor

vincbeck commented Dec 9, 2025

No rush at all :) I was just checking :)

@dominikhei
Copy link
Contributor Author

@vincbeck @o-nikolas Would you still consider this a breaking change?

@vincbeck
Copy link
Contributor

I think we can consider it as bug fix

@vincbeck vincbeck merged commit cbaa369 into apache:main Jan 14, 2026
89 checks passed
jason810496 pushed a commit to jason810496/airflow that referenced this pull request Jan 22, 2026
…che#51883)

* Adjusted the EMRServerlessStartJobOperator to cancel submited jobs on failure

* Removed hook.cancel_job_run and adjusted the return value of EmrServerlessStartJobTrigger

* Added additional tests for the job cancellation behavior

* Fixed ruff formating errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EmrServerlessStartJobOperator does not cancel EMR Serverless job when waiter_max_attempts is reached

3 participants