@SameerMesiah97
Contributor

@SameerMesiah97 SameerMesiah97 commented Jan 24, 2026

Description

Added best-effort cleanup to EmrCreateJobFlowOperator to terminate EMR clusters when failures occur after successful cluster creation. Cleanup behavior is guarded by a flag and is enabled by default.

In certain failure modes, the operator could previously create a cluster via create_job_flow and then fail during later execution steps (for example, while waiting for completion when DescribeCluster permissions are missing). In these cases, the task failed while leaving the cluster running. The operator now attempts to terminate the created job flow if an exception is raised after creation. Cleanup is best-effort and does not override or mask the original exception.
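The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: `FakeEmrClient`, `wait_for_cluster`, and `create_job_flow_with_cleanup` are hypothetical names, and the sketch catches any exception for brevity, whereas this PR restricts cleanup to post-creation EMR failures such as waiter errors.

```python
import logging

log = logging.getLogger(__name__)


class FakeEmrClient:
    """Hypothetical stand-in for an EMR client; it only records terminate calls."""

    def __init__(self, fail_terminate=False):
        self.terminated = []
        self.fail_terminate = fail_terminate

    def run_job_flow(self, **config):
        # Pretend the cluster was created successfully.
        return {"JobFlowId": "j-EXAMPLE"}

    def terminate_job_flows(self, JobFlowIds):
        if self.fail_terminate:
            raise RuntimeError("terminate denied")
        self.terminated.extend(JobFlowIds)


def wait_for_cluster(client, job_flow_id):
    """Simulate a post-creation failure, e.g. missing DescribeCluster permissions."""
    raise PermissionError("DescribeCluster not permitted")


def create_job_flow_with_cleanup(client, config, terminate_job_flow_on_failure=True):
    # Creation happens outside the try block, so cleanup is attempted
    # only once a job flow ID actually exists.
    job_flow_id = client.run_job_flow(**config)["JobFlowId"]
    try:
        wait_for_cluster(client, job_flow_id)
    except Exception:
        if terminate_job_flow_on_failure:
            try:
                client.terminate_job_flows(JobFlowIds=[job_flow_id])
            except Exception:
                # Cleanup errors are logged and swallowed.
                log.exception("Best-effort termination of %s failed", job_flow_id)
        # A bare raise re-raises the original exception unchanged.
        raise
    return job_flow_id
```

Keeping the creation call outside the `try` block is what limits cleanup to the post-creation window, and the bare `raise` is what keeps the original failure visible even when termination itself fails.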

This change applies a failure-handling approach similar to the one recently introduced for EC2CreateInstanceOperator in PR #60904. However, cleanup is triggered only for post-start EMR job flow failures (including waiter-related errors). This ensures that termination is attempted only when a job flow was successfully created, and it avoids intercepting non-AWS exceptions.

Rationale

EmrCreateJobFlowOperator is responsible for provisioning and coordinating an external, stateful service whose lifecycle extends beyond task execution. If the task fails after cluster creation, Airflow can no longer reliably manage or observe the cluster’s state. Adding opportunistic cleanup in these scenarios reduces the risk of orphaned EMR clusters and unexpected infrastructure costs, while preserving existing failure semantics. Cleanup errors are logged and do not affect the task’s final failure state.

Restricting cleanup to post-creation EMR job flow failures prevents unintended termination in unrelated failure paths while still addressing orphaned job flows created during execution.

Tests

  • Added a unit test covering failure after cluster creation and verifying that termination is attempted.
  • Added a unit test ensuring cleanup failures do not mask the original exception.

Documentation

The docstring for EmrCreateJobFlowOperator has been updated with a brief description of the new flag terminate_job_flow_on_failure.

Backwards Compatibility

A new flag, terminate_job_flow_on_failure, has been added to EmrCreateJobFlowOperator and defaults to True. As a result, existing tasks will now attempt best-effort cleanup if a WaiterError is encountered after the job flow is created. Setting the flag to False disables the cleanup.

Reproducibility

The failure scenario could not be reproduced directly due to personal AWS account permissions. However, based on the current control flow of EmrCreateJobFlowOperator, it is possible for cluster creation to succeed while a later step fails, leaving the EMR cluster running without cleanup. This change defensively addresses that case. Contributors reading this PR are free to provide a reproduction for the aforementioned failure mode if they can.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Jan 24, 2026
@SameerMesiah97
Contributor Author

@vincbeck

No pressure to review. But tagging you here as it follows the same theme as the PR for EC2 (#60904), which you merged.

@SameerMesiah97
Contributor Author

@eladkal

This follows the same theme as #61051

Member

@uranusjr uranusjr left a comment

One nit

Attempt best-effort termination of EMR clusters when failures occur after
successful job flow creation. Cleanup does not mask the original exception.