
Handle CrawlerRunningException gracefully in GlueCrawlerOperator #62016

Merged
vincbeck merged 4 commits into apache:main from bahram-cdt:fix/handle-crawler-running-exception
Feb 17, 2026

Conversation

@bahram-cdt
Contributor

@bahram-cdt bahram-cdt commented Feb 16, 2026

What

Handle CrawlerRunningException in GlueCrawlerOperator.execute() instead of letting it fail the Airflow task.

Why

When start_crawler() or update_crawler() is called while the crawler is already running (e.g., from a retry, overlapping DAG run, or boto3 internal retry after a timeout), the AWS Glue API raises CrawlerRunningException. Currently this propagates as an unhandled ClientError, causing the Airflow task to fail even though the crawler run completes successfully.

This is a common issue in production: the Glue console shows the crawler succeeded, but Airflow marks the task as failed and triggers alerts.

What Changed

providers/amazon/src/airflow/providers/amazon/aws/operators/glue_crawler.py

  • Wrapped update_crawler() with try/except: catches CrawlerRunningException and logs a warning (skips the update since the crawler is busy).
  • Wrapped start_crawler() with try/except: catches CrawlerRunningException and logs a warning (waits for the existing run instead of failing).
  • All other ClientError codes are re-raised as before.
  • Added from botocore.exceptions import ClientError import.

providers/amazon/tests/unit/amazon/aws/operators/test_glue_crawler.py

  • test_execute_crawler_running_on_start: verifies CrawlerRunningException on start_crawler is caught and the operator waits for the existing run.
  • test_execute_crawler_running_on_update: verifies CrawlerRunningException on update_crawler is caught and start_crawler is still called.
  • test_execute_other_client_error_on_start_raises: verifies non-CrawlerRunningException errors on start_crawler propagate.
  • test_execute_other_client_error_on_update_raises: verifies non-CrawlerRunningException errors on update_crawler propagate.

How to Test

# Simulate CrawlerRunningException
from botocore.exceptions import ClientError
error = ClientError(
    error_response={"Error": {"Code": "CrawlerRunningException", "Message": "Already running"}},
    operation_name="StartCrawler",
)
# Previously: operator.execute() raises ClientError -> task fails
# Now: operator catches it, logs warning, waits for existing run -> task succeeds

@boring-cyborg

boring-cyborg bot commented Feb 16, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Feb 16, 2026
@bahram-cdt bahram-cdt force-pushed the fix/handle-crawler-running-exception branch from 24bf2b4 to 370736f Compare February 16, 2026 15:19
@eladkal eladkal requested a review from vincbeck February 16, 2026 15:20
@vincbeck
Contributor

Please fix static checks

@bahram-cdt bahram-cdt force-pushed the fix/handle-crawler-running-exception branch from 370736f to fb398f4 Compare February 16, 2026 17:09
Contributor

@vincbeck vincbeck left a comment


It kind of makes sense to receive this exception, no? You're trying to update or start a new job, but it fails to do so because there is already one running. What do you think?

@bahram-cdt
Contributor Author

It kind of makes sense to receive this exception, no? You're trying to update or start a new job, but it fails to do so because there is already one running. What do you think?

Good point! The exception is indeed telling us something meaningful — but I'd argue the correct response is to wait for the existing run, not to fail the task.

The most common cause in production is retry-induced race conditions: boto3's built-in retry fires start_crawler() a second time after a network timeout on the first (successful) call. The user didn't do anything wrong, and the crawler will complete successfully, but the Airflow task fails and triggers false alerts.

Since the operator already supports wait_for_completion, the natural behavior when a crawler is already running is to wait for it — the end state is identical to starting a fresh run and waiting.

For update_crawler, I agree the case is slightly weaker (we're skipping a config update), but the config rarely changes between runs, and the next successful run will pick it up. Failing the whole task seems disproportionate.

An alternative design: we could add a fail_on_already_running: bool = False parameter to make this opt-in, if the team prefers a non-breaking-default approach. Happy to adjust!

@vincbeck
Contributor

An alternative design: we could add a fail_on_already_running: bool = False parameter to make this opt-in, if the team prefers a non-breaking-default approach. Happy to adjust!

I would personally prefer this solution so that users can decide which behavior they want

When start_crawler() or update_crawler() is called while the crawler is already running (e.g., from a retry, overlapping DAG run, or boto3 internal retry after a timeout), the Glue API raises CrawlerRunningException. Previously this propagated as an unhandled error, causing Airflow task failure despite the crawler actually succeeding.

This change catches CrawlerRunningException on both update_crawler() and start_crawler() calls, logs a warning, and waits for the existing run to complete instead of failing.
@bahram-cdt bahram-cdt force-pushed the fix/handle-crawler-running-exception branch from 973cde4 to 50a2a37 Compare February 16, 2026 18:58
@bahram-cdt
Contributor Author

An alternative design: we could add a fail_on_already_running: bool = False parameter to make this opt-in, if the team prefers a non-breaking-default approach. Happy to adjust!

I would personally prefer this solution so that users can decide which behavior they want

Agreed. Now:

  • fail_on_already_running=True (default) — preserves current behavior, no change for existing users
  • fail_on_already_running=False — opt-in: catches CrawlerRunningException, logs a warning, and waits for the existing run to complete

Added tests covering both modes. Let me know if you'd like any further adjustments.
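With the opt-in design above, usage in a DAG might look like this (hypothetical sketch: the `fail_on_already_running` parameter name is taken from this discussion — check the released provider version before relying on it):

```python
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

crawl = GlueCrawlerOperator(
    task_id="crawl_raw_data",
    config={"Name": "my_crawler"},
    wait_for_completion=True,
    fail_on_already_running=False,  # tolerate an in-flight run instead of failing
)
```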

@bahram-cdt bahram-cdt force-pushed the fix/handle-crawler-running-exception branch from 81e0141 to c2b2763 Compare February 16, 2026 20:24
@bahram-cdt bahram-cdt force-pushed the fix/handle-crawler-running-exception branch from 2c067cc to b0cc60a Compare February 16, 2026 22:31
@bahram-cdt
Contributor Author

@vincbeck Fixed the ruff failure and pushed again.

…ue_crawler.py

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
@bahram-cdt bahram-cdt force-pushed the fix/handle-crawler-running-exception branch from 9fe246b to 2419757 Compare February 17, 2026 07:22
@vincbeck vincbeck merged commit e9b05f9 into apache:main Feb 17, 2026
89 checks passed
@boring-cyborg

boring-cyborg bot commented Feb 17, 2026

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

choo121600 pushed a commit to choo121600/airflow that referenced this pull request Feb 22, 2026
…che#62016)

* Handle CrawlerRunningException in GlueCrawlerOperator

When start_crawler() or update_crawler() is called while the crawler is already running (e.g., from a retry, overlapping DAG run, or boto3 internal retry after a timeout), the Glue API raises CrawlerRunningException. Previously this propagated as an unhandled error, causing Airflow task failure despite the crawler actually succeeding.

This change catches CrawlerRunningException on both update_crawler() and start_crawler() calls, logs a warning, and waits for the existing run to complete instead of failing.

* Update providers/amazon/src/airflow/providers/amazon/aws/operators/glue_crawler.py

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>

---------

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues
