
Report reingestion errors in aggregate #4074

Merged
stacimc merged 6 commits into main on Apr 15, 2024

Conversation

@stacimc (Collaborator) commented on Apr 8, 2024

Fixes

Fixes #1379 by @stacimc

Description

When there is a problem with a provider script in a reingestion workflow, it's fairly common for the error to occur for many different reingestion dates and flood the Slack alerts channel with a message for each error. This PR prevents that flood by:

  • disabling the slack on_failure_callback for the pull_data task of reingestion workflows only
  • adding a report_aggregate_reingestion_errors task to the end of reingestion workflows that checks to see if there were any failed pull_data tasks, and if so sends a message to the Slack alerts channel with a list of links to the log files for failed tasks.

This is a simpler approach than what was originally proposed in the issue, and does not require modifying the ProviderDataIngester. Most of the changes in this PR are just updating the variable name conf -> provider_conf everywhere in the DAG factory, to distinguish the ProviderWorkflow conf object from Airflow's built-in DAG conf (which was necessary in order to pass dag conf through taskflow).
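For illustration, here is a minimal sketch of how the aggregate-reporting task could be wired up with TaskFlow. This is not the exact code from the PR: the send_alert helper, the "pull" task-ID filter, and the all_done trigger rule are assumptions for the sketch.

    from airflow.decorators import task
    from airflow.exceptions import AirflowSkipException
    from airflow.operators.python import get_current_context
    from airflow.utils.state import TaskInstanceState


    @task(trigger_rule="all_done")
    def report_aggregate_reingestion_errors(dag_id: str):
        dag_run = get_current_context()["dag_run"]

        # Collect every pull_data task instance that failed, across all
        # reingestion day shifts.
        failed_pull_data_tasks = [
            ti
            for ti in dag_run.get_task_instances(state=[TaskInstanceState.FAILED])
            if "pull" in ti.task_id
        ]

        if not failed_pull_data_tasks:
            # Nothing failed, so skip instead of posting an empty report.
            raise AirflowSkipException("No failed pull_data tasks to report.")

        message = (
            f"Ingestion errors were encountered in {len(failed_pull_data_tasks)}"
            f" ingestion days while running the `{dag_id}` DAG. See the following"
            " logs for details:\n"
        ) + "\n".join(
            f"  - <{ti.log_url}|{ti.task_id}>" for ti in failed_pull_data_tasks
        )
        send_alert(message)  # hypothetical Slack helper; the catalog's own alert utility would be used here

Because the Slack on_failure_callback is disabled only for the pull_data tasks, individual failures stay in the task logs while this single task produces the one aggregate alert.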

Testing Instructions

The easiest way to test involves some quick modifications to a reingestion workflow. I modified Phylopic by first reducing the number of reingestion days so it wouldn't take so long. In https://github.com/WordPress/openverse/blob/d57b1d4526975dbdaf9e86aa3ba44b27174b6b96/catalog/dags/providers/provider_reingestion_workflows.py:

    ProviderReingestionWorkflow(
        # 64 total reingestion days
        ingester_class=PhylopicDataIngester,
        max_active_tasks=2,
        pull_timeout=timedelta(hours=12),
-        daily_list_length=6,
+        daily_list_length=2,
-        one_month_list_length=9,
-        three_month_list_length=18,
-        six_month_list_length=30,
    ),
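(For reference: the shifted-day list lengths in the original configuration sum to 6 + 9 + 18 + 30 = 63, which together with the unshifted run gives the 64 total reingestion days noted in the comment. After the change only daily_list_length=2 remains, so the DAG covers the unshifted day plus two daily shifts, hence the 3 reingestion dates mentioned below.)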

Then I ran the phylopic_reingestion_workflow, which now has only 3 reingestion dates. At each pull_data, I manually marked the task as Success as soon as it started running, then verified that the new report_aggregate_reingestion_errors task was skipped.

Then I modified the Phylopic script to raise an error:

    def get_batch_data(self, response_json):
+        raise ValueError("Oops!")
-        return response_json.get("_embedded", {}).get("items", [])

And re-ran the phylopic_reingestion_workflow DAG. This time you should see all three pull_data tasks fail. When you inspect the logs for these tasks you should see the error message and stack trace, but you should not see the message being reported to Slack (i.e., no log line like Using connection ID 'slack_alerts' for task execution.). You should then see that the report_aggregate_reingestion_errors task was not skipped, and when you check its logs you should see it reporting something like:

[2024-04-08, 23:23:55 UTC] {slack.py:329} INFO - Ingestion errors were encountered in 3 ingestion days while running the `phylopic_reingestion_workflow` DAG. See the following logs for details:
  - <http://localhost:8080/log?execution_date=2024-04-08T23%3A23%3A27.614959%2B00%3A00&task_id=ingest_data_day_shift_1.pull_image_data_day_shift_1&dag_id=phylopic_reingestion_workflow&map_index=-1 | ingest_data_day_shift_1.pull_image_data_day_shift_1>
  - <http://localhost:8080/log?execution_date=2024-04-08T23%3A23%3A27.614959%2B00%3A00&task_id=ingest_data.pull_image_data&dag_id=phylopic_reingestion_workflow&map_index=-1 | ingest_data.pull_image_data>
  - <http://localhost:8080/log?execution_date=2024-04-08T23%3A23%3A27.614959%2B00%3A00&task_id=ingest_data_day_shift_2.pull_image_data_day_shift_2&dag_id=phylopic_reingestion_workflow&map_index=-1 | ingest_data_day_shift_2.pull_image_data_day_shift_2>

Also try running a regular (non-reingestion) provider DAG to ensure there were no issues with the variable renaming.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).


@stacimc added labels on Apr 8, 2024: 🟨 priority: medium (Not blocking but should be addressed soon), ✨ goal: improvement (Improvement to an existing user-facing feature), 💻 aspect: code (Concerns the software code in the repository), 🧱 stack: catalog (Related to the catalog and Airflow DAGs)
@stacimc self-assigned this on Apr 8, 2024
@stacimc requested a review from a team as a code owner on Apr 8, 2024
@krysal (Member) left a comment:
I tested the instructions, and it works great! Glad that there was an easier path ⭐

@AetherUnbound (Collaborator) left a comment:

LGTM, tested locally and it worked as expected! Two notes, nothing to block a merge though. Thanks for jumping on this!

catalog/dags/providers/provider_dag_factory.py (outdated; resolved)

            f" ingestion days while running the `{provider_conf.dag_id}` DAG. See the following"
            " logs for details:\n"
        ) + "\n".join(
            f" - <{task.log_url}|{task.task_id}>" for task in failed_pull_data_tasks
Collaborator commented:

It would be really cool if we could see the actual exceptions themselves here as well...but I think that would require a lot more code to pull the context or something similar. Ah well, way better than 100s of messages with the same content!

@stacimc (Collaborator, author) replied:

Agreed, but yeah I think that would require more work -- and decisions about how to present that in the message without visually cluttering it up again, particularly if not all of the tasks have the same error or whatever.

That would be a really nice addition though for the future... maybe we could have a longer report that groups failed tasks by their error, that's only logged in the aggregate reporting step (rather than being part of the slack message)... 🤔
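A rough sketch of what that future grouping idea might look like, assuming a hypothetical get_error_summary(ti) helper that can extract a one-line error description from a failed task instance:

    from collections import defaultdict


    def group_failures_by_error(failed_tasks, get_error_summary):
        # Map each distinct error summary to the task IDs that hit it, so the
        # aggregate report can log one section per error instead of one line
        # per failed task.
        grouped = defaultdict(list)
        for ti in failed_tasks:
            grouped[get_error_summary(ti)].append(ti.task_id)
        return "\n\n".join(
            error + "\n" + "\n".join(f"  - {task_id}" for task_id in task_ids)
            for error, task_ids in grouped.items()
        )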

Collaborator replied:

Oooo, that would be such a cool idea! But yea, not necessary here, just pondering!

@stacimc merged commit 079bf0e into main on Apr 15, 2024
41 checks passed
@stacimc deleted the update/report-reingestion-errors-in-aggregate branch on April 15, 2024