Increase Metropolitan reingestion timeout #1293

Open
stacimc opened this issue Jan 31, 2023 · 1 comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow

Comments

@stacimc
Contributor

stacimc commented Jan 31, 2023

Description

We have a number of failures in the metropolitan_reingestion_workflow caused by AirflowTaskTimeouts during the pull_data step.

Per context in this comment, it's possible this is being caused by the entire DAG timing out, rather than the individual task. We should investigate to make sure.

Possible fixes:

@stacimc stacimc added good first issue New-contributor friendly help wanted Open to participation from the community 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow and removed good first issue New-contributor friendly help wanted Open to participation from the community labels Jan 31, 2023
@stacimc
Contributor Author

stacimc commented Jan 31, 2023

While investigating this, I noticed that the Metropolitan reingestion flow is exceeding its dagrun_timeout of 23 hours and skipping many of its reingestion tasks. I was about to make a separate issue for this, but after looking a little closer I think it may actually be the cause of this issue.

The task timeout for the pull_data task for Met is currently 16 hours, but a glance through the logs for the most recent pull_data timeout failures shows that they timed out after only 12 hours (although 16 hours is correctly configured in the Task Instance Details). I think it's possible the tasks are timing out early when the entire DAG times out at 23 hours. We should be able to verify this, but I've removed the help wanted label as this will require investigating the logs in production.
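
The arithmetic behind this hypothesis can be sketched roughly as follows. This is a minimal illustration, not Airflow code: `effective_runtime` is a hypothetical helper, and the 11-hour start offset is an assumed value chosen only because it would reproduce the observed 12-hour cutoff. The real offset would have to be read from the production logs.

```python
from datetime import timedelta

DAGRUN_TIMEOUT = timedelta(hours=23)  # dagrun_timeout stated in this issue
TASK_TIMEOUT = timedelta(hours=16)    # pull_data task timeout stated in this issue


def effective_runtime(start_offset, task_timeout=TASK_TIMEOUT,
                      dagrun_timeout=DAGRUN_TIMEOUT):
    """Longest a task can run if it starts `start_offset` into the DAG run.

    Hypothetical helper: assumes the task is cut off by whichever limit
    is reached first, its own timeout or the end of the DAG run.
    """
    remaining = dagrun_timeout - start_offset
    return max(timedelta(0), min(task_timeout, remaining))


# Assumed start offset of 11 hours (not taken from the logs).
print(effective_runtime(timedelta(hours=11)))  # 12:00:00
```

If the logs show the failed tasks starting well before the 7-hour mark of the DAG run, this explanation would not hold, since such tasks would have the full 16 hours available.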

Since the Metropolitan reingestion workflow is run @weekly, we can increase the dagrun timeout. Alternatively, we should consider reducing the number of reingestion tasks so that it can be completed within the 23-hour timeframe, and then update the schedule to @daily.
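
The two options can be compared with a rough check that a run's timeout fits inside its schedule interval, so that one run has finished or timed out before the next is due. This is only a sketch under assumptions: `SCHEDULE_INTERVALS` and `fits_schedule` are hypothetical names, and the 3-day timeout is a placeholder rather than a value proposed in this issue.

```python
from datetime import timedelta

# Intervals implied by the two schedule presets mentioned in this issue.
SCHEDULE_INTERVALS = {
    "@daily": timedelta(days=1),
    "@weekly": timedelta(days=7),
}


def fits_schedule(schedule, dagrun_timeout):
    """Return True if a run is guaranteed to end before the next one is due."""
    return dagrun_timeout <= SCHEDULE_INTERVALS[schedule]


# Option 1: keep @weekly and raise the timeout (3 days is a placeholder).
print(fits_schedule("@weekly", timedelta(days=3)))
# Option 2: fewer reingestion tasks, keep 23 hours, switch to @daily.
print(fits_schedule("@daily", timedelta(hours=23)))
# A multi-day timeout would no longer fit a @daily schedule.
print(fits_schedule("@daily", timedelta(days=3)))
```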

@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants