Update Europeana to ingest by timestamp_updated #1268

Closed
stacimc opened this issue Mar 20, 2023 · 1 comment · Fixed by #2817
Labels
💻 aspect: code (Concerns the software code in the repository) · ✨ goal: improvement (Improvement to an existing user-facing feature) · good first issue (New-contributor friendly) · help wanted (Open to participation from the community) · 🟨 priority: medium (Not blocking but should be addressed soon) · 🧱 stack: catalog (Related to the catalog and Airflow DAGs)

Comments

@stacimc
Contributor

stacimc commented Mar 20, 2023

Current Situation

The Europeana DAG is a dated DAG which runs daily and ingests all records which were created on the previous day. This DAG is currently fairly sparse, with many days returning no data at all, while a small number of days have a very large amount of data. To accommodate these rare 'heavy' days, the timeout is 16 hours.

Suggested Improvement

According to the docs, it is possible to update this search to query by timestamp_update rather than timestamp_created:

Syntax: timestamp_update:"2013-03-16T20:26:27.168Z"

Making this change would mean that we could remove the Europeana reingestion workflow: once a record is ingested, any updates to it will be processed automatically as part of regular ingestion.
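As a rough illustration of what the daily query could look like, here is a minimal sketch that builds a one-day `timestamp_update` window. The helper names (`format_timestamp`, `build_update_query`) are hypothetical and not taken from the existing provider script, and the bracketed `[start TO end]` range form is an assumption about the search syntax that should be checked against the Europeana docs before use.

```python
from datetime import date, datetime, timedelta, timezone


def format_timestamp(moment: datetime) -> str:
    """Render a timezone-aware datetime as UTC with millisecond precision,
    matching the shape of the docs example (2013-03-16T20:26:27.168Z)."""
    utc = moment.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z"


def build_update_query(day: date) -> dict[str, str]:
    """Build the `query` parameter for records updated on `day` (UTC).

    Hypothetical helper: the range syntax is assumed, not confirmed here.
    """
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    window = f"[{format_timestamp(start)} TO {format_timestamp(end)}]"
    return {"query": f"timestamp_update:{window}"}


if __name__ == "__main__":
    # Example: the window a daily run on 2023-03-20 would request.
    print(build_update_query(date(2023, 3, 19)))
```

If the range turns out to be inclusive on both ends, a record updated at exactly midnight would appear in two consecutive daily runs, so the ingestion would need to tolerate seeing the same record twice.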

Benefit

In the current setup, days with a very large amount of data risk timing out during their initial ingestion and also every time they are reingested. These 'heavy' days are completely re-processed, likely unnecessarily if there are no updates to most records (we do not have popularity data for Europeana).

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Mar 20, 2023
@zackkrida
Member

Very cool discovery.

@AetherUnbound AetherUnbound added help wanted Open to participation from the community good first issue New-contributor friendly labels Mar 23, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@rwidom rwidom self-assigned this Aug 11, 2023