Update Europeana to ingest by timestamp_updated #1268
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
good first issue
New-contributor friendly
help wanted
Open to participation from the community
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: catalog
Related to the catalog and Airflow DAGs
Current Situation
The Europeana DAG is a dated DAG that runs daily and ingests all records created on the previous day. The data is currently fairly sparse: many days return no records at all, while a small number of days return a very large amount. To accommodate these rare 'heavy' days, the timeout is set to 16 hours.
Suggested Improvement
According to the docs, it is possible to update this search to query by timestamp_update rather than timestamp_create.
Making this change would mean that we could remove the Europeana reingestion workflow, as once a record is ingested any updates to it will be processed automatically as part of regular ingestion.
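As a rough illustration of the proposed change, the sketch below builds a one-day range filter on the update timestamp instead of the creation timestamp. The function name `build_date_range_filter`, the `field:[start TO end]` range syntax, and the ISO 8601 UTC timestamp format are assumptions made for illustration; the field names are taken from this issue and should be checked against the Europeana docs before use.

```python
from datetime import datetime, timedelta, timezone


def build_date_range_filter(date: str, field: str = "timestamp_update") -> str:
    """Build a range filter covering the single UTC day given by `date`.

    Hypothetical helper: the `field:[start TO end]` syntax and the
    timestamp format are assumptions, not confirmed against the API.
    """
    # Interpret the ingestion date (YYYY-MM-DD) as midnight UTC.
    start = datetime.strptime(date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"{field}:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"


# Current behaviour (filter on creation time) versus the proposed one.
created_filter = build_date_range_filter("2023-05-01", field="timestamp_create")
updated_filter = build_date_range_filter("2023-05-01")
```

With a filter on the update timestamp, a record that changes after its first ingestion would fall into the range for the day it was updated, which is what would make a separate reingestion pass redundant.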
Benefit
In the current setup, days with a very large amount of data risk timing out during their initial ingestion and again every time they are reingested. These 'heavy' days are completely re-processed on each reingestion, which is likely unnecessary when most records have not been updated (we do not have popularity data for Europeana).