Description
opened on Apr 25, 2024
Due to an upstream failure tracked in #4013, the Science Museum DAG occasionally fails. We are running the DAG in production with SKIPPED_INGESTION_ERRORS set to skip 503s so that the DAG can complete.
However, in the latest production run, this did not work as expected. When the batch with the 503 error is reached, the logs indicate that the batch was successfully skipped -- but ingestion also halts immediately afterward, instead of moving on to the next batch:
[2024-04-18, 02:07:26 UTC] {provider_data_ingester.py:270} ERROR - Skipping batch due to ingestion error: 503 Server Error: Service Unavailable for url: https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=43&date%5Bfrom%5D=1500&date%5Bto%5D=1750
[2024-04-18, 02:07:31 UTC] {provider_data_ingester.py:244} INFO - Batch complete.
[2024-04-18, 02:07:31 UTC] {media.py:237} INFO - Writing 11 lines from buffer to disk.
[2024-04-18, 02:07:31 UTC] {provider_data_ingester.py:513} INFO - Committed 12982 records
This is a concern because the provider stops ingesting after records dated to 1750, so it never reaches the vast majority of the records. This is high priority because we need a full ingestion run of this provider to fix data, including the URLs, that has been broken by recent upstream changes.
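For reference, the expected skip-and-continue behavior can be sketched roughly as follows. This is a minimal illustration only, not the actual code in provider_data_ingester.py: `ingest`, `fake_fetch`, `ServiceUnavailable`, and the page range are all hypothetical names and values chosen for the example.

```python
class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 raised while fetching a batch."""


def ingest(fetch_batch, first_page, last_page, skipped_errors=(ServiceUnavailable,)):
    """Ingest pages first_page..last_page, skipping batches that raise a skippable error."""
    committed = 0
    skipped_pages = []
    for page in range(first_page, last_page + 1):
        try:
            batch = fetch_batch(page)
        except skipped_errors as err:
            # Record the failure and move on: a skipped batch must not end the run.
            skipped_pages.append((page, str(err)))
            continue
        committed += len(batch)
    return committed, skipped_pages


def fake_fetch(page):
    # Hypothetical upstream: page 43 fails with a 503, every other page returns 100 records.
    if page == 43:
        raise ServiceUnavailable("503 Server Error: Service Unavailable")
    return [{"page": page, "index": i} for i in range(100)]


committed, skipped = ingest(fake_fetch, first_page=42, last_page=45)
print(committed, [page for page, _ in skipped])
```

In this sketch the failing page is recorded and the loop advances to the following pages, which is the behavior we would expect from the production run instead of halting after the skipped batch.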
Metadata
Status
✅ Done