Airflow log link
Note: Airflow is currently only accessible to maintainers & those given
access. If you would like access to Airflow, please reach out to a member of
@WordPress/openverse-maintainers.
[2024-04-01, 00:03:38 UTC] {requester.py:85} ERROR - Error with the request for URL: https://collection.sciencemuseumgroup.org.uk/search/
[2024-04-01, 00:03:38 UTC] {requester.py:86} INFO - HTTPError: 503 Server Error: Service Unavailable for url: https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=39&date%5Bfrom%5D=0&date%5Bto%5D=200
[2024-04-01, 00:03:38 UTC] {requester.py:88} INFO - Using query parameters {'has_image': 1, 'image_license': 'CC', 'page[size]': 100, 'page[number]': 39, 'date[from]': 0, 'date[to]': 200}
[2024-04-01, 00:03:38 UTC] {requester.py:89} INFO - Using headers {'User-Agent': 'Openverse/0.1 (https://openverse.org; openverse@wordpress.org)', 'Accept': 'application/json'}
[2024-04-01, 00:03:38 UTC] {requester.py:154} ERROR - No retries remaining. Failure.
[2024-04-01, 00:03:38 UTC] {provider_data_ingester.py:513} INFO - Committed 0 records
[2024-04-01, 00:03:39 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: 503 Server Error: Service Unavailable for url: https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=39&date%5Bfrom%5D=0&date%5Bto%5D=200
query_params: {"has_image": 1, "image_license": "CC", "page[size]": 100, "page[number]": 39, "date[from]": 0, "date[to]": 200}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/science_museum.py", line 81, in ingest_records
    super().ingest_records(year_range=year_range)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 276, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 238, in ingest_records
    batch, should_continue = self.get_batch(query_params)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 400, in get_batch
    response_json = self.get_response_json(query_params)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 421, in get_response_json
    return self.delayed_requester.get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 202, in get_response_json
    response_json = self._attempt_retry_get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 165, in _attempt_retry_get_response_json
    return self.get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 202, in get_response_json
    response_json = self._attempt_retry_get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 165, in _attempt_retry_get_response_json
    return self.get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 202, in get_response_json
    response_json = self._attempt_retry_get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 165, in _attempt_retry_get_response_json
    return self.get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 202, in get_response_json
    response_json = self._attempt_retry_get_response_json(
  File "/opt/airflow/catalog/dags/common/requester.py", line 155, in _attempt_retry_get_response_json
    raise error
  File "/opt/airflow/catalog/dags/common/requester.py", line 181, in get_response_json
    response = self.get(endpoint, params=query_params, **kwargs)
  File "/opt/airflow/catalog/dags/common/requester.py", line 103, in get
    return self._make_request(self.session.get, url, params=params, **kwargs)
  File "/opt/airflow/catalog/dags/common/requester.py", line 70, in _make_request
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=39&date%5Bfrom%5D=0&date%5Bto%5D=200
Description
The Science Museum DAG appears to be failing on one specific request: page 39 of the search with date[from]=0 and date[to]=200, which returns a 503. The full URL and query parameters are in the log above.
Reproduction
Changing the page[number] param from 39 to 40 returns a non-503 response.
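The comparison can be scripted with a minimal sketch like the one below, using only the Python standard library. The endpoint, query parameters, and headers are copied from the log above; `build_search_url` and `fetch_status` are hypothetical helper names for illustration and are not part of the Openverse catalog code.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint and headers as they appear in the Airflow log.
ENDPOINT = "https://collection.sciencemuseumgroup.org.uk/search/"
HEADERS = {
    "User-Agent": "Openverse/0.1 (https://openverse.org; openverse@wordpress.org)",
    "Accept": "application/json",
}


def build_search_url(page_number: int) -> str:
    """Build the search URL for one page of the 0-200 date range (hypothetical helper)."""
    params = {
        "has_image": 1,
        "image_license": "CC",
        "page[size]": 100,
        "page[number]": page_number,
        "date[from]": 0,
        "date[to]": 200,
    }
    # urlencode percent-encodes the square brackets, matching the logged URL.
    return f"{ENDPOINT}?{urlencode(params)}"


def fetch_status(page_number: int) -> int:
    """Return the HTTP status code for the given page (performs a live request)."""
    request = Request(build_search_url(page_number), headers=HEADERS)
    try:
        with urlopen(request, timeout=30) as response:
            return response.status
    except HTTPError as error:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return error.code
```

Calling `fetch_status(39)` and `fetch_status(40)` side by side is enough to check whether the upstream failure is still limited to page 39.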
Since this is entirely an upstream bug, I think the best option here might be to skip a particular page specifically when we receive a 503 response for it.
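A rough sketch of what skip-on-503 could look like, independent of the actual catalog code. Everything here is an assumption for illustration: `HTTPStatusError`, `ingest_all_pages`, the `fetch_page` callable, and the `max_skips` cap are hypothetical names and do not reflect how `ProviderDataIngester` or the requester are really structured.

```python
from typing import Callable, List, Tuple


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP error raised once retries are exhausted."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def ingest_all_pages(
    fetch_page: Callable[[int], list],
    first_page: int = 1,
    max_skips: int = 3,
) -> Tuple[list, List[int]]:
    """Collect records page by page, skipping pages that fail with a 503.

    Only 503 responses are skipped. Any other error, or more than
    `max_skips` skipped pages, still fails the run so that a real
    outage is not silently ignored.
    """
    records: list = []
    skipped: List[int] = []
    page = first_page
    while True:
        try:
            batch = fetch_page(page)
        except HTTPStatusError as error:
            if error.status_code != 503 or len(skipped) >= max_skips:
                raise
            skipped.append(page)  # remember the page so it can be reported
            page += 1
            continue
        if not batch:
            break  # an empty batch marks the end of the results
        records.extend(batch)
        page += 1
    return records, skipped
```

Returning the skipped page numbers alongside the records keeps the gap visible, so the skipped pages can be logged or retried on a later run rather than dropped without a trace.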
Note
Once this issue is resolved, we should take special care to confirm that we're actually ingesting data from this provider. The last large run returned nearly 100k results, but the run just before this failure returned only ~150. There may be a separate issue preventing normal ingestion of records, possibly a change in the shape of the results.
DAG status
No change. This is a monthly DAG, and we should hopefully address the failure soon.
Status: ⛔ Blocked