Description
We recently had a Wikimedia provider script run failure due to the following:
```
[2022-09-17, 00:16:40 UTC] {dbapi.py:231} INFO - Running statement:
SELECT aws_s3.table_import_from_s3(
    'provider_data_audio_wikimedia_commons_20220916T000000',
    '',
    'DELIMITER E'' ''',
    'openverse-catalog',
    'audio/wikimedia_commons/year=2022/month=09/wikimedia_audio_audio_v001_20220917000001.tsv',
    'us-east-1'
);
, parameters: None
[2022-09-17, 00:16:40 UTC] {taskinstance.py:1909} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/loader.py", line 45, in load_from_s3
    loaded, missing_columns, foreign_id_dup = sql.load_s3_data_to_intermediate_table(
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/sql.py", line 156, in load_s3_data_to_intermediate_table
    loaded = postgres.run(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/hooks/dbapi.py", line 211, in run
    self._run_command(cur, sql_statement, parameters)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/hooks/dbapi.py", line 235, in _run_command
    cur.execute(sql_statement)
psycopg2.errors.NumericValueOutOfRange: value "3381853454" is out of range for type integer
CONTEXT: COPY provider_data_audio_wikimedia_commons_20220916t000000, line 96, column filesize: "3381853454"
```
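For reference, PostgreSQL's `integer` type is a signed 32-bit value capping at 2,147,483,647, so the reported filesize of 3,381,853,454 overflows it regardless of how the row is loaded. A minimal sketch that triggers the same error class directly through psycopg2 (the connection parameters and table here are illustrative, not taken from the catalog):

```python
import psycopg2

# Illustrative connection parameters; any local PostgreSQL instance will do.
conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
cur = conn.cursor()

# PostgreSQL's `integer` type is 4 bytes, so its maximum value is 2147483647.
cur.execute("CREATE TEMP TABLE filesize_demo (filesize integer)")

try:
    cur.execute("INSERT INTO filesize_demo VALUES (%s)", (3381853454,))
except psycopg2.errors.NumericValueOutOfRange as err:
    # Same error class as the one raised by the COPY in the traceback above.
    print(err)
    conn.rollback()

cur.close()
conn.close()
```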
This unfortunately appears to be a legitimately large file size, as the upstream audio source is over 8 hours in length: https://commons.wikimedia.org/w/index.php?curid=123060206
This raises an interesting question for us, though: should we accept files that are larger than the integer maximum of 2147483647 bytes (~2GB)? Is that something we care to have indexed? If so, we'll need to alter this column in the catalog, which could take quite some time. Alternatively, we could simply reject records with file sizes this large, perhaps as part of the MediaStore logic (sketched below). @WordPress/openverse-catalog what do you think?
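To illustrate the rejection option, a guard somewhere in the MediaStore path might look roughly like the sketch below. The constant name, function name, and the choice to null the field rather than drop the whole record are assumptions for illustration, not existing catalog code.

```python
# Hypothetical sketch of a filesize guard for the MediaStore path.
# Names and placement are illustrative only, not existing catalog code.

PG_INTEGER_MAX = 2_147_483_647  # largest value a PostgreSQL `integer` column can hold


def clean_filesize(filesize: int | None) -> int | None:
    """
    Return the filesize if it fits in an `integer` column, otherwise None.

    Returning None keeps the record but drops the oversized value; raising
    instead would reject the record entirely.
    """
    if filesize is None:
        return None
    if int(filesize) > PG_INTEGER_MAX:
        return None
    return int(filesize)
```

Whether to null the field or skip the record entirely is exactly the decision being asked about above.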
Update: Our intent now is to modify the column type to `bigint` to allow values which exceed the current maximum. A proposal for how this could be done can be found on the Make WP blog.
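For reference, the column change itself can be a single `ALTER TABLE ... TYPE bigint`, though widening an integer column rewrites the table under an exclusive lock, which is why the rollout details live in the linked proposal. A rough sketch of running it through an Airflow Postgres hook follows; the connection id and table name are assumptions, not the agreed migration plan.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Rough sketch only: the connection id and table name are assumptions, and the
# actual rollout strategy is described in the linked proposal.
postgres = PostgresHook(postgres_conn_id="postgres_openledger_upstream")

# Widening integer -> bigint rewrites the table and holds an ACCESS EXCLUSIVE
# lock for the duration, so this would need a maintenance window.
postgres.run("ALTER TABLE audio ALTER COLUMN filesize TYPE bigint;")
```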
Reproduction
- Set the `delay` attribute of the `WikimediaCommonsDataIngester` class to `0.1`:

```diff
diff --git a/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py b/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
index d73c9f2..c28b128 100644
--- a/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
+++ b/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
@@ -47,6 +47,7 @@ class WikimediaCommonsDataIngester(ProviderDataIngester):
         "audio": prov.WIKIMEDIA_AUDIO_PROVIDER,
     }
     endpoint = f"https://{HOST}/w/api.php"
+    delay = 0.1
     headers = {"User-Agent": prov.UA_STRING}
     # The batch_limit applies to the number of pages received by the API, rather
```
- Run `just recreate`
- Run `just shell`
- Run `airflow dags backfill -s 2022-09-16 -e 2022-09-16 -v wikimedia_commons_workflow`
Additional context
Resolution
- 🙋 I would be interested in resolving this bug.