Filesize exceeds postgres integer column maximum size #1583

Open

Description

We recently had a Wikimedia provider script run failure due to the following:

[2022-09-17, 00:16:40 UTC] {dbapi.py:231} INFO - Running statement: 
SELECT aws_s3.table_import_from_s3(
  'provider_data_audio_wikimedia_commons_20220916T000000',
  '',
  'DELIMITER E''	''',
  'openverse-catalog',
  'audio/wikimedia_commons/year=2022/month=09/wikimedia_audio_audio_v001_20220917000001.tsv',
  'us-east-1'
);
, parameters: None
[2022-09-17, 00:16:40 UTC] {taskinstance.py:1909} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/loader.py", line 45, in load_from_s3
    loaded, missing_columns, foreign_id_dup = sql.load_s3_data_to_intermediate_table(
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/sql.py", line 156, in load_s3_data_to_intermediate_table
    loaded = postgres.run(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/hooks/dbapi.py", line 211, in run
    self._run_command(cur, sql_statement, parameters)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/hooks/dbapi.py", line 235, in _run_command
    cur.execute(sql_statement)
psycopg2.errors.NumericValueOutOfRange: value "3381853454" is out of range for type integer
CONTEXT:  COPY provider_data_audio_wikimedia_commons_20220916t000000, line 96, column filesize: "3381853454"

This unfortunately appears to be a legitimately large file size, as the upstream audio source is over 8 hours in length: https://commons.wikimedia.org/w/index.php?curid=123060206

This raises an interesting question for us though: should we accept files that are larger than the integer maximum of 2147483647 bytes (~2 GB)? Is that something we care to have indexed? If so, we'll need to alter this column in the catalog, which could take quite some time. Alternatively, we could simply reject records with file sizes this large, perhaps as part of the MediaStore logic. @WordPress/openverse-catalog what do you think?
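
For illustration, the rejection alternative could look something like the standalone helper below. The function name and the choice to null out the field rather than drop the whole record are assumptions for this sketch, not existing MediaStore behavior.

# Hypothetical validation helper, not an existing MediaStore method.
POSTGRES_INT_MAX = 2_147_483_647  # upper bound of a 4-byte integer column


def validate_filesize(filesize: int | None) -> int | None:
    """Return None for filesize values that would overflow the integer column."""
    if filesize is not None and int(filesize) > POSTGRES_INT_MAX:
        return None
    return filesize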

Update: Our intent now is to modify the column type to bigint to allow values which exceed the current maximum. A proposal for how this could be done can be found on the Make WP blog.
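
For reference, a minimal sketch of the type change, assuming it is run through the same PostgresHook pattern the loader already uses; the table name and connection id below are placeholders rather than the actual migration.

from airflow.providers.postgres.hooks.postgres import PostgresHook

# Placeholder statement; the real migration would need to cover every media
# table that carries a filesize column.
ALTER_FILESIZE_TO_BIGINT = "ALTER TABLE audio ALTER COLUMN filesize TYPE bigint;"


def widen_filesize_column(postgres_conn_id: str) -> None:
    """Widen filesize from integer to bigint so values above 2147483647 fit."""
    postgres = PostgresHook(postgres_conn_id=postgres_conn_id)
    postgres.run(ALTER_FILESIZE_TO_BIGINT)

Note that changing a column from integer to bigint rewrites the table under an ACCESS EXCLUSIVE lock, which is why the operation is expected to take a while on tables of this size.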

Reproduction

  1. Set the delay attribute of the WikimediaCommonsDataIngester class to 0.1
diff --git a/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py b/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
index d73c9f2..c28b128 100644
--- a/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
+++ b/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
@@ -47,6 +47,7 @@ class WikimediaCommonsDataIngester(ProviderDataIngester):
         "audio": prov.WIKIMEDIA_AUDIO_PROVIDER,
     }
     endpoint = f"https://{HOST}/w/api.php"
+    delay = 0.1
     headers = {"User-Agent": prov.UA_STRING}
 
     # The batch_limit applies to the number of pages received by the API, rather
  2. just recreate
  3. just shell
  4. airflow dags backfill -s 2022-09-16 -e 2022-09-16 -v wikimedia_commons_workflow

Resolution

  • 🙋 I would be interested in resolving this bug.