Skip to content

Changes to Catalog of Life data causing iNaturalist failure #4873

Open

Description

Airflow log link

Note: Airflow is currently only accessible to maintainers & those given
access. If you would like access to Airflow, please reach out to a member of
@WordPress/openverse-maintainers
.

https://airflow.openverse.org/dags/inaturalist_workflow/grid?dag_run_id=scheduled__2024-08-02T00%3A00%3A00%2B00%3A00&task_id=ingest_data.preingestion_tasks.load_catalog_of_life_names&base_date=2024-08-02T00%3A00%3A00%2B0000&tab=logs

Description

We're seeing the following issue with the iNaturalist DAG:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 762, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 733, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 406, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 238, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 256, in execute_callable
    return runner.run(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/inaturalist.py", line 276, in load_catalog_of_life_names
    pg.copy_expert(COPY_SQL.format("col_vernacular"), OUTPUT_DIR / vernacular_file)
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/postgres/hooks/postgres.py", line 197, in copy_expert
    cur.copy_expert(sql, file)
psycopg2.errors.BadCopyFileFormat: extra data after last expected column
CONTEXT:  COPY col_vernacular, line 2: "49VSW		Longnose cusk-eel	Longnose cusk-eel	eng		US						"

This can follow the same process as #4711 for how to identify and address. Since it happens on line 2 (immediately after the header), there's likely been a schema change.

Reproduction

I'm able to reproduce this locally by running the inaturalist_workflow.

DAG status

Since this is a terminal failure, I've paused the DAG until this can be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    • Status

      📅 To Do

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions