Skip to content

improve media tag processing for sql loads #1330

Open

Description

Current Situation

The inaturalist provider script uses some non-trivial SQL formulae to build json for tags, and the next SQL-only ingestion will need to do a similar thing unless we come up with something DRYer like we have in python. https://github.com/WordPress/openverse-catalog/blob/337ea7aede228609cbd5031e3a501f22b6ccc482/openverse_catalog/dags/common/storage/media.py#L265

Suggested Improvement

This is fine when there is just one SQL-oriented provider, but if/when we add more, the code will get unwieldy. Some imperfect ideas: make a separate table for tags (nice for structure and searching by tag, not nice for retrieving media rows central to existing operations), make a database function (nice for DRYness, not nice for switching costs when applied row by row for many many rows).

Another option might be declaring types, for example

create type openverse_tag as (name varchar(255), provider varchar(255));

And then just converting an array of openverse_tags to json directly with row_to_json or something?

Benefit

DRY more maintainable code, maybe more efficient?

Implementation

  • 🙋 I would be interested in implementing this feature.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    ✨ goal: improvementImprovement to an existing user-facing feature💻 aspect: codeConcerns the software code in the repository🟩 priority: lowLow priority and doesn't need to be rushed🧱 stack: catalogRelated to the catalog and Airflow DAGs

    Type

    No type

    Projects

    • Status

      📋 Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions