
Ability to exclude providers from data refresh #3316

Closed as not planned
Description

Problem

The iha provider is 100% dead and accounts for over 1 million works, so there is no reason to include it in the index. Yet every data refresh and filtered index creation has to process these dead records, and every query has to spend extra time excluding them.

Description

Add the ability to filter providers at the data refresh level so that they never get imported from the catalogue database into the API database, and therefore never get indexed. There may be other providers we should apply this to (500px is filtered at the API level, but I can't remember whether that's because it has a high overall number of dead links or because it is 100% dead links like iha, which doesn't exist anymore). For now we would just use it for iha.
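A rough sketch of what filtering at copy time could look like, to make the idea concrete. This is not how the actual data refresh is implemented: the table names (`catalogue_image`, `api_image`), the `EXCLUDED_PROVIDERS` setting, and the `copy_excluding_providers` helper are all made up for illustration, and SQLite stands in for the real databases.

```python
import sqlite3

# Hypothetical setting: providers that should never reach the API database.
EXCLUDED_PROVIDERS = frozenset({"iha"})


def build_demo_db():
    """Create an in-memory stand-in for the catalogue and API tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE catalogue_image (identifier TEXT PRIMARY KEY, provider TEXT NOT NULL);
        CREATE TABLE api_image (identifier TEXT PRIMARY KEY, provider TEXT NOT NULL);
        INSERT INTO catalogue_image VALUES
            ('a1', 'flickr'), ('b2', 'iha'), ('c3', 'wikimedia'), ('d4', 'iha');
        """
    )
    return conn


def copy_excluding_providers(conn, excluded=EXCLUDED_PROVIDERS):
    """Copy catalogue rows into the API table, skipping excluded providers.

    Returns the number of rows in the API table after the copy.
    """
    query = (
        "INSERT INTO api_image (identifier, provider) "
        "SELECT identifier, provider FROM catalogue_image"
    )
    params = sorted(excluded)
    if params:
        # One bound placeholder per provider, so names are never interpolated.
        placeholders = ", ".join("?" for _ in params)
        query += f" WHERE provider NOT IN ({placeholders})"
    conn.execute(query, params)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM api_image").fetchone()[0]
```

Because the excluded rows are dropped before they reach the API table, nothing downstream (indexing, filtered index creation, query-time filters) would need to know about them.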

Alternatives

Delete iha data from the catalogue. We're not in the business of archiving this data, so maybe it's better to just remove it from the catalogue database altogether. @WordPress/openverse-maintainers, which do y'all think is best: dynamically filtering at data refresh time, or just ditching the data altogether? To clarify, iha no longer exists and all results lead to scammy crypto casino domain-parking pages. They also don't appear in staging because the provider is filtered, so this issue isn't about end users directly, just our data quality and health.

