Problem
The iha provider is 100% dead and accounts for over 1 million works, so there is no reason to include it in the index. Yet every single data refresh and filtered index creation has to process these dead records, and every single query needs to spend extra time excluding them.
Description
Add the ability to filter providers at the data refresh level, so that their records never get imported from the catalogue database into the API database and therefore never get indexed. There may be other providers we should apply this to (500px is filtered at the API level, but I can't remember whether that's because it has a high overall number of dead links or because it is 100% dead links like iha, which doesn't exist anymore). For now we would just use it for iha.
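To make the proposal concrete, here is a minimal sketch of what an exclusion at copy time could look like. It is not the actual data refresh code: `EXCLUDED_PROVIDERS`, `build_copy_query`, the `provider` column name, and the psycopg-style `%s` placeholders are all assumptions made for illustration.

```python
import re

# Hypothetical setting: providers whose records should never be copied
# from the catalogue database into the API database.
EXCLUDED_PROVIDERS = frozenset({"iha"})

# Table names cannot be passed as query parameters, so validate them first.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_copy_query(upstream_table, excluded=EXCLUDED_PROVIDERS):
    """Return (sql, params) selecting every row whose provider is not excluded.

    Assumes a non-null `provider` column on the upstream table.
    """
    if not _IDENTIFIER.fullmatch(upstream_table):
        raise ValueError(f"unsafe table name: {upstream_table!r}")
    sql = f"SELECT * FROM {upstream_table}"
    params = sorted(excluded)
    if params:
        # One placeholder per excluded provider; values are bound by the driver.
        placeholders = ", ".join(["%s"] * len(params))
        sql += f" WHERE provider NOT IN ({placeholders})"
    return sql, params
```

With a filter like this applied when rows are copied, excluded records would never reach the API database, so neither the indexer nor individual queries would need to skip them afterwards.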
Alternatives
Delete iha data from the catalogue. We're not in the business of archiving this data, so maybe it's better to just remove it from the catalogue database altogether. @WordPress/openverse-maintainers, which do y'all think is best? Dynamically filtering at data refresh time, or just ditching the data altogether? To clarify, iha no longer exists and all results lead to scammy crypto casino domain parking pages. They also don't appear in staging because the provider is filtered, so this issue doesn't affect end users directly, just our data quality and health.
Status
✅ Done