Description
Problem
PR WordPress/openverse-api#474 introduced an approach to creating a pseudo-random subset by ordering the primary query on identifier. Unfortunately, while I thought the index on identifier would help, the query still takes an incredibly long time to return results.
Description
We don't really care about true randomness or even an exact number of records selected, so we could potentially use an approach involving TABLESAMPLE SYSTEM to get a fast subset: https://stackoverflow.com/a/8675160/3277713. One thing to consider is making sure this stays robust during integration testing, where the tables are small, and still copies sufficient data in that case. It may be necessary to base the sample percentage on an estimate of the table's row count and to enforce a bare minimum number of rows.
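As a rough illustration of the idea above, here is a minimal sketch of deriving the sampling percentage from an estimated row count while enforcing a minimum number of rows. The function names, the `image` table, and the default of 1000 minimum rows are hypothetical and not taken from the Openverse codebase; the row estimate is assumed to come from something like `pg_class.reltuples`.

```python
def sample_percentage(estimated_rows: int, target_rows: int, min_rows: int = 1000) -> float:
    """Return the percentage to pass to TABLESAMPLE SYSTEM.

    estimated_rows is assumed to be a cheap estimate, e.g. from
    SELECT reltuples::bigint FROM pg_class WHERE relname = '<table>'.
    """
    # Never aim for fewer rows than the configured minimum.
    wanted = max(target_rows, min_rows)
    # Small or unanalyzed tables (such as in integration tests): take everything.
    if estimated_rows <= 0 or wanted >= estimated_rows:
        return 100.0
    return min(100.0, 100.0 * wanted / estimated_rows)


def build_sample_query(table: str, estimated_rows: int, target_rows: int,
                       min_rows: int = 1000) -> str:
    """Build a SELECT that samples roughly target_rows rows from table.

    The table name is interpolated directly, so it must be a trusted
    identifier, never user input.
    """
    pct = sample_percentage(estimated_rows, target_rows, min_rows)
    return f"SELECT * FROM {table} TABLESAMPLE SYSTEM ({pct:.4f})"


if __name__ == "__main__":
    # Hypothetical numbers: about one million rows, aiming for about 10,000.
    print(build_sample_query("image", 1_000_000, 10_000))
    # A tiny test table falls back to sampling 100% of rows.
    print(build_sample_query("image", 500, 10_000))
```

Because SYSTEM sampling picks whole blocks, the number of rows returned is only approximate, which fits the requirement here that an exact count is not needed.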
Alternatives
Additional context
Implementation
- 🙋 I would be interested in implementing this feature.
Metadata
Status
🗑 Discarded