Description
opened on Sep 21, 2023
Problem
Just a thought -- currently popularity refreshes run by media type, once a month for all providers of that type. By definition a popularity refresh recalculates the popularity constants, and then updates all the standardized popularity scores using that new value. The reason the constant needs to be updated is that it's calculated based on a percentile value of the raw scores (e.g. "the 80th percentile number of views for images on Flickr"), and this number necessarily changes as more records are ingested.
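To make the dependence on ingested data concrete, here is a minimal sketch of a percentile-based constant. The normalization `raw / (raw + constant)` and the constant formula are assumptions made for illustration (chosen so that a record sitting exactly at the percentile value gets a score equal to the percentile); the function names are hypothetical and are not taken from the Openverse codebase.

```python
import math


def percentile_value(raw_scores, percentile):
    """Nearest-rank percentile of the raw scores (e.g. view counts)."""
    if not raw_scores:
        raise ValueError("raw_scores must not be empty")
    ordered = sorted(raw_scores)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]


def popularity_constant(raw_scores, percentile=0.8):
    """Assumed formula: scale the percentile value so that a record at
    that value receives a standardized score equal to `percentile`."""
    value = percentile_value(raw_scores, percentile)
    return ((1 - percentile) / percentile) * value


def standardized_popularity(raw, constant):
    """Assumed normalization mapping a raw score into [0, 1)."""
    if raw + constant == 0:
        return 0.0
    return raw / (raw + constant)


# Illustrative numbers only, not real provider data.
views = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
constant = popularity_constant(views, percentile=0.8)
score_at_percentile = standardized_popularity(80, constant)
```

Under these assumptions, ingesting new records shifts `percentile_value`, which shifts the constant, which in turn changes every standardized score for that provider. That is why a refresh has to touch all of the provider's rows.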
It seems reasonable to expect that as we ingest more records for a provider, the percentile value (and thus the constant) will change less and less. A newly added provider's popularity scores may swing wildly as we backfill data (or as the provider itself grows), but those of a very large, established provider within Openverse are less likely to do so. On a related note, the larger a provider is within Openverse, the longer its batched update takes to run.
So these things go hand in hand: a large provider like Flickr takes a long time to update, so we only want to do it monthly. But we're also okay with that, since we expect the constant to change very little. Conversely, a smaller provider is super fast to update, and its constant might change much more dramatically.
Description
It would be really nice if we could configure a different refresh schedule for each provider. Totally just throwing out the first idea I had here, so we'll want to consider other options, but we could:
- Add some kind of schedule column to the `<media>_popularity_metrics` table, so each provider defines its own schedule there
  - Alternatively, in the DAG decide the schedule based on the number of records each provider has
- Update the popularity refresh DAGs to run weekly, but only run the refresh for large providers like Flickr if it's the first run of the month
  - Would be nice to also have some kind of `force_full_refresh` flag to let us force a full refresh any time during the month.
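A rough sketch of how the weekly-run idea could select providers, assuming the schedule is derived from record counts. Everything here is hypothetical: `ProviderInfo`, `LARGE_PROVIDER_THRESHOLD`, the helper names, and the record counts are illustrative only. Only the `force_full_refresh` flag name comes from the proposal above. The sketch also assumes the DAG runs on a fixed weekday, so the first run of a month always falls within its first seven days.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProviderInfo:
    name: str
    record_count: int


# Hypothetical cutoff above which a provider is only refreshed monthly.
LARGE_PROVIDER_THRESHOLD = 100_000_000


def is_first_weekly_run_of_month(run_date: date) -> bool:
    """With one run per week on a fixed weekday, the first run of a
    month is the one whose day-of-month is 7 or lower."""
    return run_date.day <= 7


def providers_to_refresh(providers, run_date, force_full_refresh=False):
    """Return provider names to refresh on this run: every provider on
    the first run of the month (or when forced), otherwise only the
    providers below the size threshold."""
    full = force_full_refresh or is_first_weekly_run_of_month(run_date)
    return [
        p.name
        for p in providers
        if full or p.record_count < LARGE_PROVIDER_THRESHOLD
    ]


# Illustrative record counts, not real figures.
providers = [
    ProviderInfo("flickr", 500_000_000),
    ProviderInfo("small_provider", 10_000),
]
first_run = providers_to_refresh(providers, date(2023, 9, 4))
later_run = providers_to_refresh(providers, date(2023, 9, 18))
forced_run = providers_to_refresh(
    providers, date(2023, 9, 18), force_full_refresh=True
)
```

If the schedule lived in a column on the metrics table instead, the size check would be replaced by a lookup of each provider's configured schedule, and the rest of the selection logic could stay the same.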
Metadata
Status
📋 Backlog