
Make popularity refresh schedule configurable by provider #3051

Open

Description

Problem

Just a thought: popularity refreshes currently run per media type, once a month, for all providers of that type. By definition, a popularity refresh recalculates the popularity constants and then updates all the standardized popularity scores using the new constants. A constant needs to be updated because it's calculated from a percentile value of the raw scores (e.g. "the 80th percentile number of views for images on Flickr"), and this number necessarily changes as more records are ingested.
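To make the dependency on ingested records concrete, here is a minimal sketch of a percentile-based constant. The function names, the nearest-rank percentile, and the formulas `constant = ((100 - p) / p) * percentile_value` and `score = raw / (raw + constant)` are assumptions for illustration, not copied from the catalog code; the real calculation lives in the database and may differ.

```python
def percentile_value(raw_scores, percentile):
    """Nearest-rank percentile of the raw scores (percentile is an int, 1-100)."""
    if not raw_scores:
        raise ValueError("need at least one raw score")
    ordered = sorted(raw_scores)
    # Ceiling division using integers only, to avoid float rounding issues.
    rank = max(1, -(-percentile * len(ordered) // 100))
    return ordered[rank - 1]


def popularity_constant(raw_scores, percentile=80):
    """Assumed formula: scale the percentile value so that a record sitting
    exactly at that percentile gets a standardized score of percentile / 100."""
    return ((100 - percentile) / percentile) * percentile_value(raw_scores, percentile)


def standardized_popularity(raw_score, constant):
    """Assumed formula: map a non-negative raw score into the range [0, 1)."""
    if raw_score + constant == 0:
        return 0.0
    return raw_score / (raw_score + constant)


# Hypothetical view counts for one provider.
views = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
constant = popularity_constant(views)

# Ingesting more (and more popular) records moves the percentile value,
# so the constant, and every standardized score derived from it, changes.
constant_after_ingest = popularity_constant(views + [200, 300, 400, 500, 600])
```

Under these assumptions, every existing record's standardized score is stale as soon as the constant moves, which is why a refresh has to rewrite all of a provider's rows.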

It seems reasonable to expect that as we ingest more records for a provider, the percentile value (and thus the constant) will change less and less. A newly added provider's popularity scores may swing wildly as we backfill data (or as the provider itself grows), but those of a very large provider that is well established within Openverse are less likely to do so. On a related note, the larger a provider is within Openverse, the longer its batched update takes to run.

So these things go hand in hand: a large provider like Flickr takes a long time to update, so we only want to do it monthly. But we're also okay with that, since we expect the constant to change very little. Conversely, a smaller provider is super fast to update, and its constant might change much more dramatically.

Description

It would be really nice if we could configure a different refresh schedule for each provider. Totally just throwing out the first idea I had here, so we'll want to consider other options, but we could:

  • Add some kind of schedule column to the <media>_popularity_metrics table, so each provider defines its own schedule there
    • Alternatively, in the DAG decide the schedule based on the number of records each provider has
  • Update the popularity refresh DAGs to run weekly, but only run the refresh for large providers like Flickr if it's the first run of the month
    • Would be nice to also have some kind of force_full_refresh flag to let us force a full refresh any time during the month.
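The weekly-run idea above could be sketched roughly as follows. This is only an illustration of the selection logic, assuming the DAG runs once a week on a fixed weekday; `LARGE_PROVIDER_THRESHOLD`, `providers_to_refresh`, and the record counts are hypothetical names and numbers, and a real DAG would read `force_full_refresh` from its run configuration rather than a function argument.

```python
from datetime import date

# Hypothetical cutoff: providers with at least this many records are "large".
LARGE_PROVIDER_THRESHOLD = 10_000_000


def is_first_weekly_run_of_month(run_date):
    """For a DAG that runs once a week on a fixed weekday, exactly one run
    per month falls within the first seven days of that month."""
    return run_date.day <= 7


def providers_to_refresh(record_counts, run_date, force_full_refresh=False,
                         threshold=LARGE_PROVIDER_THRESHOLD):
    """Return the providers whose popularity should be refreshed on this run.

    Small providers are refreshed on every weekly run. Large providers are
    refreshed only on the first run of the month, or when a full refresh
    is forced.
    """
    refresh_large = force_full_refresh or is_first_weekly_run_of_month(run_date)
    return sorted(
        provider
        for provider, count in record_counts.items()
        if count < threshold or refresh_large
    )


# Hypothetical record counts per provider.
counts = {"flickr": 500_000_000, "small_provider": 10_000}

mid_month = providers_to_refresh(counts, date(2024, 1, 15))
first_week = providers_to_refresh(counts, date(2024, 1, 3))
forced = providers_to_refresh(counts, date(2024, 1, 15), force_full_refresh=True)
```

Using a record-count threshold covers the "decide in the DAG" alternative; with a schedule column on the metrics table, the `count < threshold` check would instead compare against each provider's stored schedule.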

Metadata

    Labels

    • ✨ goal: improvement (Improvement to an existing user-facing feature)
    • 💻 aspect: code (Concerns the software code in the repository)
    • 🟩 priority: low (Low priority and doesn't need to be rushed)
    • 🧱 stack: catalog (Related to the catalog and Airflow DAGs)

    Projects

    • Status

      📋 Backlog
