Handle duplication of records between auckland_museum and wikimedia #3659

Open

Problem

As noted by @sarayourfriend in this comment, many records from the Auckland Museum's collection are already in Openverse due to their inclusion in Wikimedia Commons. If we run both DAGs and do nothing to address this, these records will be duplicated in Openverse.

Description

Suggestion taken directly from Sara's comment:

Either we'd need to suppress the entries from Wikimedia Commons, or, (probably my preference) improve our ingestion of Wikimedia Commons to be able to identify sources like this in Wikimedia Commons. Glancing at the Wikimedia Commons provider script, I don't think we currently save the "collections" metadata present in the file summary on the Wikimedia Commons page.

I think this is a big opportunity to expand the list of high quality sources without introducing duplicates, and while cleaning up the Wikimedia Commons data ingestion, cleanup, and overall handling. For this institution in particular, there is a great page describing how the metadata is structured: https://commons.wikimedia.org/wiki/Commons:Batch_uploading/AucklandMuseumCCBY

The same information would also be relevant for the National Gallery of Art (#3167) (see this Wikimedia result, which is in Openverse with similarly poorly handled metadata and is in a NGA collection in Wikimedia Commons's data). I imagine there are a handful of other such institutions that we could add, just by improving the Wikimedia Commons script and our handling of their data.

And actually, when digging through Wikimedia Commons and Wikidata pages researching this comment, I found this amazing spreadsheet that would help us identify these exact kinds of institutions, for Wikimedia Commons, Europeana, Flickr, and even TROVE (#2653): https://docs.google.com/spreadsheets/d/1WPS-KJptUJ-o8SXtg00llcxq0IKJu8eO6Ege_GrLaNc/edit#gid=1216556120
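To make the preferred option above more concrete, here is a minimal sketch of how collection or category metadata from a Wikimedia Commons record might be mapped to an institutional source. Everything in it is an assumption for illustration: the pattern strings, the source names (`auckland_museum`, `nga`), the default `wikimedia` value, and the `resolve_source` helper are hypothetical and are not taken from the existing provider script.

```python
from __future__ import annotations

# Hypothetical mapping from a substring of a Commons category/collection
# title to an Openverse source name. These patterns are placeholders and
# have not been verified against real Commons category names.
COLLECTION_PATTERNS: dict[str, str] = {
    "auckland museum": "auckland_museum",
    "national gallery of art": "nga",
}

# Assumed fallback source for records with no recognised collection.
DEFAULT_SOURCE = "wikimedia"


def resolve_source(categories: list[str]) -> str:
    """Return the first institutional source whose pattern matches a
    category title (case-insensitive), or the default source otherwise."""
    for title in categories:
        normalized = title.lower()
        for pattern, source in COLLECTION_PATTERNS.items():
            if pattern in normalized:
                return source
    return DEFAULT_SOURCE


if __name__ == "__main__":
    # Hypothetical category lists for two records.
    print(resolve_source(["Category:Images from Auckland Museum"]))
    print(resolve_source(["Category:Unrelated topic"]))
```

With a source resolved this way, either option from the comment becomes a filter on one field: records resolved to an institution could be relabelled with that source during Wikimedia Commons ingestion, or skipped there so that the dedicated DAG remains the only place they are ingested.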

Additional context

The auckland_museum DAG is currently blocked on other issues (see DAG Status page), but this issue should not necessarily prevent us from turning the DAG on.

However, we should not add the provider as a source in the API until this has been resolved.

Metadata

Assignees

No one assigned

    Labels

    ✨ goal: improvement (Improvement to an existing user-facing feature)
    💻 aspect: code (Concerns the software code in the repository)
    🟨 priority: medium (Not blocking but should be addressed soon)
    🧱 stack: catalog (Related to the catalog and Airflow DAGs)

    Type

    No type

    Projects

    • Status

      📋 Backlog

    Milestone

    No milestone
