opened on Jan 12, 2024
Problem
As noted by @sarayourfriend in this comment, many records from the Auckland Museum's collection are already in Openverse due to their inclusion in Wikimedia Commons. If we run both DAGs and do nothing to address this, these records will be duplicated in Openverse.
Description
Suggestion taken directly from Sara's comment:
Either we'd need to suppress the entries from Wikimedia Commons, or, (probably my preference) improve our ingestion of Wikimedia Commons to be able to identify sources like this in Wikimedia Commons. Glancing at the Wikimedia Commons provider script, I don't think we currently save the "collections" metadata present in the file summary on the Wikimedia Commons page.
I think this is a big opportunity to expand the list of high quality sources without introducing duplicates, and while cleaning up the Wikimedia Commons data ingestion, cleanup, and overall handling. For this institution in particular, there is a great page describing how the metadata is structured: https://commons.wikimedia.org/wiki/Commons:Batch_uploading/AucklandMuseumCCBY
The same information would also be relevant for the National Gallery of Art (#3167) (see this Wikimedia result, which is in Openverse with similarly poorly handled metadata and is in a NGA collection in Wikimedia Commons's data). I imagine there are a handful of other such institutions that we could add, just by improving the Wikimedia Commons script and our handling of their data.
And actually, when digging through Wikimedia Commons and Wikidata pages researching this comment, I found this amazing spreadsheet that would help us identify these exact kinds of institutions, for Wikimedia Commons, Europeana, Flickr, and even TROVE (#2653): https://docs.google.com/spreadsheets/d/1WPS-KJptUJ-o8SXtg00llcxq0IKJu8eO6Ege_GrLaNc/edit#gid=1216556120
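To make the second option in the comment above more concrete, here is a minimal sketch of how a record's source could be derived from the collection or category names attached to a Wikimedia Commons file. It is not taken from the existing provider script. The function names, the category strings, the `nga` source name, and the `wikimedia` default are all assumptions for illustration, and the real category names would need to be checked against the batch upload page linked above.

```python
from typing import Dict, Iterable, Optional

# Hypothetical mapping from a Commons category name to the Openverse source
# it should be attributed to. The category strings are placeholders, NOT
# verified Wikimedia Commons category names.
CATEGORY_TO_SOURCE: Dict[str, str] = {
    "placeholder category for auckland museum uploads": "auckland_museum",
    "placeholder category for national gallery of art uploads": "nga",
}

# Assumed fallback source for records that match no known institution.
DEFAULT_SOURCE = "wikimedia"


def normalize_category(name: str) -> str:
    """Drop a leading 'Category:' prefix, treat underscores as spaces,
    collapse repeated whitespace, and lowercase the result."""
    name = name.strip()
    if name.lower().startswith("category:"):
        name = name[len("category:"):]
    return " ".join(name.replace("_", " ").split()).lower()


def resolve_source(
    categories: Iterable[str],
    mapping: Optional[Dict[str, str]] = None,
) -> str:
    """Return the source for the first category found in the mapping,
    or DEFAULT_SOURCE when none of the categories is recognized."""
    mapping = CATEGORY_TO_SOURCE if mapping is None else mapping
    for category in categories:
        source = mapping.get(normalize_category(category))
        if source is not None:
            return source
    return DEFAULT_SOURCE
```

With a lookup like this, records that arrive through Wikimedia Commons could be attributed to the originating institution at ingestion time, which would also make it possible to spot overlap with a dedicated DAG such as auckland_museum before the same item is stored twice.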
Additional context
The auckland_museum DAG is currently blocked on other issues (see DAG Status page), but this issue should not necessarily prevent us from turning the DAG on.
However, we should not add the provider as a source in the API until this has been resolved.
Status: 📋 Backlog