One-off backfill script for Walters Museum data #1416

Open

Description

We need a one-off crawling script that runs against the data stored in the catalog for the provider waltersartmuseum and collects the height, width, filesize, and filetype of each image that is still available. This information will be used to update the catalog records and is required for the data normalization steps we want to perform (see #1545 and #1485).

A polite crawler that visits every url value we have for Walters data should be able to accomplish this. I don't believe HEAD requests alone will be enough: the response headers give us the file size and type (Content-Length and Content-Type), but we need the image data itself to determine height and width:

In [1]: import requests

In [2]: r = requests.head("https://static.thewalters.org/images/PS1_37.1513_Fnt_DD_T11.jpg")

In [4]: r.headers
Out[4]: {'Content-Length': '471138', 'Content-Type': 'image/jpeg', 'Last-Modified': 'Mon, 18 Nov 2013 23:36:26 GMT', 'Accept-Ranges': 'bytes', 'ETag': '"1cc97fffb6e4ce1:0"', 'Server': 'Microsoft-IIS/7.5', 'X-Powered-By': 'ASP.NET', 'Date': 'Wed, 05 Oct 2022 21:26:36 GMT'}
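
To make the shape of the work concrete, here is a minimal sketch of what the crawl could look like. It is only an illustration under stated assumptions, not the planned implementation: the names `image_size`, `fetch_metadata`, and `crawl`, the user agent string, and the one-second delay are all hypothetical, and it uses only the standard library so it stays self-contained. A real script would more likely lean on an imaging library instead of the hand-rolled PNG/JPEG header parsing shown here.

```python
import struct
import time
import urllib.request


def image_size(data: bytes):
    """Return (filetype, width, height) for PNG or JPEG bytes, else None."""
    # PNG: 8-byte signature, then the IHDR chunk holds width and height.
    if data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR":
        width, height = struct.unpack(">II", data[16:24])
        return "png", width, height
    # JPEG: walk the marker segments until a start-of-frame (SOF) marker.
    if data[:2] == b"\xff\xd8":
        pos = 2
        while pos + 9 <= len(data):
            if data[pos] != 0xFF:
                return None  # lost sync with the segment structure
            marker = data[pos + 1]
            if marker == 0xFF:  # fill byte before a marker, skip it
                pos += 1
                continue
            # SOF markers are 0xC0-0xCF, except DHT (C4), JPG (C8), DAC (CC).
            if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
                height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
                return "jpeg", width, height
            (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
            pos += 2 + length
    return None


def fetch_metadata(url, user_agent="walters-backfill-sketch/0.1"):
    """GET the full image (HEAD is not enough) and derive the four fields."""
    # The user agent value is a placeholder assumption.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        data = response.read()
    parsed = image_size(data)
    if parsed is None:
        return None
    filetype, width, height = parsed
    return {
        "url": url,
        "filetype": filetype,
        "width": width,
        "height": height,
        "filesize": len(data),
    }


def crawl(urls, delay_seconds=1.0, fetch=fetch_metadata):
    """Yield (url, record) pairs, pausing between requests to stay polite."""
    for url in urls:
        try:
            record = fetch(url)
        except OSError:  # urllib's URLError and socket timeouts subclass OSError
            record = None  # image no longer available or unreachable
        yield url, record
        time.sleep(delay_seconds)  # assumed delay; tune to what the host tolerates
```

Calling `crawl([...])` with the url values from the catalog would yield one record per image, with `None` for images that can no longer be fetched or parsed. Passing `fetch` as a parameter keeps the rate-limited loop testable without touching the network.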

Additional context

The rationale for this decision can be found on our Make WP blog: Next steps for Walters Art Museum data.

Implementation

  • 🙋 I would be interested in implementing this feature.
