Update Science Museum ingester with API changes #4105

Merged
merged 5 commits into from
Apr 17, 2024

Conversation

stacimc
Collaborator

@stacimc stacimc commented Apr 12, 2024

Fixes

Fixes #4092 by @AetherUnbound

Description

Updates the ScienceMuseum ingester class to work with the changed API.

Testing Instructions

Tests should pass. Run the Science Museum DAG locally and observe that records are ingested.

I also downloaded the tsv from MinIO and compared it to the last pre-changes production tsv from January to make sure the data looks good.
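
A comparison like the one described above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for this PR: it assumes both TSVs have a header row and that empty cells appear as an empty string or `\N`; the function names and the threshold are hypothetical.

```python
import csv
import io
from collections import Counter


def empty_value_rates(tsv_file):
    """Return {column: fraction of rows in which that column is empty}."""
    # Assumption: the first row is a header and fields are not quoted.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    empty = Counter()
    total = 0
    for row in reader:
        total += 1
        for column, value in row.items():
            if column is None:
                continue  # extra cells beyond the header are ignored
            # Assumption: "" and "\N" both mean "no value".
            if value is None or value.strip() in ("", "\\N"):
                empty[column] += 1
    if total == 0:
        return {}
    return {column: empty[column] / total for column in reader.fieldnames}


def columns_with_more_gaps(old_rates, new_rates, threshold=0.1):
    """List columns whose empty rate rose by more than `threshold`."""
    return sorted(
        column
        for column, rate in new_rates.items()
        if rate - old_rates.get(column, 0.0) > threshold
    )


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two files being compared.
    old_tsv = io.StringIO("title\turl\nA\thttps://example.org/a.jpg\n")
    new_tsv = io.StringIO("title\turl\nA\t\n")
    print(columns_with_more_gaps(empty_value_rates(old_tsv), empty_value_rates(new_tsv)))
```

With real files, `open(path, newline="")` would replace the `io.StringIO` objects. A column that suddenly becomes mostly empty after an API change is a quick signal that a field moved or was renamed in the response.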

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 12, 2024
@stacimc stacimc self-assigned this Apr 12, 2024
@stacimc stacimc requested a review from a team as a code owner April 12, 2024 21:25
@stacimc stacimc requested review from krysal and obulat April 12, 2024 21:25
Collaborator

@AetherUnbound AetherUnbound left a comment

This looks good, thanks so much for jumping on that so quickly! I was able to run an ingestion locally; however, it looks like the URLs that we ingested are all giving me AccessDenied errors 😞 Here's a sample:

openledger> select title, url, foreign_landing_url from image where provider = 'sciencemuseum' limit 10;
+------------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| title                                          | url                                                                                | foreign_landing_url                                                                                                  |
|------------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------|
| Excavated neolithic flint scraper              | https://coimages.sciencemuseumgroup.org.uk/images/330/750/large_a634842__0003_.jpg | https://collection.sciencemuseumgroup.org.uk/objects/co106398/excavated-neolithic-flint-scraper-scrapers             |
| Votive intestine                               | https://coimages.sciencemuseumgroup.org.uk/images/709/335/large_a635751__0005_.jpg | https://collection.sciencemuseumgroup.org.uk/objects/co83260/votive-intestine-votive-viscera                         |
| Roughly cylindrical sandstone mortar           | https://coimages.sciencemuseumgroup.org.uk/images/442/761/large_smg00201374.jpg    | https://collection.sciencemuseumgroup.org.uk/objects/co131060/roughly-cylindrical-sandstone-mortar-mortars           |
| Votive right hand                              | https://coimages.sciencemuseumgroup.org.uk/images/458/333/large_a73036__0002_.jpg  | https://collection.sciencemuseumgroup.org.uk/objects/co82968/votive-right-hand-votive-hand                           |
| Cautery, bronze, Roman, from Sforza collection | https://coimages.sciencemuseumgroup.org.uk/images/347/896/large_smg00190800.jpg    | https://collection.sciencemuseumgroup.org.uk/objects/co87137/cautery-bronze-roman-from-sforza-collection-cautery     |
| Bronze coin                                    | https://coimages.sciencemuseumgroup.org.uk/images/661/559/smg00015483__0001_.jpg   | https://collection.sciencemuseumgroup.org.uk/objects/co83841/bronze-coin-coins                                       |
| Glass unguent bottle, Roman, 151 to 300 AD     | https://coimages.sciencemuseumgroup.org.uk/images/362/329/large_smg00187927.jpg    | https://collection.sciencemuseumgroup.org.uk/objects/co90128/glass-unguent-bottle-roman-151-to-300-ad-unguent-bottle |
| Probe with flat end and olive end, bronze      | https://coimages.sciencemuseumgroup.org.uk/images/347/880/large_smg00190784.jpg    | https://collection.sciencemuseumgroup.org.uk/objects/co88326/probe-with-flat-end-and-olive-end-bronze-probe-medical  |
| Votive heart(?), terracotta, probably Roman    | https://coimages.sciencemuseumgroup.org.uk/images/458/341/large_a635759__0001_.jpg | https://collection.sciencemuseumgroup.org.uk/objects/co83268/votive-heart-terracotta-probably-roman-votive-viscera   |
| Votive placenta                                | https://coimages.sciencemuseumgroup.org.uk/images/237/659/large_a114889__0001_.jpg | https://collection.sciencemuseumgroup.org.uk/objects/co83676/votive-placenta-votive-viscera                          |
+------------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+

@stacimc stacimc marked this pull request as draft April 15, 2024 22:38
Collaborator Author

stacimc commented Apr 15, 2024

@AetherUnbound It looks like the URL format changed, so it was breaking only in the instances where we were building the full URL ourselves. This means most of our production URLs are currently broken as well, which I was able to confirm 😬 However, since this DAG is not dated, we should be able to just run it and pick up the fixes on the next data refresh.
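
To illustrate the failure mode described above (a full URL assembled by hand from a base and a path), here is a minimal sketch of a more defensive builder. It is not the code from this PR: `BASE_IMAGE_URL` and `build_image_url` are hypothetical names, and the base is a placeholder rather than the Science Museum's real image host.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical placeholder; the real base URL and path layout are not shown here.
BASE_IMAGE_URL = "https://images.example.org/"


def build_image_url(location, base_url=BASE_IMAGE_URL):
    """Return an absolute image URL, or None when no location is given.

    If the API already returned an absolute URL, pass it through untouched
    rather than gluing a base onto it; only relative paths get the base.
    """
    if not location:
        return None
    if urlparse(location).scheme in ("http", "https"):
        return location
    # Strip a leading slash so the path is appended to the base as-is.
    return urljoin(base_url, location.lstrip("/"))
```

Passing absolute URLs through unchanged means a later switch by the provider from relative paths to full URLs would not silently produce malformed links.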

However, I noticed #4013 again while testing this locally, so I've reopened that issue.

@stacimc stacimc marked this pull request as ready for review April 15, 2024 23:23
Collaborator

@AetherUnbound AetherUnbound left a comment

Confirmed the new URLs work!

Contributor

obulat commented Apr 17, 2024

Thank you for working on this, I'll be testing the PR locally now. Just wanted to confirm that the images are currently broken in prod:
Screenshot 2024-04-17 at 6 28 28 PM
:(

Member

@krysal krysal left a comment

LGTM. Good catch on the failed URLs!

I left other suggestions but am not blocking on them, since this PR is mainly aimed at reactivating the DAG and it achieves that as it is.

Comment on lines +157 to +160
if not (maker := attributes.get("creation", {}).get("maker", [])):
    return None

return maker[0].get("summary", {}).get("title", None)
Member

There are a lot of "Unknown maker" values in the creator column (the test data is a good example). I think it would be more accurate to leave them as NULL instead.

Suggested change
- if not (maker := attributes.get("creation", {}).get("maker", [])):
-     return None
- return maker[0].get("summary", {}).get("title", None)
+ if not (maker := attributes.get("creation", {}).get("maker", [])):
+     return None
+ creator = maker[0].get("summary", {}).get("title", None)
+ return creator if creator != "Unknown maker" else None
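
For reference, the suggested change can be exercised as a standalone function. This is a sketch only: the name `get_creator` and the sample records are hypothetical, and the shape of `attributes` is inferred from the snippet under review.

```python
def get_creator(attributes):
    """Return the first maker's title, or None if missing or "Unknown maker"."""
    # An absent or empty "maker" list means the source gave no maker at all.
    if not (maker := attributes.get("creation", {}).get("maker", [])):
        return None
    creator = maker[0].get("summary", {}).get("title", None)
    # Treat the placeholder string as "no creator" so the column stays NULL.
    return creator if creator != "Unknown maker" else None


if __name__ == "__main__":
    unknown = {"creation": {"maker": [{"summary": {"title": "Unknown maker"}}]}}
    named = {"creation": {"maker": [{"summary": {"title": "Example Maker"}}]}}
    print(get_creator(unknown), get_creator(named), get_creator({}))
```

The comparison is exact, so a differently cased or padded placeholder would still be stored as a creator.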

Contributor

This makes sense, because most other creators are also unknown here :)

Collaborator Author

Hm, I can't decide -- theoretically there could be a difference between "no maker information was provided by the source" and "an authoritative (museum) source confirmed the maker is unknown". But maybe that's not a useful distinction. I would be curious if we do something similar for any of our other sources 🤔

I'll make a separate issue for this, mostly because it looks like we do have "Unknown maker" in production data at the moment.

Collaborator Author

#4145 created!

Contributor

@obulat obulat left a comment

Tested locally and could see the ingested images again:
Screenshot 2024-04-17 at 7 00 52 PM

@stacimc stacimc merged commit b5a1ae0 into main Apr 17, 2024
39 checks passed
@stacimc stacimc deleted the update/science-museum-api branch April 17, 2024 19:00
Collaborator Author

stacimc commented Apr 17, 2024

I'm going to re-enable the Science Museum DAG and kick off a new DAG run if one doesn't start automatically. I'll also make an issue to re-enable the provider once a full DAG run has succeeded and a data refresh has completed.

Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Update Science Museum DAG to use new API response format
4 participants