Update Science museum urls #4276

Merged
@stacimc merged 7 commits into main from update/science-museum-check-urls on May 9, 2024

@stacimc (Collaborator) commented May 7, 2024

Fixes

Fixes #4261 by @stacimc

Description

Hopefully this should be the final change needed to get Science Museum back in shape after the API updates in #4105.

The problem is that even though we're now formatting URLs properly:

  • We have about 20k records that are not being reingested (possibly because we have to skip some batches due to an error with the provider API), and whose URLs are therefore not reformatted
  • Even when the URLs are reformatted, some of them still return a 403

We don't want to ingest records with broken links. So this PR does two things:

  • Updates the Science Museum DAG to discard records whose URL cannot be reached
  • Adds a one-time maintenance DAG which will be used to update and validate all of our existing records, including ones that are not being picked up by ingestion

This DAG will go back and correctly format all existing URLs, then check whether each URL is reachable. Since I don't know how many records will turn out to be unreachable, I have opted not to have the DAG delete these records automatically; it only records their identifiers in a new table. Once the DAG runs, we'll be able to see how many are invalid and either investigate further or manually delete them using the recorded IDs.
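
As a rough illustration of the reformat-then-check pass described above, here is a minimal Python sketch. It is not the code in this PR. The function names, the rule of stripping the first '/images/' path segment, and treating only a 200 response as reachable are all assumptions made for illustration. The status lookup is passed in as a parameter so the logic can be exercised without network access.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def reformat_url(url: str) -> str:
    """Drop the first '/images/' path segment.

    Assumption: this is the only formatting fix needed; the real rule may differ.
    """
    return url.replace("/images/", "/", 1)


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the status code of a HEAD request, or None on a network failure."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 403.
        return err.code
    except OSError:
        # DNS failures, refused connections, timeouts, and similar.
        return None


def check_records(
    records: Iterable[Tuple[str, str]],
    get_status: Callable[[str], Optional[int]] = http_status,
) -> Tuple[Dict[str, str], List[str]]:
    """Reformat every (identifier, url) pair and collect unreachable identifiers.

    Assumption: a record counts as reachable only if its URL returns 200.
    Nothing is deleted; invalid identifiers are only collected for later review.
    """
    fixed_urls: Dict[str, str] = {}
    invalid_ids: List[str] = []
    for identifier, url in records:
        new_url = reformat_url(url)
        fixed_urls[identifier] = new_url
        if get_status(new_url) != 200:
            invalid_ids.append(identifier)
    return fixed_urls, invalid_ids
```

Passing a stub such as `lambda url: 403` as `get_status` puts every identifier in the invalid list. The sketch returns that list, whereas the DAG described here records the identifiers in a table.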

Testing Instructions

Run the Science Museum DAG and ensure that it still works. Let it ingest a few hundred records and then mark it as a success.

To test the maintenance DAG, we need at least one invalid record. You can create one manually by opening a database shell with just catalog/pgcli and then running:

-- Update a record's URL to a url that is incorrectly formatted (contains '/images/'), AND
-- is known to 403 even when formatted correctly.
UPDATE image
SET url='https://coimages.sciencemuseumgroup.org.uk/images/206/868/large_cd0568_011_091005_2009_43_2_NEC_mobile_phone.jpg'
WHERE identifier IN (
   SELECT identifier
   FROM image
   WHERE provider='sciencemuseum'
   LIMIT 1
);

Then run the update_science_museum_urls DAG and verify that everything passes and one record id is added to the science_museum_invalid_ids table.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc added the 🟧 priority: high, 🛠 goal: fix, 💻 aspect: code, and 🧱 stack: catalog labels May 7, 2024
@stacimc self-assigned this May 7, 2024
@stacimc requested review from a team as code owners May 7, 2024 00:43

github-actions bot commented May 7, 2024

Full-stack documentation: https://docs.openverse.org/_preview/4276

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@sarayourfriend removed their request for review May 7, 2024 06:02

@sarayourfriend (Collaborator) commented

I'm unassigning myself from review of this. I think I only got pinged because of the docs change, but it's just catalog, so left Krystle and Madison in to reflect that. If that doesn't work for y'all for any reason, let me know and I can review this PR.

@krysal (Member) left a comment

LGTM! The reasoning behind the changes makes sense.

@AetherUnbound (Collaborator) left a comment

Thank you for jumping on this, and it's great that the DAG for fixing it is so simple! I have a few notes, including some concern about the URLs produced for testing. I was able to run the ingestion DAG just fine locally, along with the cleanup DAG (both with records to clean up and without).

@AetherUnbound (Collaborator) left a comment

LGTM!! Thanks for making those changes 🚀

@stacimc merged commit cf9e6a8 into main May 9, 2024
46 checks passed
@stacimc deleted the update/science-museum-check-urls branch May 9, 2024 22:19
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Some Science Museum records continue to have invalid URLs
4 participants