-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Science museum urls #4276
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4276 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
I'm unassigning myself from review of this. I think I only got pinged because of the docs change, but it's just catalog, so left Krystle and Madison in to reflect that. If that doesn't work for y'all for any reason, let me know and I can review this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! The reasoning behind the changes makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for jumping on this, and it's great that the DAG for fixing it is so simple! I have a few notes, including some concern about the URLs produced for testing. I was able to the ingestion DAG just fine locally, along with the cleanup DAG (both with records to clean up and without).
catalog/tests/dags/providers/provider_api_scripts/test_science_museum.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!! Thanks for making those changes 🚀
Fixes
Fixes #4261 by @stacimc
Description
Hopefully this should be the final change needed to get Science Museum back in shape after the API updates in #4105.
The problem is that even though we're now formatting URLs properly:
We don't want to ingest records with broken links. So this PR does two things:
This DAG will go back and correctly format all existing urls, and then check to see if the url is reachable. Since I don't know how many records will turn out to be unreachable, I have not opted to make the DAG automatically delete these records, just record their identifiers in a new table. Once the DAG runs, we'll be able to see how many are invalid and either investigate further or manually delete them using the recorded ids.
Testing Instructions
Run the Science Museum DAG and ensure that it still works. Let it ingest a few hundred records and then mark it as a success.
To test the maintenance DAG we will need to make sure we have at least one record which is invalid. You can do this manually by running
just catalog/pgcli
and then:Then run the
update_science_museum_urls
DAG and verify that everything passes and one record id is added to thescience_museum_invalid_ids
table.Checklist
Update index.md
).main
) or a parent feature branch.just catalog/generate-docs
for catalogPRs) or the media properties generator (
just catalog/generate-docs media-props
for the catalog or
just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin