Description
Opened on May 3, 2024
#4105 updated the Science Museum DAG after some provider API changes. Among other things, the URLs for all of our records needed to be updated, because the URLs on all production records had become invalid. The difference generally looks like:
# Correct url
f"https://coimages.sciencemuseumgroup.org.uk/{url}"
# Old, invalid url. Note the /images/
f"https://coimages.sciencemuseumgroup.org.uk/images/{url}"
After fixing this in the DAG, we were able to run the DAG fully in production, skipping only 5 batches due to an error on the provider side (see #4013). However, 27,810 of the 122,194 total production Science Museum records (roughly 23%) still have the incorrectly formatted URL.
openledger> select count(*) from image where provider='sciencemuseum' and url ilike '%/images/%';
+-------+
| count |
|-------|
| 27810 |
+-------+
openledger> select count(*) from image where provider='sciencemuseum';
+--------+
| count  |
|--------|
| 122194 |
+--------+
I selected the first 10 of these and manually tested them. All 10 URLs fail as currently formatted, and only four of them were fixed by simply removing the /images/ part of the path. Therefore I don't think it's sufficient to just run a batched_update that alters the paths that way.
This also leads me to fear that some of the URLs which were updated by the DAG may still be invalid.
We need to investigate further to see why some aren't working, to understand if there's a better way to build the URLs, and then to update our production data.
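As a starting point for that investigation, the manual check above could be scripted roughly as follows. This is only a sketch: the helper names (`strip_images_segment`, `head_ok`, `triage`) are hypothetical and not part of the DAG, and it assumes that a successful HEAD request is a good enough proxy for "the URL works", which may not hold for this provider.

```python
from typing import Callable, Dict, Iterable, List
from urllib.request import Request, urlopen

BASE = "https://coimages.sciencemuseumgroup.org.uk/"
LEGACY_PREFIX = BASE + "images/"


def strip_images_segment(url: str) -> str:
    """Drop the legacy /images/ segment that directly follows the host."""
    if url.startswith(LEGACY_PREFIX):
        return BASE + url[len(LEGACY_PREFIX):]
    return url


def head_ok(url: str, timeout: float = 10.0) -> bool:
    """Assumed reachability check: True if a HEAD request succeeds."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        # URLError/HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def triage(
    urls: Iterable[str],
    is_reachable: Callable[[str], bool] = head_ok,
) -> Dict[str, List[str]]:
    """Sort URLs by whether they work as-is, work after the rewrite, or neither."""
    result: Dict[str, List[str]] = {
        "works_as_is": [],
        "fixed_by_strip": [],
        "still_broken": [],
    }
    for url in urls:
        candidate = strip_images_segment(url)
        if is_reachable(url):
            result["works_as_is"].append(url)
        elif candidate != url and is_reachable(candidate):
            result["fixed_by_strip"].append(url)
        else:
            result["still_broken"].append(url)
    return result
```

Running `triage` over a larger sample of the 27,810 affected URLs, and over a sample of URLs the DAG already rewrote, would show how large the `still_broken` group is before any batched update is attempted. The `is_reachable` parameter is injectable so the logic can be exercised without hitting the provider.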
Status: ✅ Done