Some Science Museum records continue to have invalid URLs #4261

Closed

Description

#4105 updated the Science Museum DAG after some provider API changes. Among other things, the URLs for all our records needed to be updated, because every production record's URL had become invalid. The difference generally looks like:

# Correct url
f"https://coimages.sciencemuseumgroup.org.uk/{url}"

# Old, invalid url. Note the /images/
f"https://coimages.sciencemuseumgroup.org.uk/images/{url}"
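
As a rough illustration of the difference above, the naive rewrite could be sketched as follows. The function and constant names are hypothetical and not taken from the DAG, and the sketch assumes the stray `/images/` segment always appears at the start of the URL path. It is not a confirmed fix for every record.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical constant: the path prefix seen in the old, invalid URLs.
LEGACY_PREFIX = "/images/"


def strip_legacy_images_prefix(url: str) -> str:
    """Return `url` without a leading /images/ path segment.

    URLs whose path does not start with /images/ are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.path.startswith(LEGACY_PREFIX):
        return url
    # Keep the leading slash and drop only the "images/" segment.
    new_path = "/" + parts.path[len(LEGACY_PREFIX):]
    return urlunsplit(parts._replace(path=new_path))
```

Because only a leading segment is removed, calling the function twice gives the same result as calling it once on any URL that has a single `/images/` prefix.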

After fixing this in the DAG, we were able to run the DAG fully in production, skipping only 5 batches due to an error on the provider side (see #4013). However, 27,810 of the 122,194 total production Science Museum records (roughly 23%) still have the incorrectly formatted URL.

openledger> select count(*) from image where provider='sciencemuseum' and url ilike '%/images/%';
+-------+
| count |
|-------|
| 27810 |
+-------+

openledger> select count(*) from image where provider='sciencemuseum';
+--------+
| count  |
|--------|
| 122194 |
+--------+

I selected the first 10 of these and manually tested them. All 10 URLs failed as currently formatted, and only four of them were fixed by simply removing the /images/ part of the path. Therefore, I don't think it's sufficient to just run a batched_update that alters the paths that way.
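
To repeat that manual check on a larger sample, something like the following sketch might help. Everything here is an assumption for illustration: `triage`, `head_ok`, and the bucket names are hypothetical, and whether the image host answers HEAD requests has not been verified. The reachability check is passed in as a parameter so the classification logic can be tried without network access.

```python
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable, Iterable


def head_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` gets a 2xx response.

    Assumption: the image host supports HEAD requests.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError, ValueError):
        return False


def triage(urls: Iterable[str], is_reachable: Callable[[str], bool]) -> Counter:
    """Count URLs that work as-is, work after the rewrite, or stay broken."""
    counts: Counter = Counter()
    for url in urls:
        # Naive rewrite: drop the first "/images/" occurrence only.
        candidate = url.replace("/images/", "/", 1)
        if is_reachable(url):
            counts["works_as_is"] += 1
        elif candidate != url and is_reachable(candidate):
            counts["fixed_by_rewrite"] += 1
        else:
            counts["still_broken"] += 1
    return counts
```

Calling `triage(sample_urls, head_ok)` on a sample pulled from the `image` table would give a count for each bucket. A large `still_broken` count would support the concern that a plain path rewrite is not enough.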

This also makes me worry that some of the URLs that were updated by the DAG may still be invalid.

We need to investigate further to see why some URLs aren't working, to understand whether there's a better way to build them, and then to update our production data.