Description
openedon Oct 26, 2023
Description
While working on #3262, I noticed that the way we're setting foreign identifiers in the Metropolitan DAG is not guaranteed to be unique:
def _get_foreign_id(self, object_id: int, image_url: str) -> str:
unique_identifier = image_url.split("/")[-1].split(".")[0]
return f"{object_id}-{unique_identifier}"
The problem is that some Metropolitan urls have periods in the filename, example:
https://images.metmuseum.org/CRDImages/dp/original/1972.173%20RECTO.jpg
The foreign identifier that's generated for this is "45734-1972", using only the first part of the filename. Since we include the object id as part of the foreign identifier, the only way this would lead to duplicate foreign ids is if an object has multiple related images that start with the same thing (ie, if this image had a related image titled "1972.foo.jpg". This will raise an error and cause the DAG to fail.
Reproduction
You can locally call _get_foreign_id
with the example URL I gave to verify the behavior.
Additional context
The problem is that if we change the way foreign_identifier
is determined, we'll overwrite the foreign id for existing records on reingestion. If we decide to do that, we need to first test and confirm that this does not result in those records being added as new records during reingestion (rather than correctly overwriting the old ones). Otherwise, we'll need a way to identify those orphaned records and remove them.
Consequently, I don't think we should work on this issue until we actually see it cause an error in production. It's possible the condition will never practically be hit.
Metadata
Assignees
Labels
Type
Projects
Status
📋 Backlog