Skip to content

Metropolitan foreign identifiers not guaranteed to be unique #3263

Open

Description

Description

While working on #3262, I noticed that the way we're setting foreign identifiers in the Metropolitan DAG is not guaranteed to be unique:

def _get_foreign_id(self, object_id: int, image_url: str) -> str:
        unique_identifier = image_url.split("/")[-1].split(".")[0]
        return f"{object_id}-{unique_identifier}"

The problem is that some Metropolitan urls have periods in the filename, example:

https://images.metmuseum.org/CRDImages/dp/original/1972.173%20RECTO.jpg

The foreign identifier that's generated for this is "45734-1972", using only the first part of the filename. Since we include the object id as part of the foreign identifier, the only way this would lead to duplicate foreign ids is if an object has multiple related images that start with the same thing (ie, if this image had a related image titled "1972.foo.jpg". This will raise an error and cause the DAG to fail.

Reproduction

You can locally call _get_foreign_id with the example URL I gave to verify the behavior.

Additional context

The problem is that if we change the way foreign_identifier is determined, we'll overwrite the foreign id for existing records on reingestion. If we decide to do that, we need to first test and confirm that this does not result in those records being added as new records during reingestion (rather than correctly overwriting the old ones). Otherwise, we'll need a way to identify those orphaned records and remove them.

Consequently, I don't think we should work on this issue until we actually see it cause an error in production. It's possible the condition will never practically be hit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    • Status

      📋 Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions