
How we handle data integrity issues across the stack #292

Closed

Description

Due date: 2022-10-18

I'm not sure what a due date would be for this. Two weeks? What do y'all think?

Is there anyone who has a comprehensive enough view to describe the process by which data integrity issues should be fixed? I am very concerned by the implications of the information in this discussion.

Sentiments like "no part of the stack guarantees the shape of its data" concern me because we control every part of the stack in question. We have complete control over the shape of data in the Catalogue (up until the point of actually receiving the information from the provider), and the API itself acts as yet another layer of transformation that I would hope would have a stable, guaranteed surface. Something like licensing information, given that it's a central aspect of why Openverse exists at all, seems like the bare minimum of what one should expect to be supplied in full by the API at all times and to exist in the catalogue.
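
To make that concrete, the snippet below is roughly the kind of guarantee I'd expect the API layer to be able to uphold for licensing fields. This is only a sketch: the field names (`license`, `license_version`, `license_url`), the exception type, and where such a check would hook in are assumptions for illustration, not references to our actual serializer code.

```python
# Sketch of a minimal "shape guarantee" for licensing fields at the API layer.
# Field names and the exception type are illustrative assumptions, not real code.

REQUIRED_LICENSE_FIELDS = ("license", "license_version", "license_url")


class DataIntegrityError(Exception):
    """Raised when a record violates a guarantee we claim to uphold."""


def assert_license_shape(record: dict) -> dict:
    """Refuse to serve a record whose licensing information is incomplete."""
    missing = [field for field in REQUIRED_LICENSE_FIELDS if not record.get(field)]
    if missing:
        raise DataIntegrityError(
            f"Record {record.get('identifier')} is missing license fields: {missing}"
        )
    return record
```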

Perhaps this also comes down, in part, to API version guarantees, which we are also murky on.

Does anyone have a holistic view of how we are meant to solve data integrity issues? If the frontend is broken due to an issue in the data returned by the API, do we take a step back and ask, "is the API returning this data because it itself is misbehaving, or is the data it's working off of incorrect (i.e., not properly ingested in the catalogue)"?
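
If it would help to make that check more routine, here's a rough sketch of the comparison I have in mind: fetch the record from the API and diff the fields we care about against the corresponding catalogue row. The endpoint URL is a placeholder and the catalogue lookup is left to whatever access we actually have; none of this is meant to reflect our real API surface or schema.

```python
# Rough diagnostic sketch for the "where did the data go bad?" question.
# The endpoint path is a placeholder, and the catalogue record is assumed to
# have been fetched separately (schema access varies, so it's a parameter here).
import requests

API_URL = "https://api.example.org/v1/images/{identifier}/"  # placeholder URL


def fetch_api_record(identifier: str) -> dict:
    response = requests.get(API_URL.format(identifier=identifier), timeout=10)
    response.raise_for_status()
    return response.json()


def compare_fields(api_record: dict, catalogue_record: dict, fields: list[str]) -> dict:
    """Report fields where the API response diverges from the catalogue row.

    An empty report suggests the bad value was ingested upstream; a non-empty
    one points at the API's own transformation layer.
    """
    return {
        field: (catalogue_record.get(field), api_record.get(field))
        for field in fields
        if catalogue_record.get(field) != api_record.get(field)
    }
```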

I am worried that we don't actually ask this question enough, or potentially we do ask it but then don't document the fixes that need to be made upstream. One aspect of this, I think, is the inherent latency of issues as they flow from the catalogue to the API to the frontend. The frontend, right now, is very easy to deploy fixes to. The API is a little harder, with a wide range depending on the severity of the change (migrations, column changes, etc.). The catalogue is easy to deploy DAG fixes to, but changes in the shape of the data may not make it downstream for at least a week, if not longer, depending on other issues in our data flow.

Do we need a more coordinated effort across the different parts of the stack to fix data integrity issues? How do we decide when and where to apply patches? If we hot fix things in the frontend (as is done in the PR linked above), should we do so with the expectation that the issue will be fixed upstream, whether the root cause is in the API or the catalogue? Should those hot fixes in the frontend or API eventually be slated for removal once the upstream issues they're compensating for get fixed?
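
To make that expectation concrete, here's a hedged sketch of what a "temporary" compensating fix could look like at the API layer, with an explicit pointer to the upstream work so it doesn't silently become permanent. The function name, field, and default value are made up for illustration and aren't a proposal for an actual policy.

```python
# Hedged sketch of a compensating fix that is explicitly expected to be removed
# once the upstream (catalogue) fix lands. The field and default are illustrative.


def normalize_license_version(record: dict) -> dict:
    """Temporary patch: backfill a missing license_version at serve time.

    TODO(openverse): remove once the catalogue data for this provider is
    re-ingested (tracked in a hypothetical upstream issue).
    """
    if not record.get("license_version"):
        record["license_version"] = "unknown"  # placeholder default, not a real policy
    return record
```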

My gut tells me that a process something like "hot fix it wherever gets the fix out soonest to the broadest set of clients" is probably the better answer, which would mean more work in the API than in the frontend to fix these things. Depending on the effort and complexity, it may be beneficial to move the fix down into the frontend temporarily. Always moving data integrity fixes up into the catalogue seems like a natural assumption, but I don't know whether it's one we're actually making. Maybe we are and I'm just not connecting the dots. In some sense, a monorepo would really help here, as we could have a single issue that tracks the fix as it makes its way through the application layers and gets patched, promoted, and removed once things are solidified upstream.

What do y'all think about this? Are there improvements to the process we can follow here?
