Replies: 1 comment 3 replies
-
The idea behind raising an error in the MediaStore is to make sure that issues with existing provider scripts come to our attention, instead of being missed (which happens very easily if it is only logged). In some scenarios, for example, it might help us identify a bug with the provider API, or we might be able to find a suitable fallback value. In the event that we can't, we should still handle it in the provider. Adding that error in definitely means that we're risking finding errors in the existing DAGs, but that's theoretically a good thing. This 100% could definitely be really annoying if it results in errors hours into a DagRun, but almost all of our DAGs can be restarted from the point of failure using |
Beta Was this translation helpful? Give feedback.
-
I learned about the recent code change to raise an error if the provider script passes
None
as a value for one of the required parameters while working on #1883.It seems that the
MediaStore
was initially designed to be very permissive and to silently drop the invalid rows of media. This can be seen in the way it treats saving the items: if theget_image
method returns a valid row, it is saved, and the media count is updated. Otherwise, it is just silently skipped over. I can see the reason for this behavior in @stacimc 's recent comment about one of the ingestions failing several hours into the process. For the established and tested provider scripts, I think this is a valid approach: to not waste hours of work for a single item that might be invalid. We do drop many media items in the ingestion process anyways when they don't have the properties that we expect.I understand the value of throwing an error when we are running a new provider script, but it seems prohibitively expensive to do so for the established scripts. To ensure that it doesn't break the ingestion, we would need to refactor all provider scripts to ensure that all of the items have required non-falsy values in each provider script (some of this work has been done in #1883). Previously, those were checked for None values only to short-circuit the ingestion and to not spend time on ostensibly invalid media items.
My preferred approach would be to check if the required values are not falsy in the
MediaStore
, and log a debug statement if a falsy value is passed, but continue the ingestion process for the vast majority of other items for the same provider.Some other things I've thought of were to return a dataclass from the
get_record_data
in the provider scripts. This would validate the values and not return anything if the required values are missing. Or to add a check in the provider data ingester after theget_record_data
returns. With the main goal being to enable ingestion to run even when there are some inconsistencies in the provider API.@krysal, @stacimc, I've read your comments on the PR that added the Exception, so I'm really curious about your thoughts here
Beta Was this translation helpful? Give feedback.
All reactions