-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: modify catalog validation to allow variable interpolation #2516
Conversation
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Tangentially related question: would this help address kedro-org/kedro-plugins#124 ? |
@astrojuanlu Not directly, if we ever want to do this it will be likely via the resolver. |
Thanks for coding this up! I meant to do so last week but didn't get around to it. I like it and can't immediately think of any tricky edge cases where it wouldn't work as expected. Would be interested in hearing what @deepyaman thinks since he wasn't convinced by this proposal initially. Also, I guess @idanov might have comments since kedro could have originally adopted this scheme in the first place for YAML anchors (i.e. not explicitly require |
Can you give an example of this? I guess you're saying that anything with Only concern that comes to mind is, what if you want to have a variable like: csv_dataset:
type: pandas.CSVDataSet
save_args:
index: True
my_partitioned_dataset:
type: PartitionedDataset
filepath: data/03_primary/partitions
dataset: ${csv_dataset} I think this is a pretty likely use case, so how would we make sure
Pretty sure I'm not convinced by anything initially. I speak not as a devil's advocate, but as the devil himself (probably). But overall, other than the above edge case, I think this makes sense and I like the fact that using more OmegaConf out of the box. Maybe I was less convinced by some documents-based solution IIRC? |
So basically if you'd have a dataset called example_iris_data:
filepath: data/01_raw/iris.csv The error you'd see will now be:
Instead of:
This would be a very nice to have, but I wouldn't bother trying to get that to work in the first iteration of this feature.
This is a very good point and a use case where this solution fails. We'd have to add another heuristic to decide the validation for this..
😂 |
Thanks for the comments @deepyaman! I had thought of this same edge case and decided that actually it's not an issue at all. In the example you give, |
@AntonyMilneQB That setup gives the following error:
|
Maybe the logic should be any catalog entry that has @AntonyMilneQB @deepyaman |
Hmm yes, I didn't realise that I think the better way around this is to think of this The other way round this would be to not try and make catalog entries out of anything that's used as the target of a variable interpolation, but that sounds more awkward and probably has other edge cases associated with it. |
Even if it did work, I think finding unexpected catalog entries isn't a great behavior, and could also mess with some plugins.
As @AntonyMilneQB mentioned in his response, this wouldn't work for some datasets. We could build some sophisticated way to try and figure out if something is a dataset or not, but there's always a chance some edge case will come up if it's too fancy, and then it becomes a pain to investigate.
I can imagine you'll get a lot of questions about "why isn't it working (without underscore)", since people will not be used to changing the syntax for partitioned dataset. But this could work... Alternatively, how about allowing something like: csv_dataset:
type_: pandas.CSVDataSet # So that the inner dataset type doesn't get interpreted as a variable, we allow a form of escaping of the `type` field.
save_args:
index: True
my_partitioned_dataset:
type: PartitionedDataset
filepath: data/03_primary/partitions
dataset: ${csv_dataset} Honestly not sure if it's any better than specifying I would honestly not mind that much if the |
Ahh In earlier discussions in tech design people said they didn't like the underscore syntax, because it introduces special "kedro-omegaconf" syntax and users wouldn't be able to use the omegaconf docs for instructions. So that's why I was finding a different solution for making this work @deepyaman. I agree that a more sophisticated validation method would probably end up being hard to debug and I think you're right there will always be edge cases. So I guess in that case the new suggestion would be to do both the escaping of entries starting with an underscore and the modifications to the catalog validation introduced in this proposal PR. Do you agree? I personally would actually still be mostly in favour of just requiring the |
FYI I also spoke with @idanov who is in favour of the "for the first one - there's no new special syntax, it's simply what the Kedro |
My thoughts on this have not really changed since it was first discussed in tech design: personally I would prefer it if we made it possible for people to do this without the But we already have the
|
Thanks everyone for sharing your thoughts on this. I'm going to close this proposal PR and instead go ahead with an implementation that uses the |
Description
#2507
Development notes
omegaconf
variable interpolation doesn't work out of the box, because of the catalog validation that happens later down the line. When you have a catalog like this:The catalog validation will fail saying that the entry "data" doesn't have the 'type' key which is required for dataset entries. Obviously this entry isn't meant to be read or parsed as a dataset, because it's just the value that needs to be read into the template above.
This solution modifies the catalog validation to check if an entry contains 'type', if it does it will then be parsed as a dataset. Otherwise, it's skipped from that conversion.
Pros
omegaconf
syntax for variable interpolation in the catalog, no special hidden character syntax needed.Cons
Alternative solutions
Require template values to be preceded by a special character, e.g.
_
so they're not read as a dataset in the catalog validation. This is how yaml anchors work in theConfigLoader
andTemplatedConfigLoader
.You could then have a catalog like this:
Pros
Cons
omegaconf
docs to find out how variable interpolation works.FYI I didn't change any tests yet. This PR draft is initially meant to facilitate discussion.
Checklist
RELEASE.md
file