Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: modify catalog validation to allow variable interpolation #2516

Closed
wants to merge 1 commit into from

Conversation

merelcht
Copy link
Member

@merelcht merelcht commented Apr 13, 2023

Description

#2507

Development notes

omegaconf variable interpolation doesn't work out of the box, because of the catalog validation that happens later down the line. When you have a catalog like this:

example_iris_data:
  type: ${data}
  filepath: data/01_raw/iris.csv

data: pandas.CSVDataSet

The catalog validation will fail saying that the entry "data" doesn't have the 'type' key which is required for dataset entries. Obviously this entry isn't meant to be read or parsed as a dataset, because it's just the value that needs to be read into the template above.

This solution modifies the catalog validation to check if an entry contains 'type', if it does it will then be parsed as a dataset. Otherwise, it's skipped from that conversion.

Pros

  • Users can use the standard omegaconf syntax for variable interpolation in the catalog, no special hidden character syntax needed.
  • Variable interpolation in the catalog works exactly the same as for parameters

Cons

  • Validation is changed so it doesn't try to parse entries without the 'type' key. However, if a dataset is missing the 'type' key but is supposed to be parsed, that is now only detected after the run.

Alternative solutions

Require template values to be preceded by a special character, e.g. _ so they're not read as a dataset in the catalog validation. This is how yaml anchors work in the ConfigLoader and TemplatedConfigLoader.

You could then have a catalog like this:

_pandas:
  type: pandas.CSVDataSet

example_iris_data:
  type: ${_pandas.type}
  filepath: data/01_raw/iris.csv

Pros

  • No need to change catalog validation.

Cons

  • A concern with this implementation is that it introduces special Kedro syntax to enable core omegaconf functionality. Users would then not be able to just use the omegaconf docs to find out how variable interpolation works.
  • Variable interpolation would be different for the catalog and parameters, because for parameters no special character is needed.

FYI I didn't change any tests yet. This PR draft is initially meant to facilitate discussion.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht merelcht self-assigned this Apr 13, 2023
@astrojuanlu
Copy link
Member

Tangentially related question: would this help address kedro-org/kedro-plugins#124 ?

@noklam
Copy link
Contributor

noklam commented Apr 13, 2023

@astrojuanlu Not directly, if we ever want to do this it will be likely via the resolver.

@antonymilne
Copy link
Contributor

antonymilne commented Apr 13, 2023

Thanks for coding this up! I meant to do so last week but didn't get around to it. I like it and can't immediately think of any tricky edge cases where it wouldn't work as expected. Would be interested in hearing what @deepyaman thinks since he wasn't convinced by this proposal initially.

Also, I guess @idanov might have comments since kedro could have originally adopted this scheme in the first place for YAML anchors (i.e. not explicitly require _ prefix but instead just ignore entries which don't "look like" dataset entries as you do here). Maybe there's a good reason this solution wasn't chosen before? The validation still seems clear to me here even though the error message is pushed further down the kedro run.

@deepyaman
Copy link
Member

Validation is changed so it doesn't try to parse entries without the 'type' key. However, if a dataset is missing the 'type' key but is supposed to be parsed, that is now only detected after the run.

Can you give an example of this? I guess you're saying that anything with type is a dataset, and anything without is a variable that could be used for interpolation. I think that's fine. Maybe it makes sense to warn about unused variables (but may be too much work, without having access to the config loader).

Only concern that comes to mind is, what if you want to have a variable like:

csv_dataset:
  type: pandas.CSVDataSet
  save_args:
    index: True

my_partitioned_dataset:
  type: PartitionedDataset
  filepath: data/03_primary/partitions
  dataset: ${csv_dataset}

I think this is a pretty likely use case, so how would we make sure csv_dataset here gets treated as a variable?

Would be interested in hearing what @deepyaman thinks since he wasn't convinced by this proposal initially.

Pretty sure I'm not convinced by anything initially. I speak not as a devil's advocate, but as the devil himself (probably).

But overall, other than the above edge case, I think this makes sense and I like the fact that using more OmegaConf out of the box.

Maybe I was less convinced by some documents-based solution IIRC?

@merelcht
Copy link
Member Author

Validation is changed so it doesn't try to parse entries without the 'type' key. However, if a dataset is missing the 'type' key but is supposed to be parsed, that is now only detected after the run.

Can you give an example of this? I guess you're saying that anything with type is a dataset, and anything without is a variable that could be used for interpolation. I think that's fine.

So basically if you'd have a dataset called example_iris_data and forget to add the type:

example_iris_data:
  filepath: data/01_raw/iris.csv

The error you'd see will now be:

ValueError: Pipeline input(s) {'example_iris_data'} not found in the DataCatalog. Make sure that any dataset entry contains the 'type' key.

Instead of:

DataSetError: An exception occurred when parsing config for DataSet 'example_iris_data':
'type' is missing from DataSet catalog configuration

Maybe it makes sense to warn about unused variables (but may be too much work, without having access to the config loader).

This would be a very nice to have, but I wouldn't bother trying to get that to work in the first iteration of this feature.

Only concern that comes to mind is, what if you want to have a variable like:

csv_dataset:
  type: pandas.CSVDataSet
  save_args:
    index: True

my_partitioned_dataset:
  type: PartitionedDataset
  filepath: data/03_primary/partitions
  dataset: ${csv_dataset}

I think this is a pretty likely use case, so how would we make sure csv_dataset here gets treated as a variable?

This is a very good point and a use case where this solution fails. We'd have to add another heuristic to decide the validation for this..

Pretty sure I'm not convinced by anything initially. I speak not as a devil's advocate, but as the devil himself (probably).

😂

@antonymilne
Copy link
Contributor

antonymilne commented Apr 18, 2023

Thanks for the comments @deepyaman! I had thought of this same edge case and decided that actually it's not an issue at all. In the example you give, csv_dataset would be treated as a variable and a dataset definition. Hence you'd get both datasets my_partitioned_dataset and csv_dataset defined. The only problem here is that csv_dataset is a dataset that is unintentionally defined and won't be used anywhere in the pipeline... It's slightly untidy, but does it actually matter? I don't think it will cause any problems.

@merelcht
Copy link
Member Author

@AntonyMilneQB That setup gives the following error:

DataSetError:
__init__() missing 1 required positional argument: 'filepath'.
DataSet 'csv_dataset' must only contain arguments valid for the constructor of 'kedro_datasets.pandas.csv_dataset.CSVDataSet'.

@merelcht
Copy link
Member Author

merelcht commented Apr 18, 2023

Maybe the logic should be any catalog entry that has type and either filepath or path is a dataset and anything else is just used for interpolation?

@AntonyMilneQB @deepyaman

@antonymilne
Copy link
Contributor

antonymilne commented Apr 18, 2023

Hmm yes, I didn't realise that csv_dataset would be an invalid catalog entry still 🤔 Just checking for the filepath or path keys won't work with e.g. APIDataSet which doesn't have either so would be too strict.

I think the better way around this is to think of this ParitionedDataSet as a special case and just require people to use the underscore, like _csv_dataset, in this case. We always intended for it to be possible for people to still specify the _ to explicitly state that the key should be ignored if they wanted to - it's just that most of the time, it shouldn't be needed.

The other way round this would be to not try and make catalog entries out of anything that's used as the target of a variable interpolation, but that sounds more awkward and probably has other edge cases associated with it.

@deepyaman
Copy link
Member

Thanks for the comments @deepyaman! I had thought of this same edge case and decided that actually it's not an issue at all. In the example you give, csv_dataset would be treated as a variable and a dataset definition. Hence you'd get both datasets my_partitioned_dataset and csv_dataset defined. The only problem here is that csv_dataset is a dataset that is unintentionally defined and won't be used anywhere in the pipeline... It's slightly untidy, but does it actually matter? I don't think it will cause any problems.

Even if it did work, I think finding unexpected catalog entries isn't a great behavior, and could also mess with some plugins.

Maybe the logic should be any catalog entry that has type and either filepath or path is a dataset and anything else is just used for interpolation?

@AntonyMilneQB @deepyaman

As @AntonyMilneQB mentioned in his response, this wouldn't work for some datasets. We could build some sophisticated way to try and figure out if something is a dataset or not, but there's always a chance some edge case will come up if it's too fancy, and then it becomes a pain to investigate.

I think the better way around this is to think of this ParitionedDataSet as a special case and just require people to use the underscore, like _csv_dataset, in this case. We always intended for it to be possible for people to still specify the _ to explicitly state that the key should be ignored if they wanted to - it's just that most of the time, it shouldn't be needed.

I can imagine you'll get a lot of questions about "why isn't it working (without underscore)", since people will not be used to changing the syntax for partitioned dataset. But this could work...

Alternatively, how about allowing something like:

csv_dataset:
  type_: pandas.CSVDataSet  # So that the inner dataset type doesn't get interpreted as a variable, we allow a form of escaping of the `type` field.
  save_args:
    index: True

my_partitioned_dataset:
  type: PartitionedDataset
  filepath: data/03_primary/partitions
  dataset: ${csv_dataset}

Honestly not sure if it's any better than specifying _csv_dataset, but it's another option.


I would honestly not mind that much if the _ prefix were required for all variables (non-datasets?); I feel like it could be a case where explicit is better than implicit. I also think you can always look to later relax the _ restriction if this is something people complain about a lot, but I think it could actually just be a good convention to separate the variables and dataset definitions (without going as far as globals.yml).

@merelcht
Copy link
Member Author

merelcht commented Apr 18, 2023

Ahh APIDataSet😵‍💫 .. I like the idea of treating the PartionedDataset as a special case and requiring an underscore for those entries. @AntonyMilneQB and then just allowing users to use the underscore in general if they find that clearer for catalog entries that are not meant as datasets.

In earlier discussions in tech design people said they didn't like the underscore syntax, because it introduces special "kedro-omegaconf" syntax and users wouldn't be able to use the omegaconf docs for instructions. So that's why I was finding a different solution for making this work @deepyaman.

I agree that a more sophisticated validation method would probably end up being hard to debug and I think you're right there will always be edge cases.

So I guess in that case the new suggestion would be to do both the escaping of entries starting with an underscore and the modifications to the catalog validation introduced in this proposal PR. Do you agree?

I personally would actually still be mostly in favour of just requiring the _ prefix, because it involves the least changes and it definitely isn't breaking, where the change in catalog validation feel like it's technically breaking to me..

@merelcht
Copy link
Member Author

FYI I also spoke with @idanov who is in favour of the _ syntax as well and not changing the catalog validation. He feels that the objections to this syntax aren't really valid because:

"for the first one - there's no new special syntax, it's simply what the Kedro DataCatalog skips / doesn't skip and those rules are there regardless of OmegaConf. And for the second, still the same - the special rule of dropping _ works only for catalog [with all config loaders]. I think this is the easiest approach, will require the least amount of changes + no breaking changes."

@antonymilne
Copy link
Contributor

My thoughts on this have not really changed since it was first discussed in tech design: personally I would prefer it if we made it possible for people to do this without the _, but it's not a big deal to me. IMO the PartitionedDataSet is still an advanced/edge case which most people are unlikely to come across (you have to want to use PartitionedDataSet and do variable interpolation, both of which are somewhat advanced features), and we should base our syntax around what is most intuitive and easy for most of our users. That would be mean not requiring _ to be there explicitly (i.e. consistent with how parameters work).

But we already have the _ requirement so I'm very happy to keep it for now and then maybe we can relax it later. So basically I fully agree with this part of what @deepyaman said:

I would honestly not mind that much if the _ prefix were required for all variables (non-datasets?); I feel like it could be a case where explicit is better than implicit. I also think you can always look to later relax the _ restriction if this is something people complain about a lot

@merelcht
Copy link
Member Author

Thanks everyone for sharing your thoughts on this. I'm going to close this proposal PR and instead go ahead with an implementation that uses the _ character syntax. I'll wait for the changes @noklam is going to do for the OmegaConfigLoader to not interfere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants