Add "globals" functionality to OmegaConfigLoader
#2175
Comments
Continuing our discussion from #2122, I modified my take on the For me, there are still two open to-do's:
All feedback is welcome! More than happy to contribute this implementation too! |
A user confused by this: conversation in Slack. |
Adding another data point from Slack |
Issues:
Currently we require a custom config loader to solve these issues. I'm not sure why we never put the fix in 0.18.x, but I think it's worth bringing this issue up again. Bonus: this is a question that is asked frequently: |
Question @merelcht |
Summary of the Technical Design discussion on 5/4/2023: General comments:
Comments on the proposed solution above:
Follow up actions:
|
Trying to understand the summary of the discussion, but I think it is easier to come up with a generic "example" and describe its desired behaviour. Let's assume we have the following structure:
When we do e.g.
we don't want
Anyway, I hope we can use this "example" to clarify behaviour. |
This is a good point that we didn't discuss explicitly, but I think it's nice to achieve. |
The simple solution: assuming there is always a globals key in front of interpolations e.g. |
I would categorise it as a variant to the _ suffix |
Absolutely, but far easier to filter out! |
catalog.yml
example_iris_data:
  type: ${globals.data_type}
  filepath: data/01_raw/iris.csv

parameters.yml
train_fraction: 0.8
random_state: 2
target_column: ${globals.column}

globals.yml
globals:
  data_type: pandas.CSVDataSet
  column: species |
@merelcht, this isn’t exactly what I had in mind (although this might be cleaner). What I thought about doing was the following:

catalog.yml
example_iris_data:
  type: ${globals.data_type}
  filepath: data/01_raw/iris.csv

parameters.yml
train_fraction: 0.8
random_state: 2
target_column: ${globals.column}

globals.yml
data_type: pandas.CSVDataSet
column: species

And only during loading of globals.yml would the values be nested under a globals key:

globals_dict = {
    "globals": {
        "data_type": "pandas.CSVDataSet",
        "column": "species",
    }
} |
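The wrapping step proposed in this comment can be sketched in plain Python. This is only an illustration under stated assumptions, not Kedro's or OmegaConf's implementation: the dictionaries stand in for the loaded YAML files, and `wrap_globals`, `lookup` and `resolve` are hypothetical helper names.

```python
import re

# Hypothetical in-memory stand-ins for the three YAML files in this comment.
globals_yml = {"data_type": "pandas.CSVDataSet", "column": "species"}
catalog_yml = {
    "example_iris_data": {
        "type": "${globals.data_type}",
        "filepath": "data/01_raw/iris.csv",
    }
}
parameters_yml = {
    "train_fraction": 0.8,
    "random_state": 2,
    "target_column": "${globals.column}",
}

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def wrap_globals(raw):
    """Nest the file contents under a reserved "globals" key at load time."""
    return {"globals": raw}


def lookup(context, dotted_key):
    """Walk a dotted key such as "globals.column" through nested dicts."""
    node = context
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(value, context):
    """Recursively replace ${...} placeholders in strings, dicts and lists."""
    if isinstance(value, dict):
        return {k: resolve(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, context) for v in value]
    if isinstance(value, str):
        whole = _PLACEHOLDER.fullmatch(value)
        if whole:  # a lone placeholder keeps the type of the looked-up value
            return lookup(context, whole.group(1))
        return _PLACEHOLDER.sub(lambda m: str(lookup(context, m.group(1))), value)
    return value


globals_dict = wrap_globals(globals_yml)
catalog = resolve(catalog_yml, globals_dict)
parameters = resolve(parameters_yml, globals_dict)
```

Because the `globals` key is only added in memory, users would not have to write it in globals.yml themselves, while every interpolation that starts with `globals.` remains easy to recognise.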
Ah okay I see, thanks for clarifying! That's indeed a nicer solution. I'll do some more thinking about this 😄 |
I'm probably off-base with these comments, but adding on the off chance that I'm not. I'm imagining three scenarios:
My personal opinion based on prior experience w/ Kedro is that option 3 might overcomplicate things and result in unintended behavior when users don't fully understand what/how parameters are being shared/inherited across environments. I've run into an issue in the past with spark.yml when handing my kedro project off to an MLE to orchestrate with Airflow. In that case, we created a prod environment with the spark.yml deleted. When he attempted to set the spark configs on the cluster from Airflow, he was confused why none of his configs were taking effect. The reason turned out to be that his configs were being overwritten by the spark.yml in the base environment and we ended up modifying SparkHooks to prevent this override. |
@bgereke Thanks for your comments. I have a couple follow up questions.
Does that mean things like "sparks" should be excluded? And why do you need a global only for
When it's only file-based, I think it's obvious that local config should always win, i.e. local > base > globals. But which config should be overridden if one is using the command line
I agree 3. is confusing and it's hard to even figure out where the value is coming from. So overall you think there is a need for globals (similar to environment variables) and for environment-aware globals |
@noklam I'll see if I can answer your Qs:
Not totally sure what you're asking here, but I hadn't considered the option of templating spark.yml. I'm probably a little skeptical of whether that would be a good idea.
I was actually considering options 1 and 2 as mutually exclusive in my comment above. Either globals go in conf/globals.yml or conf/env/globals.yml but never both. Probably I'm not thinking enough about breaking changes for option 1 since globals.yml already lives inside environments. If that's the case, I'd vote for globals NOT being shared across environments but would need to better understand the use cases where this is needed and why those cases can't be solved by creating an environment-specific globals.yml.
I also think globals.yml should be overridable by --params by default. In fact, I already create a custom TemplatedConfigLoader to enable this today, as my primary use for globals.yml is to integrate with other systems like Airflow and Weights & Biases. Another thing to consider in these cases is whether or not to support nested config in globals.yml, since that can complicate integrating with those systems. |
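As a rough sketch of what "overridable by --params" could mean for nested globals, the following merges runtime values over file values key by key. It is not how Kedro or the TemplatedConfigLoader does it; `deep_merge` and the example values are hypothetical and chosen only for illustration.

```python
def deep_merge(base, override):
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical contents of globals.yml and of values passed at run time.
globals_from_file = {
    "tracking": {"project": "iris", "run_name": "default"},
    "bucket": "dev-bucket",
}
runtime_params = {"tracking": {"run_name": "airflow-run"}}

effective_globals = deep_merge(globals_from_file, runtime_params)
```

With a flat globals.yml a plain `dict.update` would be enough; the recursive merge is only needed once nested sections are allowed, which is the complication mentioned above.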
Thanks for clarifying, it makes sense as a mutually exclusive option!
What kind of configuration do you override usually? curious of what goes in the
Nested config for global will work for OmegaConfigLoader by default. Currently we have a |
Late to the party, just wanted to note that now we're saying that templating does not work for catalog files: kedro/docs/source/configuration/advanced_configuration.md Lines 239 to 241 in 41f03d9
However, it does work - I tested with this:

dataset1:
  type: pandas.CSVDataSet
  filepath: ${.metadata.location}
  metadata:
    location: train.csv
dataset2:
  type: ${dataset1.type}
  filepath: ${dataset1.filepath}

More precisely, at the moment it's not allowed to use templated variables from parameter files in catalog files (the scope of this issue). |
Summary of Technical Design Discussion on 13/07/2023
Solutions discussed:
Other Details:
Q: Configurable
Q: Where would
Q: Soft or hard merge common keys?
Q: Allow |
Closing in favour of the implementation ticket: #2794 |
Description
In the new OmegaConfigLoader, templating works out of the box for parameters, but not for catalog files. This is because the catalog has to have certain characteristics and is loaded through settings
kedro/kedro/framework/context/context.py, lines 283 to 295 in 8acfb2a
This also means that currently users cannot share "global" values between parameter and catalog configuration files.
Context
Users struggle with large catalog files, so we should make it possible for them to do templating with this new config loader. This issue only focuses on solving the need for a central place to store certain values: e.g. file paths, bucket names, constant values.
Questions
Possible Implementation
Require template values to be preceded by a special character, e.g. _, so they're not read as a dataset in the catalog validation. You could then have a catalog like this:
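The original catalog example is not preserved here. As a hypothetical illustration of the idea, the sketch below treats every top-level key starting with `_` as a template value and keeps it out of dataset validation; `split_template_values`, the `_bucket` key and its value are assumptions, not Kedro API.

```python
# Hypothetical catalog after YAML loading: "_bucket" is a template value,
# everything else is a dataset entry.
raw_catalog = {
    "_bucket": "s3://my-bucket",
    "example_iris_data": {
        "type": "pandas.CSVDataSet",
        "filepath": "${_bucket}/01_raw/iris.csv",
    },
}


def split_template_values(catalog, prefix="_"):
    """Separate template values (prefixed keys) from dataset entries."""
    templates = {k: v for k, v in catalog.items() if k.startswith(prefix)}
    datasets = {k: v for k, v in catalog.items() if not k.startswith(prefix)}
    return templates, datasets


templates, datasets = split_template_values(raw_catalog)
```

Only `datasets` would then be passed on to catalog validation, after the interpolations have been resolved against the full dictionary.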
Globals can be "activated" for each config file by adding "globals*" to the config patterns:
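The snippet that followed is not preserved here. A minimal sketch of what this could look like in a project's settings.py, assuming the loader is configured through `CONFIG_LOADER_ARGS` with a `config_patterns` mapping; the first three patterns per key are assumed defaults and only the added "globals*" entry comes from this proposal:

```python
# settings.py (sketch; the pattern lists other than "globals*" are assumed defaults)
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        # Adding "globals*" would make files such as globals.yml part of the
        # configuration that catalog and parameter files are resolved against.
        "catalog": ["catalog*", "catalog*/**", "**/catalog*", "globals*"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*", "globals*"],
    }
}
```

Leaving "globals*" out of a key's pattern list would keep globals switched off for that kind of config file.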
Possible Alternative
Add globals functionality similar to that of the TemplatedConfigLoader. globals_dict?