Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable overriding configuration via variable interpolation for all configurations other than just parameters for OmegaConfigLoader #2531

Closed
noklam opened this issue Apr 19, 2023 · 7 comments · Fixed by #3036
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@noklam
Copy link
Contributor

noklam commented Apr 19, 2023

Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

With #2467 we resolves OCL earlier and parameters propogate correctly for parameters. We cannot do the same for catalog or other configuration yet due to the problem of "residual" configurations. The problem is specific to OmegaConfigLoader, because it requires merging the dictionaries before resolving.

Take this simple example

# catalog.yml
my_dataset:
   path: ${base_path}/my_file.txt

Expected Result

kedro run --params base_path:my_bucket this should work and replace the value correctly.

Current Result

The value will not propagate for configurations other than parameters. This is because we need to merge aggregated_configs with runtime_params, which produce invalid catalog entries.

The compiled catalog may look like this

# catalog.yml
my_dataset:
  path: ${base_path}/my_file.txt


base_path: my_bucket # <-- This is not a dataset, ideally we need to remove it or bypass it.

Context

Why is this change important to you? How would you use it? How can it benefit other users?
This is essential to enable common use case like providing a bucket name for catalog via CLI.
kedro run --params base_path:my_bucket , with TemplatedConfigLoader it doesn't work either, but there are workarounds. In contrast, there is no workaround for OmegaConfigLoader.

Possible Implementation

(Optional) Suggest an idea for implementing the addition or change.

  1. Figure out a way to remove the residual configurations

The implementation should make sure above case will work correctly. It should also work for arbitrary configurations, i.e. config_loader["spark"] or config_loader["my_random_config"]as long as it's registered withconfig_patterns`

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

Detail context

Change 1: Resolves configs eagerly in ConfigLoader - only enable for parameters for now
config_loader[“parameters”] return Dict instead of DictConf
interpolation will works correctly
Change 2: Separate the runtime_params & meta_params - the name of “meta_params” need to be determined afterward. Some concerns around “runtime_params” works for “parameters” only, but “mete_params” works for everything and will confuse people

  1. Do Change 1 now - limited the scope to “parameters”
  2. Change 2 should be done, but requires some extra thinking - potentially interfering with the solution of enabling “globals” for OCL
  3. Change 2 should be done - would be great to do it in a non-breaking way. General agreement on the change, but unsure about what’s the correct name of this argument.
    Some altenative: “config_vars”, “meta_config”

Extended Discussion

How does “global” fit in the picture? It will be clearer when we finished Change 1
Antony thinks that with the validation change, it shouldn’t be matter even when the “residuals” are left in the config, Python config also don’t do anything special to extra configurations.
Ivan think this doesn’t solve the problem in a generic way, config_lodaer[“global”] is purposed, but this doesn’t solve all the problem because OmegaConfigLoader requires to merge dictionary before interpolating. This shouldn’t be a limitation, but will requires more work to implement a solution
Merel has some concern about the messaging of “meta_params”, will people really understand which argument to be used.
kedro run --params=base_path:s3_bucket will not work for catalog yet, but agree that it would be nice to support.

@merelcht
Copy link
Member

Needs to be completed after #2507

@ankatiyar
Copy link
Contributor

ankatiyar commented Aug 18, 2023

There's two (potentially more) solutions to this ticket -

Solution 1

I have a draft PR with this one - #2931
Suppose for runtime params (from the CLI) you get -

runtime_params = {"_base_path": "<some_other_path>", "random_state":5}

Catalog.yml -

cars:
  filepath: ${_base_path}
  type: pandas.CSVDataSet

_base_path: <some_path>

parameters.yml -

random_state : 45
split: 0.2
  • For parameters we merge with runtime params with OmegaConf.merge(). Doesn't really matter if extra unnecessary params make it to config_loader["parameters"] inside a session run.
  • For catalog (+ other config) :
    1. Keep track of the original keys in the config
    2. Merge config + runtime params with OmegeConf.merge()
    3. Resolve variable interpolation
    4. Filter out keys that start with '_' and keys that weren't in the original keys (might have been intended for parameters etc).
      This makes sure that you can't add any top level keys to the config through CLI commands.

Things to note :

  • You could overwrite other non top level keys through CLI params like cars.type even if it's not an interpolated variable.
  • No special syntax

Solution 2

Use a similar custom resolver approach to the globals implementation.

  • Introduce a "runtime_params" resolver
  • Have users explicitly specify they want this interpolation overwritten by run time params when available
    catalog.yml
cars:
  filepath: "${runtime_params:path, <some_default_path>}"
  type: pandas.CSVDataSet

Comments from @noklam on Slack

I now think that using a custom resolver may be a elegant solution. It comes with a few advantage:

  1. No need to escape from the _ for catalog
    the resolver can optionally takes a default, i.e. if it exists in runtime_params then override it, otherwise takes the default.
  2. More explicit name
  3. This way we can handle kedro run --params my_base_path=<some_path> elegantly without caring about the _ and merging issue.
  4. is align with Separate params to runtime_params and meta_params ticket, we used to suffer from the problems that “params” is overloaded for everything. We are still overloading params with this solution but at least we make the usage of it

Things to note:

  • How does this align with the merging of params we do for "parameters"? Users would not need to use the resolver in the parameters.yml but in other configs. Would this be hard to explain?

❓ If we go with this solution

  • Q: What should be the name of the resolver? runtime_params: or something else?

Tagging for feedback: @noklam @merelcht

@noklam
Copy link
Contributor Author

noklam commented Aug 18, 2023

One thought - what if we inject whatever exist in --params to global? This way we don't need an extra resolver.

Pro:

  • Easy implementation
  • No extra resolver

Con:

  • Mixed runtime_params and meta_params(but consistent with current approach)

@merelcht
Copy link
Member

I like the resolver solution, but also think it's important to have consistency between how runtime resolution happens for parameters and other config files. So if the decision is to go with the resolver I'd vouch for also implementing that for parameters.

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Aug 23, 2023

I really like the resolver approach, because it makes the approach less magic and more declarative with fine grained level. Eventually this will facilitate maitenance* because we can create helpers to show which variables are available for overriding (e.g. a CLI kedro config runtime which will return the list of runtime parameters)

I have a question though. How will the (potential) runtime_params resolver interact with the globals resolver? I guess these will often be used together: the globals resolver enable overriding from a config file and the runtime_params one from the CLI but they are two ways to achieve the same goal, e.g. overriding some configuration which is expected to change.

Should we expect users be able to do something like that:

cars:
  filepath: "${runtime_params:path, ${globals:path, <some_default_path>}}"
  type: pandas.CSVDataSet

The syntax feels a bit cluttered and hard to read for beginners.

Should we :

  1. create an extra resolver runtime_params_or_globals (name completely made up) which implements basically above combination and end up with 3 resolvers?
  2. merge both resolvers to end up with a single one which suppports both globals and runtime params at the same time?
  3. Eventually suggest how to do 1. in the documentation, but not implement it on the framework side, like you do with the spark hook?

@noklam
Copy link
Contributor Author

noklam commented Aug 23, 2023

@Galileo-Galilei I have the same thought before and I try to summarize @ankatiyar thought.

  • globals has 1 and only 1 entry point, which is the yaml file.
  • If it need to be override from CLI, use runtime_params

In earlier comment I have an idea that we can inject runtime_param to global so you can override if runtime_param. This is less explicit and bring us closer to the TemplatedConfigLoader where everything is called runtime_params.

I try to summarise the feature that we want (I think we need to improve the terminology too):

  1. Support overriding via CLI (any config, at least for params and catalog). i.e. kedro run --params base_path=my_s3_bucket, learning_rate=0.1 (This ticket)
  2. template variable that can be used across environment ( "globals" resolver)
  3. template variable that can be used within environment & same catagory (i.e. params or catalog) - OmegaConf variable interpolation

I think the question here is do we want to support overriding global via runtime_params, can the same be achieve with runtime_params already? IMO it's reasonable, but it also make it less explicit and harder to reason. (one or two extra hops to figure out where is the true value). Assuming we want to support this, there are many entry points that we can assign a value. They should be resolving in this order (Highest to Lowest):

  1. --param CLI (a.k.a. runtime_params)
  2. global resolver - i.e. {$global: value}
  3. variable interpolation
  4. environment variable

Essentially the more specific scope should takes higher priority.

We had a separate ticket that we explore using an extra CLI argument to make things more explicit, but in practice users may not really care about the difference.

@ankatiyar
Copy link
Contributor

Summary from discussion on 8/09/2023 with @merelcht @noklam @amandakys:

  • Implement the runtime_params: resolver solution.
  • Allow users to overwrite parameters the usual way through CLI but also allow using runtime_params: resolver
  • For non breaking release in 0.18.x
    • Don't allow nesting of globals and runtime_params -
    cars:
    filepath: "${runtime_params:path, ${globals:path, <some_default_path>}}"
    type: pandas.CSVDataSet
    
    • Don't allow the use of runtime_params: resolver in globals.yml
  • For breaking release 0.19.0
    • Consider lifting restriction on nested resolvers
    • Consider lifting restriction on using runtime_params: resolver in globals.yml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment