Track the globals parameters used in the DataCatalog when using the TemplatedConfigLoader #253

Open
nblumoe opened this issue Oct 20, 2021 · 8 comments
Assignees
Labels
enhancement New feature or request waiting-for-kedro The implementation of this feature is blocked by a ticket in kedro

Comments

@nblumoe

nblumoe commented Oct 20, 2021

Description

Allow parameters to be used in the catalog and track them to MLflow.

Context

Some data sources might be parameterised (e.g. via SQL SELECT * FROM my_data WHERE date = <DATE-PARAM>) and this should get tracked to MLflow too.

Possible Implementation

Instead of just checking for params usage on Nodes, kedro-mlflow would also need to track params being used elsewhere. Could it just track all params, regardless of where they are used?

I am not sure if kedro even allows such parameterised data sources in the catalog, so this might require an upstream change in kedro first.

@Galileo-Galilei Galileo-Galilei changed the title Catalog params tracking Track the globals parameters used in the DataCatalog when using the TemplatedConfigLoader Oct 20, 2021
@Galileo-Galilei Galileo-Galilei self-assigned this Oct 20, 2021
@Galileo-Galilei Galileo-Galilei added the enhancement New feature or request label Oct 20, 2021
@Galileo-Galilei
Owner

Hello @nblumoe, I have had the very same use case for a while and I have been thinking about how to make this possible, but it is quite hard for several reasons:

  • The catalog files are parsed manually using a regex (https://github.com/quantumblacklabs/kedro/blob/20f836695c2f1e72f262d1747e47b7b7352a4aa0/kedro/config/templated_config.py#L194), with anyconfig as a backend, to replace the ${...} tags in all documents matching the patterns. I must use the same patterns (and not just look for a catalog.yml file) to deal with multiple environments and the ability to split the catalog into many files and even folders. I would have to parse these files again and find out what has been modified and which DataSet each tag affects, because Kedro does not keep track of this information. This may be slow and add a lot of boilerplate code to the plugin, so I must be careful to avoid performance and maintenance issues.
  • It is possible and easy to log all "global" variables in mlflow using the _arg_dict attribute of the ConfigLoader. However, this may reduce the readability of the mlflow runs because it logs all your global variables, potentially including ones that are not used in your pipeline. For example, if global1 is used in pipeline1 and global2 in pipeline2, running kedro run --pipeline pipeline1 will log both global1 and global2 in your run even though global2 is not used in that pipeline, which is very confusing.
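
To make the first point concrete, here is a minimal sketch of what "parsing the files again" could look like once the raw catalog entries are loaded as a dict. It is not Kedro's implementation: the PLACEHOLDER regex is a simplified stand-in for the one linked above, and collect_tags, tags_by_dataset and the example entries are hypothetical names made up for illustration.

```python
import re
from typing import Any, Dict, Set

# Simplified pattern for ${name} tags. Assumption: Kedro's real regex
# is more permissive (e.g. default values), this one is illustrative only.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def collect_tags(value: Any) -> Set[str]:
    """Recursively collect the names found inside ${...} tags."""
    if isinstance(value, str):
        return set(PLACEHOLDER.findall(value))
    if isinstance(value, dict):
        return set().union(*(collect_tags(v) for v in value.values()))
    if isinstance(value, (list, tuple)):
        return set().union(*(collect_tags(v) for v in value))
    return set()  # numbers, booleans, None: nothing to substitute


def tags_by_dataset(raw_catalog: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Map each catalog entry name to the global names it references."""
    return {name: collect_tags(entry) for name, entry in raw_catalog.items()}


# Hypothetical, untemplated catalog content (not a complete catalog).
raw_catalog = {
    "my_data": {"sql": "select * from my_table where timestamp = ${timestamp}"},
    "other_data": {"filepath": "data/01_raw/other.csv"},
}

usage = tags_by_dataset(raw_catalog)
print(usage)
```

The sketch deliberately ignores the hard part described above, namely discovering and merging every catalog file across environments and folders with the same patterns the config loader uses.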

In a nutshell, I plan to address this in the future, but I have other priorities at the moment for release 0.8.0: I really want to improve model serving through the plugin since it seems to be a more requested feature. I can't give an exact timeline, but I don't expect this feature to be implemented for several months.

@nblumoe
Author

nblumoe commented Oct 21, 2021

Thanks for looking into this!

Does your first bullet point indicate that kedro should be able to handle params in the catalog? I haven't had any luck with this yet:

# parameters.yml
timestamp: 2021-10-13

# catalog.yml
# reduced to essential data, this is not a complete catalog entry
my_data:
  sql: select * from my_table where timestamp = ${params.timestamp}

${params.timestamp} doesn't get replaced in the catalog when the actual SQL query is executed.

@Galileo-Galilei
Owner

Galileo-Galilei commented Oct 25, 2021

Oh sorry, I thought you were already using this Kedro feature. The object you are looking for is the TemplatedConfigLoader. Once you have declared it in your hooks.py, you can create a globals.yml in your conf/<env> folder and reference its entries in the catalog:

# globals.yml <- this is what you are looking for
timestamp: 2021-10-13

# catalog.yml
# reduced to essential data, this is not a complete catalog entry
my_data:
  sql: select * from my_table where timestamp = ${timestamp} # You have the right syntax

The problem for mlflow tracking is that I do not want to log your entire globals.yml because it likely contains some parameters unrelated to your pipeline, so I'd like to log only the ones used in your current pipeline, but I don't know how to identify them.
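
As a rough illustration of the filtering I have in mind, assume we somehow already know which globals each catalog entry references and which datasets the current pipeline uses. Obtaining those two inputs is exactly the unsolved part, so select_used_globals and all the example values below are hypothetical.

```python
from typing import Any, Dict, Iterable, Set


def select_used_globals(
    globals_dict: Dict[str, Any],
    tags_by_dataset: Dict[str, Set[str]],
    pipeline_datasets: Iterable[str],
) -> Dict[str, Any]:
    """Keep only the globals referenced by datasets of the current pipeline."""
    used: Set[str] = set()
    for name in pipeline_datasets:
        # Datasets absent from the mapping (e.g. in-memory ones) use no globals.
        used |= tags_by_dataset.get(name, set())
    return {key: value for key, value in globals_dict.items() if key in used}


# Hypothetical inputs: the content of globals.yml, the globals each
# catalog entry references, and the datasets of the pipeline being run.
all_globals = {"timestamp": "2021-10-13", "unrelated_threshold": 0.5}
references = {"my_data": {"timestamp"}, "other_data": {"unrelated_threshold"}}
current_pipeline_datasets = ["my_data", "model_input"]

to_log = select_used_globals(all_globals, references, current_pipeline_datasets)
print(to_log)
```

Only timestamp would be logged for this run, because other_data is not part of the pipeline being run.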

@Galileo-Galilei
Owner

Some good news: after some trial and error, I think I have found a way to make it work.

However, to avoid migration costs, I will only implement this feature after kedro==0.18.0 and after migrating kedro-mlflow.

@Galileo-Galilei
Owner

I will implement this feature, but only after kedro moves to the OmegaConfigLoader in 0.19.

@Galileo-Galilei Galileo-Galilei added this to the 0.12.0 milestone Feb 8, 2023
@kalofolias

kalofolias commented Feb 27, 2023

Hello,
I have a similar feature request / use-case. I also need to track some specific parameters that are not inputs of nodes.

Problem

  1. Currently the only way to track parameters is "automatic logging" (correct me if I'm wrong).
  2. The only parameters tracked automatically are inputs of nodes.

Therefore:

We can't track a global as explained in FAQ re TemplatedConfigLoader unless it's an input parameter of a node (which is not always the case)

Use-case

I set a dataset selector in globals.yml (so it can be overridden from the command line). I want to track which dataset is used for this experiment (note: this is the only way to track a str; otherwise I would have used MlflowMetricDataSet tracking).

Current solution

I had to hack a bit:

  • create a dummy node with an input parameter that I want to track

Desired behaviour

It would be great if I could somehow set extra parameters (e.g. the ones set by globals that control the catalog) that are not necessarily inputs of nodes.

Example: Define a MlflowParameterDataSet ?

Alternatively: track all parameters in the catalog even if not used in a node?

@Galileo-Galilei
Owner

Hi @kalofolias, sorry for the late reply. I'd be really happy to make it work, because this annoys me too.

I just have not found a way to do it properly. A MlflowParameterDataSet will not really solve the problem because I don't see how we can make it log conditionally on the pipeline being run.

Tracking all the parameters does not seem to be the right default, but maybe I should add the possibility to "opt in" to this solution in case someone really wants it, since we have no other solution for now.
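
A minimal sketch of what such an opt-in could look like, assuming the globals are available as a (possibly nested) dict. The log_all_globals flag and both function names are hypothetical and not an existing kedro-mlflow option. Nested keys are flattened to dotted names because mlflow params are flat key/value pairs.

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten a nested dict into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def globals_to_log(globals_dict: Dict[str, Any], log_all_globals: bool = False) -> Dict[str, Any]:
    """Return every global as a flat dict, but only if the user opted in."""
    return flatten(globals_dict) if log_all_globals else {}


# Hypothetical content of a globals.yml file.
example_globals = {"timestamp": "2021-10-13", "db": {"schema": "raw", "table": "my_table"}}

default_params = globals_to_log(example_globals)
opted_in_params = globals_to_log(example_globals, log_all_globals=True)
print(opted_in_params)
```

The resulting flat dict could then be handed to mlflow.log_params inside an active run. Keeping the flag off by default preserves the current behaviour for everyone else.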

@Galileo-Galilei
Owner

Galileo-Galilei commented Aug 24, 2023

Current state:

@Galileo-Galilei Galileo-Galilei added the waiting-for-kedro The implementation of this feature is blocked by a ticket in kedro label Oct 28, 2023
@Galileo-Galilei Galileo-Galilei moved this from 🆕 New to 📋 Backlog in kedro-mlflow roadmap Oct 28, 2023
@Galileo-Galilei Galileo-Galilei moved this from 📋 Backlog to 🆕 New in kedro-mlflow roadmap Oct 28, 2023
@Galileo-Galilei Galileo-Galilei moved this from 🆕 New to ⛔ Blocked in kedro-mlflow roadmap Oct 28, 2023
@Galileo-Galilei Galileo-Galilei removed this from the 0.12.0 milestone Oct 29, 2024
Projects
Status: ⛔ Blocked
Development

No branches or pull requests

3 participants