Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCatalog]: Save/load catalog obtained from to_config #4330

Open
ElenaKhaustova opened this issue Nov 13, 2024 · 2 comments
Open

[DataCatalog]: Save/load catalog obtained from to_config #4330

ElenaKhaustova opened this issue Nov 13, 2024 · 2 comments
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Nov 13, 2024

Description

Implement a method to save and load catalog obtained from KedroDataCatalog.to_config() - #4330, so it could be used when running kedro project with kedro run.

It won't be part of the catalog API, as we should not force any specific saving/loading formats at the catalog level. So, we consider extending the config loader to handle that.

Blocked by #4326 and #4327

Context

#3932

Possible Implementation

Extend OmegaConfigLoader to save and load config obtained from KedroDataCatalog.to_config()

  1. Decide on what we do for Inconsistency when setting version via versioned flag and dataset parameter #4326 and Discrepancy between setting save_version via catalog constructor and when passing datasets #4327
    • Deprecate versioned flag and make version part of the dataset config / keep both options / do nothing;
    • Allow/prohibit multiple load versions;
  2. Implement separate save for OmegaConfigLoader
  3. Extend OmegaConfigLoader.__getitem__ to load files obtained from KedroDataCatalog.to_config(). Currently it outputs four dictionaries: catalog, credetials and load_versions, save_versions. Depending on how we solve step 1 - the output can change.
@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Nov 13, 2024

Do we really need this? This would add extra method and it is unclear to me that we want to specify a specific serialization formation. If we implement DataCatalog.to_config() as a dict[str,str], I guess all standard serialization format would work fine, and we would get rid of all problems due to serialiing a specific python object, because it would essentially become a textual representation of the catalog.

I don't have a preference for the serialization format, because I think it would depend on the use case : mlflow will likely force me to use cloudpickle, but I may want to store it as a plain yaml file in other situations (e.g. to be stored in a config folder to be reused by the kedro CLI), and I it feels very bad to let the catalog offer ways to save in several serialization formats.

@ElenaKhaustova
Copy link
Contributor Author

Do we really need this? This would add extra method and it is unclear to me that we want to specify a specific serialization formation. If we implement DataCatalog.to_config() as a dict[str,str], I guess all standard serialization format would work fine, and we would get rid of all problems due to serialiing a specific python object, because it would essentially become a textual representation of the catalog.

I don't have a preference for the serialization format, because I think it would depend on the use case : mlflow will likely force me to use cloudpickle, but I may want to store it as a plain yaml file in other situations (e.g. to be stored in a config folder to be reused by the kedro CLI), and I it feels very bad to let the catalog offer ways to save in several serialization formats.

The idea is to be able to save/load serialized config, so one can reuse it when running a pipeline. In other words, make it work when using kedro as a framework with kedro run.

It won't be part of the catalog API as like you mentioned we should not force any specific format or various formats at the catalog level. So we consider extending config loader to handle that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Status: In Progress
Development

No branches or pull requests

2 participants